#Computed torque control is a control scheme used in motion control in robotics.
@snippe475
ur government assigned gender for the day is the first thing u get when u click this link to a randomised wikipedia article. NO REROLLS . i am the trollsteineggje mountain in norway
#Computed torque control is a control scheme used in motion control in robotics.#It combines feedback linearization via a PID controller of the error with a dynamical model of the controlled robot.#is this a sign.#was the iterator gender envy post a premonition#what the fuck
On the Alternatives of Lyapunov’s Direct Method in Adaptive Control Design - Juniper Publishers

Abstract
The prevailing methodology in designing adaptive controllers for strongly nonlinear systems is based on Lyapunov’s PhD thesis, defended in 1892, on the stability of motion of systems whose equations of motion have no closed-form analytical solutions. The adaptive robot controllers developed in the 1990s guarantee global (often asymptotic) stability of the controlled system by using his ingenious Direct Method, which introduces a Lyapunov function for whose behavior relatively simple numerical limitations have to be proved. Though typical Lyapunov function candidates are available for various problem classes, the application of this method requires far more knowledge than the implementation of some algorithm. Besides requiring creative designer’s abilities, it often requires too much because it works with sufficient conditions instead of necessary and sufficient ones. To evade these difficulties, based on the firm mathematical background of constructing convergent iterative sequences by contractive maps in Banach spaces, an alternative to Lyapunov’s technique was introduced for digital controllers in 2008, in which only one step of the required iteration is carried out during each control cycle. Besides its simplicity, the main advantage of this approach is that it can evade the complete state estimation normally required in Lyapunov function-based design. Though the convergence of the control sequence can be guaranteed only within a bounded basin, this approach seems to have considerable advantages. In this paper the current state of the art of this approach is briefly summarized.
Keywords: Adaptive control; Lyapunov function; Banach space; Fixed point iteration
Abbreviations: AC: Adaptive Control; AFC: Acceleration Feedback Controller; AID: Adaptive Inverse Dynamics Controller; CTC: Computed Torque Control; FPI: Fixed Point Iteration; MRAC: Model Reference Adaptive Control; OC: Optimal Control; PID: Proportional, Integral, Derivative; RARC: Resolved Acceleration Rate Control; RHC: Receding Horizon Controller; SLAC: Slotine-Li Adaptive Controller
Introduction
There is a wide class of model-based control approaches in which the available approximate dynamic model of the system to be controlled is “directly used” without “being inserted” into the mathematical framework of “Optimal Control” (OC). A classical example is the “Computed Torque Control” (CTC) for robots [1]. However, in practice we have to cope with the imprecision (very often incompleteness) of the available system models (in robotics e.g. [1,2], modeling friction phenomena e.g. [3-7], in life sciences as modeling the glucose-insulin dynamics e.g. [8-11] or in anesthesia control e.g. [12-14]). Modeling such engines as aircraft turbojet motors is a quite complicated task that may need a multiple model approach [15-18]. A further practical problem is the existence and the consequences of unknown and unpredictable “external disturbances”. A possible way of coping with these practical difficulties is designing “Adaptive Controllers” (AC) that are somehow able to observe and correct at least the effects of the modeling imprecisions by “learning”. Depending on the available model information, various adaptive methods can be elaborated. If precise information is available on the kinematics of a robot and only approximate information on the mass distribution of its rigid links, the exact model parameters can be learned, as in the case of the “Adaptive Inverse Dynamics” (AID) and the “Slotine-Li Adaptive Controller” (SLAC) for robots, which are the direct adaptive extensions of CTC control. An alternative approach is the adaptive modification of the feedback gains or terms [19]. The “Model Reference Adaptive Control” (MRAC) has a double intent: a) it has to provide precise trajectory tracking, and b) it has to provide an outer, kinematics-based control loop with the illusion that a so-called “reference system”, instead of the actually controlled system, is under control (e.g. [20-22]).
The traditional approaches in controller design for strongly nonlinear systems are based on the PhD thesis by Lyapunov [23] that was later translated to Western languages (e.g. [24]). (In this context “strong nonlinearity” means that the use of a “linearized system model” in the vicinity of some “working point” is not satisfactory for practical use.) In Lyapunov’s “2nd” or “Direct Method” a Lyapunov function has to be constructed for the given particular problem (typical “candidates” are available for typical “problem classes”), and the non-positiveness of the time-derivative of this function has to be proved. Besides the fact that the creation of the Lyapunov function is not a simple application of some algorithm (it is rather a creative art), this method has various drawbacks: a) it works with “sufficient conditions” instead of “necessary and sufficient conditions” (i.e. it often requires too much by guaranteeing conditions that are not actually necessary), and b) its main emphasis is on the global (asymptotic) stability of the motion of the controlled system, without paying much attention to the “initial” or “transient” phase of the controlled motion (for instance, in life sciences a “transient” fluctuation can be lethal).
To cope with these difficulties, alternatives to the Lyapunov function-based adaptive design were suggested in [25], in which the primary design intent is keeping the initial “transients” at bay by turning the task of finding the necessary control signal into a fixed point problem [“Fixed Point Iteration” (FPI)] that is solved iteratively, so that in each digital control step only one step of the appropriate iteration has to be realized. The mathematical antecedents of this approach were established in the 17th century (e.g. [26-28]), and in 1922 its foundations were extended to quite general spaces by Stefan Banach [29,30]. The novelty in [25] was the application of this approach to control problems. In contrast to the “traditional” “Resolved Acceleration Rate Control” (RARC), in which only lower order derivatives or tracking error integrals are fed back in the control of a 2nd order physical system (e.g. [19,31-33]), in this approach the measured “acceleration” signals are also used, as in the “Acceleration Feedback Controllers” (AFC) (e.g. [34-38]).
In general, the most important “weak point” of the FPI-based approach is that it cannot guarantee global stability. The generated iterative control sequences converge to the solution of the control task only within a bounded basin that in principle can be left. To avoid this problem, heuristic tuning rules were introduced for one of the small number of adaptive parameters in [39-41]. In [42] essentially the same method was introduced in the design of a novel type of MRAC controllers, the applicability of which was investigated by simulations for the control of various systems (e.g. [43-46]). Observing that in the classical, Lyapunov function-based solutions such as the AID and SLAC controllers the parameter tuning rule obtained from the Lyapunov function has a simple geometric interpretation that is independent of the Lyapunov function itself, the FPI-based solution was combined with the tuning rule of the original solutions used for learning the “exact dynamic parameters” of the controlled system. Relieved of the burden of necessarily constructing some easily treatable quadratic Lyapunov function, the feedback provided by the FPI-based solution was directly used for parameter tuning. This solution resulted in precise trajectory tracking even in the initial phase of the learning process, in which the available approximate model parameters were still very imprecise [47,48]. In the present paper certain novel results on the further development of the FPI-based approach are summarized.
Discussion and Results
The structure of the FPI-based adaptive control
The block scheme of the FPI-based adaptive controller is given in Figure 1 for a 2nd order dynamical system, e.g. a robot [48]. In this case the 2nd time-derivative of the generalized (joint) coordinates, ddot{q}, can be instantaneously set by the control torque or force Q. On this basis, in the kinematic block an arbitrary desired joint acceleration ddot{q}^{Des} can be designed that drives the tracking error q^N(t) − q(t) to 0 if it is realized (q^N denotes the nominal trajectory). In practice this joint acceleration cannot be realized exactly, due to the imprecisions of the dynamic model the CTC controller uses for the calculation of the necessary forces. Therefore, instead of introducing this signal into the Approximate Model to calculate the necessary force, its deformed version ddot{q}^{Def} is introduced into it. The necessary deformation is produced iteratively in the form of a sequence that is initiated by the desired value itself, i.e. by ddot{q}^{Def}_0 = ddot{q}^{Des}. During one digital control step one step of the iteration can be realized. If there are no special time-delay effects in the system, the contents of the delay boxes in Figure 1 exactly correspond to the cycle time Δt of the controller. The “chain of operations” resulting in an observed realized response ddot{q}(t) for the input ddot{q}^{Def}(t) can mathematically be considered, approximately, as a response function f(ddot{q}^{Def}), since although it also depends on q and dot{q}, these arguments vary only slowly in comparison to ddot{q}^{Def}, which can be modified quite quickly. In the Adaptive Deformation Block of Figure 1 a deformation function is used as in [49]. Since, due to the proportional, integral and derivative error feedback terms, ddot{q}^{Des} varies only slowly, from cycle to cycle we have an approximately stationary fixed point problem. Regarding the convergence of this iteration, we have to take into account that a Banach space (here denoted by B) is a complete, linear, normed metric space. It is a convenient modeling tool that allows the use of simple norm estimations. Its completeness means that each self-convergent (Cauchy) sequence has a limit point within the space. A mapping F: B → B is contractive if there exists a real number 0 ≤ K < 1 so that ||F(a) − F(b)|| ≤ K ||a − b|| for every a, b in B. It is easy to show that the sequence generated by a contractive map as x_{n+1} = F(x_n) is a Cauchy sequence: in the norm estimation

||x_{n+L} − x_n|| ≤ ||x_{n+L} − x_{n+L−1}|| + … + ||x_{n+1} − x_n|| ≤ K^n (K^{L−1} + … + K + 1) ||x_1 − x_0|| ≤ (K^n / (1 − K)) ||x_1 − x_0||   (1)

for every L only high order powers of K occur as n → ∞, therefore ||x_{n+L} − x_n|| → 0. Due to the completeness of B the sequence converges to some limit x* within the space, and for an arbitrary element of the sequence x_n the contractivity gives

||F(x*) − x*|| ≤ ||F(x*) − x_{n+1}|| + ||x_{n+1} − x*|| ≤ K ||x* − x_n|| + ||x_{n+1} − x*|| → 0,   (2)

so according to (2) it holds that F(x*) = x*.
Consequently, it is enough to guarantee that the function F(.) is contractive and that it is constructed so that its fixed point is the solution of the control task; in this case the generated sequence converges to that solution.
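To make the contraction argument tangible, the following short Python sketch iterates a simple contractive scalar map and checks the geometric decay of the step size predicted by (1); the particular map F(x) = 0.5 cos(x) and all numbers are arbitrary toy choices, not taken from the paper.

import math

# A contractive map on the real line: |F'(x)| = 0.5*|sin(x)| <= K = 0.5 < 1,
# so by Banach's fixed point theorem the iteration x_{n+1} = F(x_n) converges
# to the unique fixed point x* with F(x*) = x*.
def F(x):
    return 0.5 * math.cos(x)

x = 3.0                       # arbitrary initial guess x_0
prev_step = None
for n in range(25):
    x_next = F(x)
    step = abs(x_next - x)    # ||x_{n+1} - x_n|| should shrink at least by the factor K
    if prev_step is not None and step > 0.5 * prev_step + 1e-15:
        print("contraction bound violated (should not happen)")
    prev_step = step
    x = x_next
print("approximate fixed point:", x, "residual |F(x)-x|:", abs(F(x) - x))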
Construction of the adaptive function
In the original solution in [25], the deformation function (3) was suggested for the special case q ∈ IR, with three adaptive parameters K_c, B_c, and A_c. Indeed, when the input of (3) is already the solution of the control task, the same value is obtained again, that is, the solution is a fixed point of the map. To obtain convergence in the vicinity of this fixed point, consider the 1st order Taylor series approximation of the deformation function around it, which leads to the linearized approximation (5). On the basis of (5) it is easy to set the adaptive parameters for convergence: by choosing an appropriately great parameter and a small A_c it can be achieved that the absolute value of the derivative of the map at the fixed point is smaller than 1, therefore the mapping is contractive and the sequence converges to the solution. The speed of convergence depends on the setting of A_c, and a too great value can cause the iteration to leave the region of convergence.
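As a rough illustration of this control-loop structure, the following Python sketch controls a single degree of freedom whose true mass differs from the modeled one, performing one fixed-point step per digital control cycle; the simple additive deformation rule and all gains below are illustrative stand-ins for the deformation function (3), chosen only to satisfy the contractivity condition discussed above, and are not the paper’s values.

import math

# Toy single degree-of-freedom example: the controller only knows an approximate
# mass m_approx, while the plant has the true mass m_true.
m_true, m_approx = 2.0, 1.0
dt = 1e-3                        # digital control cycle time (delta t)
kp, kd = 100.0, 20.0             # PD gains of the kinematic block
adaptive_gain = 0.5              # illustrative stand-in for the adaptive parameter

q, dq = 0.0, 0.0                 # plant state
deformation = 0.0                # additive deformation of the desired acceleration
prev_des, prev_real = 0.0, 0.0
for k in range(int(8.0 / dt)):
    t = k * dt
    q_nom, dq_nom, ddq_nom = math.sin(t), math.cos(t), -math.sin(t)
    # kinematic block: PD-type prescription of the desired 2nd time-derivative
    ddq_des = ddq_nom + kd * (dq_nom - dq) + kp * (q_nom - q)
    # one fixed-point step per cycle: push the previous cycle's response error to zero
    response_error = prev_real - prev_des
    deformation -= adaptive_gain * response_error
    ddq_def = ddq_des + deformation
    # the approximate model computes the force, the true plant responds differently
    Q = m_approx * ddq_def
    ddq_real = Q / m_true
    q, dq = q + dq * dt, dq + ddq_real * dt      # explicit Euler integration of the plant
    prev_des, prev_real = ddq_des, ddq_real
print("final tracking error |q_nom - q|:", abs(math.sin(8.0) - q))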
For q ∈ IR^n (multiple variable systems) a different construction was introduced in [50,51], the convergence properties of which are more lucid than those of the multiple variable variant of (3):
in which one expression can be identified as the “response error at time t”, the term normalized by its Frobenius norm corresponds to the unit vector directed along the response error, ζ: IR → IR is a differentiable contractive map with the attractive fixed point ζ(x*) = x*, and A_c ∈ IR is an adaptive control parameter. By using the same argumentation with the 1st order Taylor series approximation, it was shown in [52] that if the real parts of all eigenvalues of the Jacobian of the response function are simultaneously positive or simultaneously negative, an appropriate A_c parameter can be selected that guarantees convergence.
Illustrative example
As an illustrative example, a 2 degree-of-freedom nonlinear mechanical system containing springs was controlled in simulation, in which Q ∈ IR^2 denotes the control force and q ∈ IR^2 is the array of the generalized coordinates of the controlled system.
The parameters σ_1, σ_2 > 0 “modulate” the springs’ stiffness, and the direction of the spring force is calculated by the use of the “signum” function, e.g. as sign(q_1 − L_01). The approximate and the exact model parameter values are given in Table 1.
In the Kinematic Block a PID-type tracking strategy was prescribed for the integrated tracking error; in the simulations Λ = 6 s^{−1} was chosen, with ζ(x) = atanh(tanh(x + D)/2), D = 0.3 in (6). The choice A_c = −5×10^{−1} resulted in good convergence. Figures 2-6 illustrate the effects of using the adaptive deformation. It is evident that the tracking precision was considerably improved without any chattering effect of the kind that is typical of the similarly simple Sliding Mode / Variable Structure controllers (e.g. [53,54]). Figure 5 reveals that quite different control forces were applied in the non-adaptive and in the adaptive cases.
The essence of the adaptivity is revealed by Figure 6. In the non-adaptive case considerable PID corrections are added to the nominal second time-derivative, therefore ddot{q}^{Des} considerably differs from ddot{q}^N, and in the lack of adaptive deformation it is identical to ddot{q}^{Def}. However, the difference between the desired and the realized 2nd time-derivatives is quite considerable if no adaptive deformation is applied. In contrast to that, in the adaptive case ddot{q}^{Des} stays in the vicinity of ddot{q}^N, because only small PID corrections are needed when the trajectory tracking is precise. This desired value is very close to the realized 2nd time-derivative, which considerably differs from the adaptively deformed value ddot{q}^{Def}. That is, a quite considerable adaptive deformation was needed for precise trajectory tracking, due to the great modeling errors.
Further Possible Applications and Development
The applicability of the FPI-based adaptive control design methodology was investigated in various potential fields of application. In 2012 in [55] an adaptive emission control of freeway traffic was suggested by the use of the quasistationary solutions of an approximate hydrodynamic traffic model. In [56] an FPI-based adaptive control problem of relative order 4 was investigated. In [57] FPI-based control of the Hodgkin-Huxley Neuron was considered. In [58] the possible regulation of Propofol administration through wavelet-based FPI control in anaesthesia control was investigated.
In [59] the application of the FPI-based control in treating patients suffering from “Type 1 Diabetes Mellitus” was studied. The simplicity of the FPI-based method opened new prospects in the possible design of adaptive optimal controllers. In [60] the contradiction between the various requirements in OC was resolved in the case of underactuated mechanical systems in the following manner: instead of constructing a “cost function contribution” for each state variable whose motion needed control, consecutive time slots were introduced within which only one of the state variables was controlled with FPI-based adaptation. (The different sections may correspond to control tasks of different relative order.) In [61] it was pointed out that the FPI-based control can easily be combined with the mathematical framework of the “Receding Horizon Controllers” (RHC) (e.g. [62]). (A combination with the Lyapunov function-based adaptive approach would be far less plausible and simple.) In [49] this approach was extended to the control of systems with time-delay. The possibility of prescribing fractional order kinematic trajectory tracking in the FPI-based adaptive control was studied, too [63].
In [64] its applicability was investigated in treating angiogenic tumors. In [65,66] a further simplification of the adaptive RHC control was considered, in which the reduced gradient algorithm was replaced by an FPI in finding the zero gradient of the “Auxiliary Function” of the problem. In [67] the applicability of the method was experimentally verified in the adaptive control of a pulse-width modulation driven brushless DC motor that did not have satisfactory documentation (FIT0441 Brushless DC Motor with Encoder and Driver) and was braked by external forces, simply by periodically grabbing the rotating shaft with two fingers. The solution was based on a simple Arduino UNO microcontroller with the adaptive function defined in (3) embedded into the motor’s control algorithm. In spite of using 2nd time-derivatives in the feedback, no special noise filtering was applied. The measured and computed data were visualized on a common laptop. As can be seen in Figure 7, the rotational speed was kept almost constant (in spite of the very noisy measurement data), and the adaptive deformation and the control signal adapted well to the external braking forces, in harmony with the simulation results belonging to the “Illustrative Example” in subsection 2.3.
Figure 7: The experimental setup used for the verification of the FPI-based adaptive control in the case of a pulse-width modulated brushless electric DC motor; the nominal and the realized rotational speed (the average of the whole data set was 59.9383 rpm, the nominal constant value was 60 rpm); the “Desired” and adaptively “Deformed” 2nd time-derivatives of the rotational speed; the control signal (from [67], courtesy of Tamás Faitli).

In [68] the novel adaptive control approach was considered from the side of the Lyapunov function-based technique, and it was found that it can be interpreted as a novel methodology that is able to drive the Lyapunov function near zero and keep it in its vicinity afterwards. On this basis a new MRAC controller design was suggested in [69] that has similarity with the idea of the “Backstepping Controller” [70,71].
Conclusion
The FPI-based adaptive control approach was introduced at Óbuda University with the aim of evading the mathematical difficulties and restrictions, as well as the information need, related to the traditional Lyapunov function-based design. Its main point was the transformation of the control task into a fixed-point problem that was iteratively solved on the firm mathematical basis of Banach’s fixed point theorem. At the center of the new approach, instead of the requirement of global stability, the precise realization of a kinematically (kinetically) prescribed tracking error relaxation was placed as the primary design intent. In contrast to the traditional soft computing approaches such as fuzzy, neural network and neuro-fuzzy solutions, which normally apply huge structures with an ample number of parameters as universal approximators of continuous multiple variable functions on the basis of Kolmogorov’s approximation theorem (e.g. [72-74]), this approach has only a few independent adaptive parameters that can easily be set, and one of them can be tuned for maintaining the convergence of the control algorithm. It was shown that the simplicity of this approach allows its combination with more “traditional” approaches, such as those learning the exact model parameters of the controlled system, and at various levels of optimal controllers, such as the RHC control. On the basis of ample simulation investigations it can be stated that the suggested approach has a wide area of potential applications (in the control of mechanical devices, in life sciences, in traffic control, etc.) where essential nonlinearities, the lack of precise and complete system models, and limited possibilities for obtaining information on the controlled system’s state are the main difficulties. It seems expedient to invest more effort into experimental investigations.
Acknowledgement
The Authors express their gratitude to the Antal Bejczy Center for Intelligent Robotics and to the Doctoral School of Applied Informatics and Applied Mathematics for supporting their work.
Learning robot objectives from physical human interaction
http://bit.ly/2C3WaNf
Humans physically interact with each other every day – from grabbing someone’s hand when they are about to spill their drink, to giving your friend a nudge to steer them in the right direction, physical interaction is an intuitive way to convey information about personal preferences and how to perform a task correctly.
So why aren’t we physically interacting with current robots the way we do with each other? Seamless physical interaction between a human and a robot requires a lot: lightweight robot designs, reliable torque or force sensors, safe and reactive control schemes, the ability to predict the intentions of human collaborators, and more! Luckily, robotics has made many advances in the design of personal robots specifically developed with humans in mind.
However, consider the example from the beginning where you grab your friend’s hand as they are about to spill their drink. Instead of your friend who is spilling, imagine it was a robot. Because state-of-the-art robot planning and control algorithms typically assume human physical interventions are disturbances, once you let go of the robot, it will resume its erroneous trajectory and continue spilling the drink. The root of this gap lies in how robots reason about physical interaction: instead of thinking about why the human physically intervened and replanning in accordance with what the human wants, most robots simply resume their original behavior after the interaction ends.
We argue that robots should treat physical human interaction as useful information about how they should be doing the task. We formalize reacting to physical interaction as an objective (or reward) learning problem and propose a solution that enables robots to change their behaviors while they are performing a task according to the information gained during these interactions.
Reasoning About Physical Interaction: Unknown Disturbance versus Intentional Information
The field of physical human-robot interaction (pHRI) studies the design, control, and planning problems that arise from close physical interaction between a human and a robot in a shared workspace. Prior research in pHRI has developed safe and responsive control methods to react to a physical interaction that happens while the robot is performing a task. Proposed by Hogan et al., impedance control is one of the most commonly used methods to move a robot along a desired trajectory when there are people in the workspace. With this control method, the robot acts like a spring: it allows the person to push it, but moves back to an original desired position after the human stops applying forces. While this strategy is very fast and enables the robot to safely adapt to the human’s forces, the robot does not leverage these interventions to update its understanding of the task. Left alone, the robot would continue to perform the task in the same way as it had planned before any human interactions.
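For intuition, here is a minimal one-dimensional sketch of impedance control: the robot behaves like a spring-damper around its desired position, yields to a human push, and then returns once the force is removed. The unit mass and the stiffness and damping gains are illustrative assumptions, not values from any particular controller.

# Minimal 1-D impedance control sketch: unit mass, hand-picked stiffness and
# damping. The robot yields to the human's push but springs back afterwards.
k, b, m, dt = 50.0, 10.0, 1.0, 0.01
x, v, x_des = 0.0, 0.0, 0.0

for step in range(300):
    u_H = 5.0 if 50 <= step < 100 else 0.0    # the human pushes for half a second
    u_R = k * (x_des - x) - b * v             # impedance law: spring-damper toward x_des
    a = (u_R + u_H) / m                       # the robot complies with the combined forces...
    x, v = x + v * dt, v + a * dt
# ...but returns to x_des after the push ends, without updating its objective.
print("final position (back near x_des = 0):", round(x, 4))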
Why is this the case? It boils down to what assumptions the robot makes about its knowledge of the task and the meaning of the forces it senses. Typically, a robot is given a notion of its task in the form of an objective function. This objective function encodes rewards for different aspects of the task like “reach a goal at location X” or “move close to the table while staying far away from people”. The robot uses its objective function to produce a motion that best satisfies all the aspects of the task: for example, the robot would move toward goal X while choosing a path that is far from a human and close to the table. If the robot’s original objective function was correct, then any physical interaction is simply a disturbance from its correct path. Thus, the robot should allow the physical interaction to perturb it for safety purposes, but it will return to the original path it planned since it stubbornly believes it is correct.
In contrast, we argue that human interventions are often intentional and occur because the robot is doing something wrong. While the robot’s original behavior may have been optimal with respect to its pre-defined objective function, the fact that a human intervention was necessary implies that the original objective function was not quite right. Thus, physical human interactions are no longer disturbances but rather informative observations about what the robot’s true objective should be. With this in mind, we take inspiration from inverse reinforcement learning (IRL), where the robot observes some behavior (e.g., being pushed away from the table) and tries to infer an unknown objective function (e.g., “stay farther away from the table”). Note that while many IRL methods focus on the robot doing better the next time it performs the task, we focus on the robot completing its current task correctly.
Formalizing Reacting to pHRI
With our insight on physical human-robot interactions, we can formalize pHRI as a dynamical system, where the robot is unsure about the correct objective function and the human’s interactions provide it with information. This formalism defines a broad class of pHRI algorithms, which includes existing methods such as impedance control, and enables us to derive a novel online learning method.
We will focus on two parts of the formalism: (1) the structure of the objective function and (2) the observation model that lets the robot reason about the objective given a human physical interaction. Let x be the robot’s state (e.g., position and velocity) and u_R be the robot’s action (e.g., the torque it applies to its joints). The human can physically interact with the robot by applying an external torque, called u_H, and the robot moves to the next state via its dynamics, dot{x} = f(x,u_R+u_H).
The Robot Objective: Doing the Task Right with Minimal Human Interaction
In pHRI, we want the robot to learn from the human, but at the same time we do not want to overburden the human with constant physical intervention. Hence, we can write down an objective for the robot that optimizes both completing the task and minimizing the amount of interaction required, ultimately trading off between the two.
r(x,u_R,u_H;theta) = theta^{top} phi(x,u_R,u_H) - ||u_H||^2
Here, phi(x,u_R,u_H) encodes the task-related features (e.g., “distance to table”, “distance to human”, “distance to goal”) and theta determines the relative weight of each of these features. In the function, theta encapsulates the true objective – if the robot knew exactly how to weight all the aspects of its task, then it could compute how to perform the task optimally. However, this parameter is not known by the robot! Robots will not always know the right way to perform a task, and certainly not the human-preferred way.
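As a concrete toy instantiation of this reward, the sketch below evaluates r with three hypothetical features and a hand-picked theta; the feature definitions and all numbers are assumptions made here for illustration only.

import numpy as np

# Hypothetical task features phi(x, u_R, u_H); in the paper these would be
# quantities like distance-to-goal, distance-to-table, and so on.
def phi(x, u_R, u_H):
    dist_to_goal = np.linalg.norm(x[:2] - np.array([1.0, 0.5]))
    dist_to_table = abs(x[1])
    effort = float(u_R @ u_R)
    return np.array([-dist_to_goal, -dist_to_table, -effort])

def reward(x, u_R, u_H, theta):
    # r(x, u_R, u_H; theta) = theta^T phi(x, u_R, u_H) - ||u_H||^2
    return float(theta @ phi(x, u_R, u_H)) - float(u_H @ u_H)

theta = np.array([1.0, 0.5, 0.1])   # the true weights, unknown to the robot in general
x = np.array([0.2, 0.3])            # toy state: a 2-D position
u_R = np.array([0.1, 0.0])          # robot action
u_H = np.zeros(2)                   # no human intervention at this timestep
print("reward under the true weights:", reward(x, u_R, u_H, theta))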
The Observation Model: Inferring the Right Objective from Human Interaction
As we have argued, the robot should observe the human’s actions to infer the unknown task objective. To link the direct human forces that the robot measures with the objective function, the robot uses an observation model. Building on prior work in maximum entropy IRL as well as the Boltzmann distributions used in cognitive science models of human behavior, we model the human’s interventions as corrections which approximately maximize the robot’s expected reward at state x while taking action u_R+u_H. This expected reward encompasses the immediate and future rewards and is captured by the Q-value:
P(u_H | x, u_R; theta) ∝ e^{Q(x, u_R + u_H; theta)}
Intuitively, this model says that a human is more likely to choose a physical correction that, when combined with the robot’s action, leads to a desirable (i.e., high-reward) behavior.
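A rough sketch of this observation model is given below, assuming the robot can evaluate an (approximate) Q-value for a small set of candidate corrections; the Q surrogate and the discrete candidate set are simplifications assumed here for illustration, not the planner used in the paper.

import numpy as np

# A stand-in Q-value: how good is the combined action u = u_R + u_H from state x
# under weights theta (here just negative distance-to-goal after a short step).
def Q_value(x, u, theta):
    goal = np.array([1.0, 0.5])
    return -theta[0] * np.linalg.norm((x + 0.1 * u) - goal)

def correction_likelihoods(x, u_R, candidate_u_H, theta):
    # Boltzmann observation model: P(u_H | x, u_R; theta) is proportional to exp(Q(x, u_R + u_H; theta))
    scores = np.array([Q_value(x, u_R + u_H, theta) for u_H in candidate_u_H])
    scores -= scores.max()                      # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

x, u_R = np.array([0.0, 0.0]), np.array([0.0, 0.1])
candidates = [np.zeros(2), np.array([0.5, 0.0]), np.array([-0.5, 0.0])]
theta = np.array([1.0])
print(correction_likelihoods(x, u_R, candidates, theta))
# Corrections that steer the combined action toward higher Q are judged more probable.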
Learning from Physical Human-Robot Interactions in Real-Time
Much like teaching another human, we expect that the robot will continuously learn while we interact with it. However, the learning framework that we have introduced requires that the robot solve a Partially Observable Markov Decision Process (POMDP); unfortunately, it is well known that solving POMDPs exactly is at best computationally expensive, and at worst intractable. Nonetheless, we can derive approximations from this formalism that can enable the robot to learn and act while humans are interacting.
To achieve such in-task learning, we make three approximations summarized below:
1) Separate estimating the true objective from solving for the optimal control policy. This means at every timestep, the robot updates its belief over possible theta values, and then re-plans an optimal control policy with the new distribution.
2) Separate planning from control. Computing an optimal control policy means computing the optimal action to take at every state in a continuous state, action, and belief space. Although re-computing a full optimal policy after every interaction is not tractable in real-time, we can re-compute an optimal trajectory from the current state in real-time. This means that the robot first plans a trajectory that best satisfies the current estimate of the objective, and then uses an impedance controller to track this trajectory. The use of impedance control here gives us the nice properties described earlier, where people can physically modify the robot’s state while still being safe during interaction.
Looking back at our estimation step, we will make a similar shift to trajectory space and modify our observation model to reflect this:
P(u_H | x, u_R; theta) ∝ e^{Q(x, u_R + u_H; theta)}   →   P(xi_H | xi_R; theta) ∝ e^{R(xi_H, xi_R; theta)}
Now, our observation model depends only on the cumulative reward R along a trajectory, which is easily computed by summing up the reward at each timestep. With this approximation, when reasoning about the true objective, the robot only has to consider the likelihood of a human’s preferred trajectory, xi_H, given the current trajectory it is executing, xi_R.
But what is the human’s preferred trajectory, xi_H? The robot only gets to directly measure the human’s force u_H. One way to infer the human’s preferred trajectory is by propagating the human’s force along the robot’s current trajectory, xi_R. Figure 1 builds up the trajectory deformation based on prior work from Losey and O’Malley, starting from the robot’s original trajectory, then the force application, and then the deformation that produces xi_H.
Fig 1. To infer the human’s preferred trajectory given the current planned trajectory, the robot first measures the human’s interaction force, u_H, and then smoothly deforms the waypoints near the interaction point to get the human’s preferred trajectory, xi_H.
3) Plan with maximum a posteriori (MAP) estimate of theta. Finally, because theta is a continuous variable and potentially high-dimensional, and since our observation model is not Gaussian, rather than planning with the full belief over theta, we will plan only with the MAP estimate. We find that the MAP estimate under a 2nd order Taylor Series Expansion about the robot’s current trajectory with a Gaussian prior is equivalent to running online gradient descent:
theta^{t+1} = theta^{t} + alpha(Phi(xi^t_H) - Phi(xi^t_R))
At every timestep, the robot updates its estimate of theta in the direction of the cumulative feature difference, Phi(xi) = sum_{x^t in xi} phi(x^t), between its current optimal trajectory and the human’s preferred trajectory. In the Learning from Demonstration literature, this update rule is analogous to online Max Margin Planning; it is also analogous to coactive learning, where the user modifies waypoints for the current task to teach a reward function for future tasks.
Ultimately, putting these three steps together leads us to an elegant approximate solution to the original POMDP. At every timestep, the robot plans a trajectory xi_R and begins to move. The human can physically interact, enabling the robot to sense their force u_H. The robot uses the human’s force to deform its original trajectory and produce the human’s desired trajectory, xi_H. Then the robot reasons about what aspects of the task are different between its original and the human’s preferred trajectory, and updates theta in the direction of that difference. Using the new feature weights, the robot replans a trajectory that better aligns with the human’s preferences.
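Putting the three approximations together, a highly simplified skeleton of this online update could look like the following; the straight-line “planner”, the single height feature, the Gaussian deformation weights, and all constants are placeholders assumed for illustration rather than the actual implementation.

import numpy as np

N, dt, alpha = 20, 0.1, 0.05                   # waypoints, timestep, learning rate

def features_along(xi):
    # Cumulative features Phi(xi) = sum_t phi(x^t); here a single hypothetical
    # feature: the waypoint heights above the table, so theta has one weight.
    return np.array([np.sum(xi[:, 1])])

def plan_trajectory(theta, start, goal):
    # Placeholder "planner": a straight line whose height is shifted in proportion
    # to theta[0] (standing in for a real trajectory optimizer).
    line = np.linspace(start, goal, N)
    line[:, 1] += 0.1 * theta[0] * np.sin(np.linspace(0.0, np.pi, N))
    return line

def deform(xi_R, t_interact, u_H):
    # Smoothly propagate the measured human force to nearby waypoints, a simplified
    # stand-in for the energy-based deformation of Losey and O'Malley.
    weights = np.exp(-0.5 * ((np.arange(N) - t_interact) / 3.0) ** 2)
    return xi_R + np.outer(weights, u_H) * dt

theta = np.array([0.0])                        # initial guess: height does not matter
start, goal = np.array([0.0, 0.5]), np.array([1.0, 0.5])
for step in range(5):
    xi_R = plan_trajectory(theta, start, goal)       # (1) plan with the current estimate
    u_H = np.array([0.0, -1.0])                      # human pushes down, toward the table
    xi_H = deform(xi_R, t_interact=10, u_H=u_H)      # (2) infer the preferred trajectory
    # (3) online gradient step: theta <- theta + alpha * (Phi(xi_H) - Phi(xi_R))
    theta = theta + alpha * (features_along(xi_H) - features_along(xi_R))
    print("step", step, "theta =", theta)
# theta becomes negative (height is penalized), so replanned trajectories move
# closer to the table and should require fewer corrections over time.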
For a more thorough description of our formalism and approximations, please see our recent paper from the 2017 Conference on Robot Learning.
Learning from Humans in the Real World
To evaluate the benefits of in-task learning on a real personal robot, we recruited 10 participants for a user study. Each participant interacted with the robot running our proposed online learning method as well as a baseline where the robot did not learn from physical interaction and simply ran impedance control.
Fig 2. shows the three experimental household manipulation tasks, in each of which the robot started with an initially incorrect objective that participants had to correct. For example, the robot would move a cup from the shelf to the table, but without worrying about tilting the cup (perhaps not noticing that there is liquid inside).
Fig 2. Trajectory generated with initial objective marked in black, and the desired trajectory from true objective in blue. Participants need to correct the robot to teach it to hold the cup upright (left), move closer to the table (center), and avoid going over the laptop (right).
We measured the robot’s performance with respect to the true objective, the total effort the participant exerted, the total amount of interaction time, and the responses of a 7-point Likert scale survey.
In Task 1, participants have to physically intervene when they see the robot tilting the cup and teach the robot to keep the cup upright.
Task 2 had participants teaching the robot to move closer to the table.
For Task 3, the robot’s original trajectory goes over a laptop. Participants have to physically teach the robot to move around the laptop instead of over it.
The results of our user studies suggest that learning from physical interaction leads to better robot task performance with less human effort. Participants were able to get the robot to execute the correct behavior faster with less effort and interaction time when the robot was actively learning from their interactions during the task. Additionally, participants believed the robot understood their preferences more, took less effort to interact with, and was a more collaborative partner.
Fig 3. Learning from interaction significantly outperformed not learning for each of our objective measures, including task cost, human effort, interaction time.
Ultimately, we propose that robots should not treat human interactions as disturbances, but rather as informative actions. We showed that robots imbued with this sort of reasoning are capable of updating their understanding of the task they are performing and completing it correctly, rather than relying on people to guide them until the task is done.
This work is merely a step in exploring learning robot objectives from pHRI. Many open questions remain including developing solutions that can handle dynamical aspects (like preferences about the timing of the motion) and how and when to generalize learned objectives to new tasks. Additionally, robot reward functions will often have many task-related features and human interactions may only give information about a certain subset of relevant weights. Our recent work in HRI 2018 studied how a robot can disambiguate what the person is trying to correct by learning about only a single feature weight at a time. Overall, not only do we need algorithms that can learn from physical interaction with humans, but these methods must also reason about the inherent difficulties humans experience when trying to kinesthetically teach a complex – and possibly unfamiliar – robotic system.
Thank you to Dylan Losey and Anca Dragan for their helpful feedback in writing this blog post. This article was initially published on the BAIR blog, and appears here with the authors’ permission.
This post is based on the following papers:
A. Bajcsy* , D.P. Losey*, M.K. O’Malley, and A.D. Dragan. Learning Robot Objectives from Physical Human Robot Interaction. Conference on Robot Learning (CoRL), 2017.
A. Bajcsy , D.P. Losey, M.K. O’Malley, and A.D. Dragan. Learning from Physical Human Corrections, One Feature at a Time. International Conference on Human-Robot Interaction (HRI), 2018.