Reinforcement Learning
The environment is typically formulated as a finite-state Markov Decision Process (MDP), and reinforcement learning algorithms for this setting are closely related to Dynamic Programming techniques. State transition probabilities and reward probabilities in the MDP are typically stochastic but stationary over the course of the problem.

Reinforcement learning differs from the Supervised Learning problem in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been studied mostly through the Multi-armed Bandit problem.

Formally, the basic reinforcement learning model consists of:
# a set of environment states <math>S</math>;
# a set of actions <math>A</math>; and
# a set of scalar "rewards" in <math>\mathbb{R}</math>.

At each time <math>t</math>, the agent perceives its state <math>s_t \in S</math> and the set of possible actions <math>A(s_t)</math>. It chooses an action <math>a \in A(s_t)</math> and receives from the environment the new state <math>s_{t+1}</math> and a reward <math>r_t</math>. Based on these interactions, the reinforcement learning agent must develop a policy <math>\pi : S \rightarrow A</math> which maximizes the quantity <math>R = r_0 + r_1 + \cdots + r_n</math> for MDPs which have a terminal state, or the quantity <math>R = \sum_t \gamma^t r_t</math> for MDPs without terminal states (where <math>\gamma</math> is some "future reward" discounting factor between 0.0 and 1.0).

Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including Robot Control, elevator scheduling, Telecommunications, Backgammon and Chess.

ALGORITHMS

After we have defined an appropriate return function to be maximized, we need to specify the algorithm that will be used to find the policy with the maximum return. There are two main approaches: the value function approach and the direct approach.
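Before turning to the two approaches, the return being maximized can be made concrete. The sketch below assumes the terminating-case quantity is the plain sum of the rewards of one episode and the non-terminating-case quantity is the discounted sum of gamma^t * r_t, truncated to the rewards actually observed. The function names `episodic_return` and `discounted_return` are illustrative and do not come from the article.

```python
from typing import Sequence


def episodic_return(rewards: Sequence[float]) -> float:
    """Undiscounted return R = r_0 + r_1 + ... + r_n of one terminated episode."""
    return float(sum(rewards))


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted return R = sum_t gamma**t * r_t over the observed rewards.

    Assumption: an infinite run is approximated by the finite prefix given.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie between 0.0 and 1.0")
    total = 0.0
    # Accumulate from the last reward backwards: total = r_t + gamma * total.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma` close to 0 the agent values almost only the immediate reward, while values near 1 weight distant rewards nearly as much as near ones, which is the long-term versus short-term trade-off mentioned above.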
The direct approach entails the following two steps:
a) For each possible policy, sample returns while following it.
b) Choose the policy with the largest expected return.

One problem with this is that the number of policies can be extremely large, or even infinite. Another is that returns might be stochastic, in which case a large number of samples will be required to accurately estimate the return of each policy. The direct approach is the basis for the algorithms used in Evolutionary Robotics.

The problems with the direct approach might be ameliorated if we assume some structure in the problem and somehow allow samples generated from one policy to influence the estimates made for another. Value function approaches do this by maintaining only a set of estimates of expected returns for one policy <math>\pi</math> (usually either the current or the optimal one). In such approaches one attempts to estimate either the expected return starting from state <math>s</math> and following <math>\pi</math> thereafter,
:<math>V(s) = E[R \mid s, \pi],</math>
or the expected return when taking action <math>a</math> in state <math>s</math> and following <math>\pi</math> thereafter,
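:<math>Q(s,a) = E[R \mid s, \pi, a].</math>
Given a model of the environment in the form of transition probabilities <math>P(s' \mid s,a)</math>, <math>Q</math> can be calculated from <math>V</math> through
:<math>Q(s,a) = \sum_{s'} V(s') P(s' \mid s,a).</math>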
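The model-based formula above, Q(s,a) as the sum over s' of V(s') P(s'|s,a), can be sketched in a few lines. This is a minimal illustration that follows the formula literally, with no separate immediate-reward or discount term. The dictionary layout of the model `P`, the toy states and actions, and the helper `greedy_action` (one natural use of Q, picking the highest-valued action) are assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, Tuple

State = Hashable
Action = Hashable
# Assumed layout: P[(s, a)] maps each successor s' to the probability P(s' | s, a).
Model = Dict[Tuple[State, Action], Dict[State, float]]


def q_from_v(V: Dict[State, float], P: Model, s: State, a: Action) -> float:
    """Q(s, a) = sum over s' of V(s') * P(s' | s, a)."""
    return sum(prob * V[s_next] for s_next, prob in P[(s, a)].items())


def greedy_action(V: Dict[State, float], P: Model, s: State,
                  actions: Iterable[Action]) -> Action:
    """Return the action with the largest Q(s, a) among the given actions."""
    return max(actions, key=lambda a: q_from_v(V, P, s, a))


# Hypothetical three-state problem used only to exercise the functions.
V = {"s0": 0.0, "s1": 1.0, "s2": 4.0}
P = {
    ("s0", "left"): {"s1": 1.0},
    ("s0", "right"): {"s1": 0.5, "s2": 0.5},
}
best = greedy_action(V, P, "s0", ["left", "right"])
```

Here `q_from_v(V, P, "s0", "right")` weights the two successor values by their probabilities, so the riskier action wins whenever its expected successor value is higher.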
Given a fixed policy <math>\pi</math>, estimating <math>E[R \mid \cdot]</math> for <math>\gamma=0</math> is trivial, as one only has to average the immediate rewards. The most obvious way to do this for <math>\gamma>0</math> is to average the total return after each state. However, this type of Monte Carlo sampling requires the MDP to terminate.

The expected return also satisfies the recursive relation
:<math>E[R \mid s_t] = r_t + \gamma E[R \mid s_{t+1}].</math>
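The Monte Carlo averaging described above, and its link to the recursion, can be sketched as follows. This is a hedged illustration, not the article's algorithm: it assumes episodes are recorded as lists of (state, reward) pairs gathered while following one fixed policy, and it averages the return after every visit to a state (an "every-visit" variant, chosen here for brevity). The name `monte_carlo_values` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple

# Assumed record format: (state s_t, reward r_t obtained after acting in s_t).
Step = Tuple[Hashable, float]


def monte_carlo_values(episodes: Sequence[Sequence[Step]],
                       gamma: float) -> Dict[Hashable, float]:
    """Estimate E[R | s] by averaging the return observed after each visit to s.

    Every episode must be finite, i.e. the MDP has to terminate.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so that ret = r_t + gamma * (return from s_{t+1}).
        for state, reward in reversed(episode):
            ret = reward + gamma * ret
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Hypothetical deterministic chain A -> B -> C -> terminal, sampled twice.
episode = [("A", 1.0), ("B", 0.0), ("C", 4.0)]
values = monte_carlo_values([episode, episode], gamma=0.5)
```

On this deterministic chain the estimates satisfy the recursion exactly, for example `values["A"] == 1.0 + 0.5 * values["B"]`. With stochastic rewards or transitions the relation holds only in expectation, so sampled averages match it approximately.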