Reinforcement Learning is about learning a mapping from states to a probability distribution over actions. This is called the policy.

**Policy** = p(s,a) = probability of taking action a when in state s

S = set of all states (assumed finite)

s_{t} ∈ S = state at time t

A(s_{t}) = set of all possible actions given that the agent is in state s_{t}

a_{t} = action at time t

r_{t} ∈ R (the reals) = reward at time t

At each timestep t=1,2,3,...

- the agent finds itself in a state s_{t} ∈ S and
- on that basis chooses an action a_{t} ∈ A(s_{t}).
- One timestep later, the agent receives a reward r_{t+1} and
- finds itself in a new state s_{t+1}.
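The interaction loop above can be sketched in code. Everything concrete here (the two states, the two actions, the dynamics in `step`, the uniform-random policy) is invented purely for illustration; only the shape of the loop, choose a_{t} from A(s_{t}), then receive r_{t+1} and s_{t+1}, comes from the text.

```python
import random

# Toy action sets A(s) for a made-up two-state MDP (an assumption).
ACTIONS = {"A": ["stay", "go"], "B": ["stay", "go"]}

def step(s, a):
    """Return (reward, next_state) for the toy dynamics (made up)."""
    if a == "go":
        return 1.0, ("B" if s == "A" else "A")  # switching states pays 1
    return 0.0, s                               # staying pays 0

def run_episode(T=5, seed=0):
    """Run the agent-environment loop for T timesteps."""
    rng = random.Random(seed)
    s, history = "A", []
    for t in range(T):
        a = rng.choice(ACTIONS[s])   # policy: uniform over A(s_t)
        r, s_next = step(s, a)       # environment returns r_{t+1}, s_{t+1}
        history.append((s, a, r, s_next))
        s = s_next
    return history
```

Each entry of `history` is one (s_{t}, a_{t}, r_{t+1}, s_{t+1}) transition.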

The **return**, ret_{t}, is the total reward received starting at time t+1:

ret_{t} = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_{f}

where r_{f} is the reward at the final time step (which may be infinitely far in the future).

and the **discounted return** is

ret_{t} = r_{t+1} + g r_{t+2} + g^{2} r_{t+3} + ...

where 0 <= g <= 1 is called the **discount factor**.
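The discounted sum above can be computed from a finite list of rewards as a quick sanity check. This is a minimal sketch; the function name and the backwards (Horner-style) evaluation are my choices, not from the text.

```python
def discounted_return(rewards, g=0.9):
    """ret_t = r_{t+1} + g*r_{t+2} + g^2*r_{t+3} + ...

    `rewards` is the list [r_{t+1}, r_{t+2}, ...] and 0 <= g <= 1
    is the discount factor.
    """
    ret = 0.0
    for r in reversed(rewards):   # Horner's rule: fold from the far end
        ret = r + g * ret
    return ret
```

For example, `discounted_return([1.0, 1.0, 1.0], g=0.5)` evaluates 1 + 0.5 + 0.25 = 1.75, and with g = 1 the function reduces to the undiscounted return.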

We assume that the number of states and actions is finite. We then define the **state
transition probabilities** to be:

P^{a}_{ss'} = Pr( s_{t+1} = s' | s_{t} = s, a_{t} = a )

This is just the probability of transitioning from state s to state s' when action a has been taken.
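One natural way to store the state transition probabilities with finitely many states and actions is a nested table indexed as P[s][a][s']. The states, actions, and numbers below are made up for illustration; the only real constraint is that each (s, a) row is a probability distribution over next states.

```python
# P[s][a][s'] = probability of moving to s' from s under action a.
# All entries are invented example values.
P = {
    "A": {"stay": {"A": 1.0}, "go": {"A": 0.2, "B": 0.8}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 0.9, "B": 0.1}},
}

# Sanity check: every (s, a) row must sum to 1.
for s, by_action in P.items():
    for a, dist in by_action.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-12
```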

**Expected Rewards**

The **expected reward** received on transitioning from state s to state s' under action a is

R^{a}_{ss'} = E[ r_{t+1} | s_{t} = s, a_{t} = a, s_{t+1} = s' ]

The **value function for policy** p is the expected discounted return when starting in state s and following p thereafter:

V^{p}(s) = E_{p}[ ret_{t} | s_{t} = s ]

The **action-value function for policy** p is the expected discounted return when starting in state s, taking action a, and following p thereafter:

Q^{p}(s,a) = E_{p}[ ret_{t} | s_{t} = s, a_{t} = a ]
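Given P, R, and a policy p, the value function V^{p} can be approximated by iterative policy evaluation, repeatedly applying the expectation V(s) <- sum_a p(s,a) sum_{s'} P^{a}_{ss'} (R^{a}_{ss'} + g V(s')). This is a standard method, not one described in the text, and the nested-dict layout for P, R, and the policy is an assumption.

```python
def evaluate_policy(P, R, policy, g=0.9, iters=500):
    """Iteratively approximate V^p for a finite MDP.

    P[s][a][s'] and R[s][a][s'] are transition probabilities and expected
    rewards; policy[s][a] = p(s,a) is the probability of action a in state s.
    """
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {
            s: sum(
                pa * sum(pr * (R[s][a][s2] + g * V[s2])
                         for s2, pr in P[s][a].items())
                for a, pa in policy[s].items()
            )
            for s in P
        }
    return V
```

As a check on a one-state MDP that pays reward 1 every step, the geometric series gives V = 1/(1-g), e.g. 2.0 when g = 0.5.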

**Goal:** Find the policy that gives the greatest return over the long run. We say a policy
p is better than or equal to a policy p' if V^{p}(s) >= V^{p'}(s) for all s. There is
always at least one policy that is better than or equal to all others. Such a
policy is called an optimal policy and is denoted by p*. Its corresponding value
function is called V*:

V*(s) = V^{p*}(s) = max_p V^{p}(s) , for all s

and the optimal action-value function

Q*(s,a) = Q^{p*}(s,a) = max_p Q^{p}(s,a) , for all s, a

The Bellman optimality equation is then

V*(s) = max_{a ∈ A(s)} sum_{s'} P^{a}_{ss'} [ R^{a}_{ss'} + g V*(s') ]

This equation has a unique solution. It is a system of |S| equations in |S| unknowns. If P and R were known then, in principle, it could be solved using some method for solving systems of nonlinear equations. Once V* is known, the optimal policy follows by acting greedily: in each state, choose an action that attains the maximum on the right-hand side.
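The "solve for V*, then act greedily" recipe can be sketched with value iteration, a standard fixed-point method for the Bellman optimality equation (it is not the only solution method, and the nested-dict encoding of P and R is an assumption carried over from the examples above):

```python
def value_iteration(P, R, g=0.9, tol=1e-10):
    """Solve V*(s) = max_a sum_s' P[s][a][s'] * (R[s][a][s'] + g*V*(s'))
    by repeated application, then extract the greedy policy."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {
            s: max(
                sum(pr * (R[s][a][s2] + g * V[s2])
                    for s2, pr in P[s][a].items())
                for a in P[s]
            )
            for s in P
        }
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            break
        V = V_new
    # Greedy policy: in each state, an action attaining the max above.
    pi = {
        s: max(
            P[s],
            key=lambda a: sum(pr * (R[s][a][s2] + g * V[s2])
                              for s2, pr in P[s][a].items()),
        )
        for s in P
    }
    return V, pi
```

On a toy two-state MDP where "go" always pays 1 and "stay" pays 0, the greedy policy is "go" in both states and, with g = 0.5, V* = 1/(1-g) = 2 in both states.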
