Components of Reinforcement Learning

Reinforcement learning has 3 basic components:

Each action is associated with a reward. The objective is for the agent to choose actions so as to maximize the expected reward over some period of time.

Example: The n-Armed Bandit


There are n levers that can be pulled.

The action at each step is to choose a lever to pull.

The rewards are the payoffs for hitting the jackpot. Each arm has some average reward, called its value. If you know the values then the solution is trivial: always pick the lever with the largest value.

What if you don't know the values of any of the arms? What is the best approach for estimating the value while at the same time maximizing your reward?

Greedy Approach: Always pick the arm with the largest estimated value. This is called exploiting your current knowledge.

Non-Greedy Approach: If you select a nongreedy action (an arm other than the one with the largest estimated value) then you are said to be exploring.

Balanced Approach: Choose a balance between exploration and exploitation. The balance partly depends on how many plays you get. If you have 1 play then the best approach is exploitation. However, if there are many plays you will need some combination. With exploration, the reward will be lower in the short term but can be higher in the long run, because the value estimates improve.


Q*(a) = true (actual) value of taking action a

Qt(a) = estimated value of taking action a at play t = (sum of rewards received when a was taken)/(number of times a was taken)

As the number of times a is taken -> infinity, Qt(a) -> Q*(a)
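The estimate Qt(a) can be sketched in a few lines of Python. This is only an illustration: the reward list is made up, the default estimate of 0 for an arm that has never been pulled is an assumption, and the function names are hypothetical. The second function computes the same average incrementally, so past rewards do not need to be stored.

```python
def sample_average(rewards):
    """Estimate an arm's value as the mean of the rewards observed from it."""
    if not rewards:
        return 0.0  # assumed default estimate before the arm has been pulled
    return sum(rewards) / len(rewards)


def incremental_update(estimate, count, reward):
    """Fold one new reward into a running mean without storing past rewards.

    `count` is the number of rewards already included in `estimate`.
    """
    count += 1
    estimate += (reward - estimate) / count
    return estimate, count


rewards_for_arm = [1.0, 0.0, 0.0, 1.0, 1.0]  # made-up payoffs for one arm
batch = sample_average(rewards_for_arm)

estimate, count = 0.0, 0
for r in rewards_for_arm:
    estimate, count = incremental_update(estimate, count, r)
```

Both `batch` and `estimate` hold the same average of the five rewards; the incremental form only needs the current estimate and the pull count.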

Example: A simple policy would be to take the greedy choice most of the time but every now and then (with a small probability e), select an action at random. How do we choose e?
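A minimal sketch of this policy on an n-armed bandit follows. Several details are assumptions made for illustration and do not come from the notes: payoffs are drawn as the arm's true value plus unit-variance Gaussian noise, all estimates start at 0, ties are broken in favour of the lowest-numbered arm, and the names `run_bandit`, `true_values`, and `epsilon` (the probability e) are hypothetical.

```python
import random


def run_bandit(true_values, epsilon, plays, seed=0):
    """Play an n-armed bandit with an epsilon-greedy policy.

    Returns the estimated values, pull counts, and total reward collected.
    """
    rng = random.Random(seed)
    n = len(true_values)
    estimates = [0.0] * n  # Qt(a), initialised to 0 (an assumption)
    counts = [0] * n
    total_reward = 0.0

    for _ in range(plays):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: pick any arm uniformly
        else:
            # exploit: pick the arm with the largest current estimate
            arm = max(range(n), key=lambda a: estimates[a])
        # Assumed payoff model: true value plus unit-variance Gaussian noise.
        reward = rng.gauss(true_values[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


true_values = [0.2, 0.8, 0.5]  # hypothetical Q*(a) for three arms
estimates, counts, total = run_bandit(true_values, epsilon=0.1, plays=5000)
```

Re-running with different values of `epsilon` (for example 0, 0.01, and 0.1) and comparing `total` is one simple way to explore the question of how to choose e.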

Components of the Agent

A reinforcement learning agent generally has 4 basic components: a policy, a reward function, a value function, and a model of the environment.


Policy

The policy is the decision-making function of the agent. It specifies what action the agent should take in any of the situations it might encounter. This is the core of the agent. The other components serve only to change and improve the policy.
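In the simplest case a policy can be pictured as a lookup table from situations to actions. The sketch below assumes a toy problem with two states; the state and action names are hypothetical and chosen only for illustration.

```python
# Hypothetical states and actions for a toy two-state problem.
policy = {
    "low_battery": "recharge",
    "high_battery": "search",
}


def choose_action(state):
    """Return the action the policy prescribes for `state`."""
    return policy[state]


action = choose_action("low_battery")
```

Improving the policy then amounts to changing entries in this table, which is what the other components are used for.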

Reward Function

The reward function defines the goal of the RL agent. It maps the state of the environment to a single number, a reward, indicating the intrinsic desirability of the state. The agent's objective is to maximize the total reward it receives in the long run.

Value Function

The value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward the agent can expect to accumulate over the future when starting from that state.

Rewards determine immediate desirability while value indicates the long term desirability.
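The difference can be illustrated with a small worked example. It assumes a finite, deterministic episode with no discounting, and the two reward sequences are invented for illustration.

```python
def total_future_reward(rewards, start):
    """Sum the rewards received from step `start` to the end of the episode."""
    return sum(rewards[start:])


# Hypothetical reward sequences for two courses of action from the same state.
tempting = [5.0, 0.0, 0.0, 0.0]  # large immediate reward, nothing afterwards
patient = [1.0, 3.0, 3.0, 3.0]   # small immediate reward, more later

value_tempting = total_future_reward(tempting, 0)
value_patient = total_future_reward(patient, 0)
```

Here the first sequence has the higher immediate reward (5 versus 1) while the second has the higher total (10 versus 5), so an agent guided only by immediate reward would choose the worse course.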

In analogy to humans, rewards are immediate pleasure (if high reward) or pain (if low), whereas values correspond to a more refined, far-sighted judgement of how pleased or displeased we are that our environment is in a particular state.

Most of the methods we will discuss are centered around forming and improving approximate value functions.


Model of the Environment

The model of the environment or external world should mimic the behavior of the environment. For example, given a situation and action, the model might predict the resultant next state and next reward. The model often takes up the largest storage space. If there are S states and A actions then a complete model will take up a space proportional to S x S x A because it maps state-action pairs to probability distributions over states. By contrast, the reward and value functions might just map states to real numbers and thus be of size S.
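The storage comparison can be made concrete with a tabular sketch. The sizes below are hypothetical, and the uniform transition probabilities are placeholders rather than a learned model.

```python
num_states = 4   # S (hypothetical)
num_actions = 2  # A (hypothetical)

# transition[s][a][s2] = probability of landing in s2 after action a in state s.
# Filled with a uniform distribution purely as a placeholder.
transition = [
    [[1.0 / num_states] * num_states for _ in range(num_actions)]
    for _ in range(num_states)
]

# A state-value table needs only one number per state.
values = [0.0] * num_states

model_entries = sum(len(row) for per_state in transition for row in per_state)
value_entries = len(values)
```

With these sizes the model table holds S x S x A = 32 entries while the value table holds S = 4, and the gap widens quickly as S grows.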
