Reinforcement Learning

Learning with a Critic

In supervised learning we have assumed that there is a target output value for each input. In many situations, however, less detailed information is available. In the extreme case there is only a single bit of information, delivered after a long sequence of inputs, telling us whether the output was right or wrong. Reinforcement learning is one method developed to deal with such situations.

Reinforcement learning (RL) is a kind of supervised learning in that some feedback from the environment is given. However, the feedback signal is only evaluative, not instructive: it tells the network how good its output was, not what the correct output would have been. For this reason, reinforcement learning is often called learning with a critic, as opposed to learning with a teacher.
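
To make the distinction concrete, here is a minimal Python sketch of learning with a critic. The three-action task and the reward function are invented for illustration; the point is that the critic returns only a scalar score, so the learner must discover the best action by trial and error rather than being told the right answer.

    import random

    ACTIONS = [0, 1, 2]

    def critic(action):
        # Evaluative feedback: a scalar "how good" signal only,
        # never the correct answer. (Reward function assumed here.)
        return 1.0 if action == 2 else 0.0

    # Maintain a running value estimate for each action and reinforce
    # whichever actions happen to earn reward.
    values = {a: 0.0 for a in ACTIONS}
    alpha = 0.1  # learning rate

    for _ in range(1000):
        action = random.choice(ACTIONS)   # try something
        reward = critic(action)           # critic scores the attempt
        values[action] += alpha * (reward - values[action])

    # The learner infers that action 2 is best; a teacher would simply
    # have said "the correct output is 2".
    print(max(values, key=values.get))    # -> 2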

Learning from Interaction

Humans learn by interacting with the environment. When a baby plays, it waves its arms around, touches things, tastes things, and so on. There is no explicit teacher, but there is a sensorimotor connection to its environment. Such a connection provides information about cause and effect, the consequences of actions, and what to do to achieve goals.

Learning from interaction with our environment is a fundamental idea underlying most theories of learning.

RL has rich roots in the psychology of animal learning, from where it gets its name.

The growing interest in RL comes in part from the desire to build intelligent systems that must operate in dynamically changing real-world environments. Robotics is a common example.

Environment

In RL, it is common to think explicitly of a network functioning in an environment. The environment supplies inputs to the network, receives the network's output, and then provides a reinforcement signal.

In the most general case, the environment may itself be governed by a complicated dynamical process. Both the reinforcement signals and the input patterns may depend arbitrarily on the past history of the network's output.
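
This interaction is naturally expressed as a loop. Below is a minimal Python sketch of that loop; the Environment class is a stand-in invented for this example, not any particular library's API. It consumes the network's output (an action) and returns the next input pattern together with a reinforcement signal.

    class Environment:
        """A toy environment: consumes an action, emits input and reward."""

        def __init__(self):
            self.state = 0

        def step(self, action):
            # In general the reward and next input may depend on the whole
            # past history of outputs; this toy dynamics depends only on
            # the current state and action.
            reward = 1.0 if action == self.state % 2 else 0.0
            self.state += 1
            return self.state, reward

    env = Environment()
    state = 0
    for t in range(5):
        action = state % 2            # the network's output (a fixed rule here)
        state, reward = env.step(action)
        print(t, action, reward)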

The classic example comes from game theory, where the "environment" is actually another player or players.

Temporal Credit Assignment Problem

A network designed to play chess would receive a reinforcement signal (win or lose) after a long sequence of moves. The question that arises is: How do we assign credit or blame individually to each move in a sequence that leads to an eventual victory or loss?

This is called the temporal credit assignment problem, in contrast with the structural credit assignment problem, in which we must attribute network error to individual weights.
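
One common way to attack the temporal credit assignment problem is to credit each step with the discounted sum of the rewards that followed it, so that earlier moves receive exponentially less credit for a delayed outcome. The sketch below illustrates this idea in Python; it is one standard device, not the only solution, and the chess-like reward sequence is invented.

    def discounted_returns(rewards, gamma=0.9):
        """Return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
        returns = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # A chess-like game: zero reward on every move, +1 only at the end (a win).
    rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
    print(discounted_returns(rewards))
    # -> approximately [0.656, 0.729, 0.81, 0.9, 1.0]: each earlier move
    #    gets a smaller share of the credit for the eventual victory.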

Learning and Planning

So far in this course we have not discussed the issue of planning. The networks we have seen simply learn a direct relationship between an input and an output. RL is our first look at networks that, in some sense, decide on a course of action by considering possible future actions before they are actually experienced.
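
The simplest form of such lookahead is to consult a model of the environment before acting. The Python sketch below is a hypothetical illustration; the model, states, and value estimates are all invented for the example. Instead of mapping the input directly to an output, the agent imagines the consequence of each action and picks the one whose predicted outcome looks best.

    # A hypothetical model: (state, action) -> (next_state, reward).
    model = {
        ("s0", "left"):  ("s1", 0.0),
        ("s0", "right"): ("s2", 0.0),
    }
    # Learned estimates of how good it is to be in each state.
    V = {"s1": 0.2, "s2": 0.8}

    def plan(state, actions, gamma=0.9):
        # One-step lookahead: evaluate each action by its *imagined*
        # consequence, before any action is actually experienced.
        def backed_up_value(a):
            next_state, reward = model[(state, a)]
            return reward + gamma * V[next_state]
        return max(actions, key=backed_up_value)

    print(plan("s0", ["left", "right"]))   # -> "right"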

Related Work

RL is closely related to dynamic programming and optimal control, which address the same kind of sequential decision problem when an explicit model of the environment is available.

Exploration vs Exploitation

RL is learning what to do - how to map situations to actions - so as to maximize a scalar reward signal.

There are two important features: trial-and-error search and delayed reward. The learner is not told which actions to take but must discover which ones yield the most reward by trying them, and an action may affect not just the immediate reward but all the rewards that follow.

There is always a trade-off in RL between exploration and exploitation. To obtain a lot of reward, the agent must exploit actions it has already found to be effective; but to discover such actions in the first place, it must explore actions it has not tried before. A sketch of one standard compromise follows.
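
The epsilon-greedy rule is a standard compromise: with small probability epsilon the agent explores by acting at random, and otherwise it exploits its current value estimates. The Python sketch below applies it to a simple bandit task whose reward probabilities are invented for the example.

    import random

    probs = [0.2, 0.5, 0.8]            # assumed reward probability per action
    Q = [0.0, 0.0, 0.0]                # estimated value of each action
    counts = [0, 0, 0]
    epsilon = 0.1

    for _ in range(10000):
        if random.random() < epsilon:
            a = random.randrange(3)    # explore: try any action
        else:
            a = Q.index(max(Q))        # exploit: act on current knowledge
        r = 1.0 if random.random() < probs[a] else 0.0
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a] # incremental sample average

    print(Q)   # Q[2] should end up close to 0.8, the best action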


