In supervised learning we have assumed that there is a target output value for each input value. However, in many situations less detailed information is available. In the extreme case, only a single bit of information arrives after a long sequence of inputs, telling whether the output was right or wrong. Reinforcement learning is one method developed to deal with such situations.
Reinforcement learning (RL) resembles supervised learning in that some feedback from the environment is given. However, the feedback signal is only evaluative, not instructive: it scores the chosen output but never says what the correct output would have been. Reinforcement learning is therefore often called learning with a critic, as opposed to learning with a teacher.
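The distinction can be made concrete with a small sketch. Here a learner repeatedly chooses among three actions; a critic returns only a score for the chosen action, while a teacher would have named the best action outright. The environment, reward values, exploration rate, and seed are all illustrative assumptions, not part of the original text.

```python
import random

ACTIONS = [0, 1, 2]
REWARD = {0: 0.1, 1: 0.4, 2: 0.9}  # hidden expected payoff of each action

def critic(action):
    """Evaluative feedback: scores the chosen action, nothing more."""
    return REWARD[action]

def teacher(action):
    """Instructive feedback: would simply name the correct action."""
    return max(ACTIONS, key=lambda a: REWARD[a])

# Learning with a critic: estimate each action's value from scores alone,
# using an epsilon-greedy choice (mostly exploit, occasionally explore).
values = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}
random.seed(0)
for _ in range(300):
    if random.random() < 0.1:
        a = random.choice(ACTIONS)            # explore
    else:
        a = max(ACTIONS, key=lambda x: values[x])  # exploit
    r = critic(a)                  # a score, never the right answer
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]  # running average of scores

best = max(ACTIONS, key=lambda x: values[x])
```

The learner ends up preferring the highest-paying action even though it was never told which action that was; it had to discover the ranking through trial and error, which is exactly the extra burden evaluative feedback imposes.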
Humans learn by interacting with the environment. When a baby plays, it waves its arms around, touches things, tastes things, etc. There is no explicit teacher but there is a sensori-motor connection to its environment. Such a connection provides information about cause and effect, the consequence of actions, and what to do to achieve goals.
In the most general case, the environment may itself be governed by a complicated dynamical process. Both reinforcement signals and input patterns may depend arbitrarily on the past history of the network's output.
A network designed to play chess would receive a reinforcement signal (win or lose) only after a long sequence of moves. The question that arises is: how do we assign credit or blame individually to each move in a sequence that leads to an eventual victory or loss? This is known as the temporal credit-assignment problem.
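One common answer, developed later in the RL literature, is to propagate the final reinforcement backwards through the move sequence with a discount factor, so that later moves receive more credit than earlier ones. A minimal sketch, in which the discount value and the episode length are illustrative assumptions:

```python
# Temporal credit assignment by discounted returns: a single
# end-of-game signal is spread backwards over the whole move sequence.
# gamma = 0.9 and the 4-move episode are illustrative choices.

def assign_credit(num_moves, final_reward, gamma=0.9):
    """Give the last move the full reward and each earlier move an
    exponentially discounted share of it."""
    return [final_reward * gamma ** (num_moves - 1 - t)
            for t in range(num_moves)]

credits = assign_credit(num_moves=4, final_reward=1.0)
# The final move receives credit 1.0; moves further from the outcome
# receive geometrically smaller shares.
```

The discount expresses a simple heuristic: moves closer to the outcome are more likely to have caused it. It is a heuristic, not a solution; much of reinforcement learning is about doing this assignment more intelligently.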
So far in this course we have not discussed the issue of planning. The networks we have seen simply learn a direct relationship between an input and an output. RL is our first look at networks that, in some sense, decide on a course of action by considering possible future actions before they are actually experienced.
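This idea of weighing futures before acting can be sketched as one-step lookahead: the learner consults a model of where each action would lead and a learned estimate of how good each resulting state is, then picks the best action without ever executing the alternatives. The tiny state space, transition model, and value table below are illustrative assumptions.

```python
# One-step lookahead: choose an action by evaluating the states each
# action would lead to, before any action is actually taken.
# MODEL and VALUE are toy stand-ins for a learned world model and
# a learned value estimate.

MODEL = {                      # state -> {action: next_state}
    "s0": {"left": "s1", "right": "s2"},
}
VALUE = {"s1": 0.2, "s2": 0.8}  # estimated future reward of each state

def plan(state):
    """Pick the action whose predicted successor state has the
    highest estimated value."""
    options = MODEL[state]
    return max(options, key=lambda a: VALUE[options[a]])

chosen = plan("s0")
```

Nothing here is executed in the environment; the decision is made entirely from predictions, which is the sense in which such a network "considers possible future actions before they are actually experienced."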