Unsupervised Learning

It is possible to use neural networks to learn from data that contains neither target outputs nor class labels. There are many tricks for obtaining error signals in such unsupervised settings; here we briefly discuss three of the most common approaches: autoassociation, time series prediction, and reinforcement learning.


Autoassociation

Autoassociation is based on a simple idea: if you have inputs but no targets, just use the inputs as targets. An autoassociator network thus tries to learn the identity function. This is non-trivial only if the hidden layer forms an information bottleneck - contains fewer units than the input (output) layer - so that the network must perform dimensionality reduction (a form of data compression).

A linear autoassociator trained with sum-squared error in effect performs principal component analysis (PCA), a well-known statistical technique. PCA extracts the subspace (directions) of highest variance from the data. As was the case with regression, the linear neural network offers no direct advantage over known statistical methods, but it does suggest an interesting nonlinear generalization:
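To make the PCA connection concrete, here is a minimal NumPy sketch (the data and sizes are invented for illustration). The top singular vectors of the centred data span the directions of highest variance - the same subspace that a linear autoassociator with a k-unit bottleneck, trained with sum-squared error, converges to:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points in 5 dimensions, with most variance in 2 directions
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))
X -= X.mean(axis=0)                      # centre the data

# PCA: the top-k right singular vectors are the principal directions
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:k]                      # shape (k, 5)

# Project onto the k-dimensional subspace and reconstruct
X_hat = (X @ components.T) @ components
print("reconstruction error:", np.mean((X - X_hat) ** 2))
```

Since most of the variance lies in a 2-dimensional subspace, the reconstruction error after projecting to k = 2 components is close to the noise level.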

[Figure: a nonlinear autoassociator network]

This nonlinear autoassociator includes a hidden layer in both the encoder and the decoder parts of the network. Together with the linear bottleneck layer, this gives a network with at least three hidden layers. Such a deep network should be preconditioned if it is to learn successfully.
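A forward pass through such a network might look as follows; the layer sizes and weight initialization here are arbitrary, chosen only to show the encoder-bottleneck-decoder structure:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [10, 8, 3, 8, 10]            # input, encoder hidden, bottleneck, decoder hidden, output
Ws = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, Ws):
    """Encoder and decoder hidden layers are tanh; bottleneck and output are linear."""
    h1 = np.tanh(x @ Ws[0])          # encoder hidden layer
    z = h1 @ Ws[1]                   # linear bottleneck: the compressed code
    h2 = np.tanh(z @ Ws[2])          # decoder hidden layer
    return h2 @ Ws[3]                # linear output, trained to reproduce x

x = rng.normal(size=(4, 10))         # a batch of 4 input patterns
print(forward(x, Ws).shape)          # (4, 10): same shape as the input
```

Training would backpropagate the sum-squared error between the output and the input itself, exactly as in the linear case.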

Time Series Prediction

When the input data x forms a temporal series, an important task is to predict the next point: the weather tomorrow, the stock market 5 minutes from now, and so on. We can (attempt to) do this with a feedforward network by using time-delay embedding: at time t, we give the network x(t), x(t-1), ... x(t-d) as input, and try to predict x(t+1) at the output. After propagating activity forward to make the prediction, we wait for the actual value of x(t+1) to come in before calculating and backpropagating the error. Like all neural network architecture parameters, the dimension d of the embedding is an important but difficult choice.
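Constructing the training pairs for time-delay embedding is straightforward; here is a possible sketch (the helper name and toy series are our own):

```python
import numpy as np

def embed(x, d):
    """Build (input, target) pairs by time-delay embedding:
    the input at time t is [x(t), x(t-1), ..., x(t-d)], the target is x(t+1)."""
    inputs, targets = [], []
    for t in range(d, len(x) - 1):
        inputs.append(x[t - d : t + 1][::-1])   # x(t) first, x(t-d) last
        targets.append(x[t + 1])
    return np.array(inputs), np.array(targets)

x = np.sin(0.3 * np.arange(100))     # a toy time series
X, y = embed(x, d=4)
print(X.shape, y.shape)              # (95, 5) (95,)
```

Each row of X is one network input of dimension d+1, and the corresponding entry of y is the value the network should predict.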

A more powerful (but also more complicated) way to model a time series is to use recurrent neural networks.

Reinforcement Learning

Sometimes we are faced with the problem of delayed reward: rather than being told the correct answer for each input pattern immediately, we may only occasionally get a positive or negative reinforcement signal to tell us whether the entire sequence of actions leading up to this was good or bad. Reinforcement learning provides ways to get a continuous error signal in such situations.

Q-learning associates an expected utility (the Q-value) with each action possible in a particular state. If at time t we are in state s(t) and decide to perform action a(t), the corresponding Q-value is updated as follows:

Q(s(t), a(t)) = r(t) + gamma max_a Q(s(t+1), a)

where r(t) is the immediate reward resulting from our action, s(t+1) is the state it led to, the maximum ranges over all actions a possible in that state, and gamma <= 1 is a discount factor that makes us prefer immediate over delayed rewards.
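As a concrete illustration, the Q-iteration above can be run on a toy problem. The five-state chain below is invented for this sketch; because the environment is deterministic, the update can be applied exactly as written, with no learning rate:

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9   # actions: 0 = left, 1 = right

def step(s, a):
    """Deterministic chain: reward 1 only for stepping right out of state 3."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if (a == 1 and s == n_states - 2) else 0.0
    return s_next, r

Q = np.zeros((n_states, n_actions))
for _ in range(200):                     # sweep all state/action pairs until values settle
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = step(s, a)
            # the Q-iteration from the text
            Q[s, a] = r + gamma * Q[s_next].max()

print(Q[0].argmax())                     # prints 1: the greedy action in state 0 is "right"
```

After the values converge, the greedy policy (pick the action with the largest Q-value in each state) heads toward the rewarding transition, with Q-values discounted by gamma for each step of delay.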

A common way to implement Q-learning for small problems is to maintain a table of Q-values for all possible state/action pairs. For large problems, however, it is often impossible to keep such a large table in memory, let alone learn its entries in reasonable time. In such cases a neural network can provide a compact approximation of the Q-value function. Such a network takes the state s(t) as its input, and has an output ya for each possible action. To learn the Q-value Q(s(t), a(t)), it uses the right-hand side of the above Q-iteration as a target:

delta_a(t) = r(t) + gamma max_a y_a(t+1) - y_a(t)

Note that since we require the network's outputs at time t+1 in order to calculate its error signal at time t, we must keep a one-step memory of all input and hidden node activity, as well as the most recent action. The error signal is applied only to the output corresponding to that action; all other output nodes receive no error (they are "don't cares").
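This masked error signal might be computed as follows (a sketch; the helper name and numbers are illustrative):

```python
import numpy as np

gamma = 0.9

def q_error(y_t, a_t, r_t, y_t1):
    """Error vector for the output layer: only the output for the action
    actually taken gets an error; all other outputs are "don't cares" (zero)."""
    delta = np.zeros_like(y_t)
    delta[a_t] = r_t + gamma * y_t1.max() - y_t[a_t]
    return delta

y_t = np.array([0.2, 0.5, 0.1])    # remembered outputs from time t
y_t1 = np.array([0.4, 0.3, 0.6])   # outputs at time t+1 (the next state)
print(q_error(y_t, a_t=1, r_t=1.0, y_t1=y_t1))
```

Only the entry for the chosen action a(t) = 1 is nonzero; this delta vector is then backpropagated through the stored activations from time t.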

TD-learning is a variation that assigns utility values to states alone rather than state/action pairs. This means that search must be used to determine the value of the best successor state. TD(lambda) replaces the one-step memory with an exponential average of the network's gradient; this is similar to momentum, and can help speed the transport of delayed reward signals across large temporal distances.

One of the most successful applications of neural networks is TD-Gammon, a network that used TD(lambda) to learn the game of backgammon from scratch, by playing only against itself. TD-Gammon is now the world's strongest backgammon program, and plays at the level of human grandmasters.
