Summary of Linear Nets

Regression
- uses a one-layer linear network (activation function is identity)
- uses MSE cost function
- uses gradient descent learning
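
The three regression ingredients above can be sketched in a few lines. This is only a minimal illustration, assuming a single input plus a bias weight; the function name fit_linear, the learning rate, the epoch count and the toy data are all made up for the example.

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Identity activation: the network output is just the weighted sum.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of E = (1/2n) * sum(error^2) with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (no noise), purely for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
```

On this noise-free data the fitted w and b should end up very close to 2 and 1.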
Classification - Perceptron Learning
- uses a one-layer network with a binary step activation function
- uses MSE cost function
- uses the perceptron learning algorithm (identical with gradient descent when targets are +1 and -1)
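
A small sketch of perceptron learning, assuming targets of +1 and -1 and a bias handled as an extra constant input. The names step, train_perceptron and predict, the learning rate and the AND data set are illustrative choices, not part of the notes.

```python
def step(a):
    """Binary step activation with outputs +1 / -1."""
    return 1 if a >= 0 else -1

def train_perceptron(samples, lr=0.5, max_epochs=100):
    """samples: list of (inputs, target) pairs with targets in {+1, -1}."""
    n_inputs = len(samples[0][0])
    w = [0.0] * (n_inputs + 1)          # last entry is the bias weight
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in samples:
            xb = list(x) + [1.0]        # append the constant bias input
            y = step(sum(wi * xi for wi, xi in zip(w, xb)))
            if y != t:
                # Perceptron rule: weights change only on a mistake,
                # where (t - y) is +2 or -2.
                w = [wi + lr * (t - y) * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:               # every sample classified correctly
            break
    return w

def predict(w, x):
    return step(sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])))

# Logical AND with +1/-1 targets: a linearly separable toy problem.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_data)
```

Because AND is linearly separable, training stops once an epoch passes with no mistakes.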
Classification - Delta Rule
- uses a one-layer network with a linear activation function
- uses MSE cost function
- uses gradient descent
- the network chooses the class by picking the output node with the largest output
Classification - Gradient Descent (the right way)
- uses a one-layer network with a softmax activation function
- uses the cross entropy error function
- outputs are interpreted as probabilities
- the network chooses the class with the highest probability
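
A minimal sketch of the softmax / cross entropy combination, assuming one-of-N targets given as a class index and a bias folded in as a constant third input. The names softmax, train and predict_probs, the learning rate and the three-point data set are hypothetical.

```python
import math

def softmax(z):
    """Turn a list of net inputs into probabilities that sum to 1."""
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_probs(W, x):
    """W[k] holds the weights of output node k."""
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in W])

def train(data, n_classes, lr=0.5, epochs=200):
    n_inputs = len(data[0][0])
    W = [[0.0] * n_inputs for _ in range(n_classes)]
    for _ in range(epochs):
        for x, target in data:
            p = predict_probs(W, x)
            for k in range(n_classes):
                # With softmax outputs and cross entropy error, the error
                # signal at output k is simply p_k - t_k.
                delta = p[k] - (1.0 if k == target else 0.0)
                W[k] = [wi - lr * delta * xi for wi, xi in zip(W[k], x)]
    return W

# Three classes, one toy point each; the last input is a constant bias of 1.
data = [([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1), ([-1.0, -1.0, 1.0], 2)]
W = train(data, n_classes=3)
probs = predict_probs(W, [1.0, 0.0, 1.0])
best = probs.index(max(probs))          # class with the highest probability
```

The outputs in probs sum to 1, so they can be read as class probabilities, and best is the class the network chooses.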

Batch
- At each iteration, the gradient is computed by averaging over all inputs
Online (stochastic)
- At each iteration, the gradient is estimated by picking one input (or a small number of inputs).
- Because the gradient is only being estimated, there is a lot of noise in the weight updates. The error comes down quickly but then tends to jiggle around. To remove this noise, one can either switch to batch at the point where the error levels out, or continue to use online but decrease the learning rate (called annealing the learning rate). One way to anneal is to use m = m₀/t, where m₀ is the original learning rate and t is the number of timesteps after annealing is turned on.
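
The online scheme with annealing might look like the sketch below. It assumes a single weight, one randomly drawn input per update, and a fixed number of constant-rate steps before annealing is turned on; the function online_fit, the noise level and the step counts are invented for the illustration.

```python
import random

def online_fit(n_constant=200, n_anneal=2000, m0=0.5, seed=0):
    """Online gradient descent on one weight, then the same with m = m0/t."""
    rng = random.Random(seed)
    w = 0.0
    history = []
    for i in range(n_constant + n_anneal):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.1)   # noisy target; the true weight is 2
        if i < n_constant:
            m = m0                          # constant rate: fast but jittery
        else:
            t = i - n_constant + 1          # timesteps since annealing began
            m = m0 / t
        # Gradient estimated from this single input only.
        w -= m * (w * x - y) * x
        history.append(w)
    return w, history

w, history = online_fit()
```

Plotting history would show the weight jiggling around 2 during the constant-rate phase and settling down once the rate starts to shrink.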

Learning rates that are too big cause the algorithm to diverge.
Learning rates that are too small cause the algorithm to converge very slowly.
The optimal learning rate for batch learning in linear networks is the inverse Hessian, H^-1, where H is the Hessian, defined as the matrix of second derivatives of the cost function with respect to the weights. Unfortunately, this is a matrix whose inverse can be costly to compute.
More details if you are interested:
- The next best thing is to use a separate learning rate for each weight. If the Hessian is diagonal, these learning rates are just one over the eigenvalues of the Hessian. Fat chance that the Hessian is diagonal though!
- If using a single scalar learning rate, then the best one to use is 1 over the largest eigenvalue of the Hessian. There are fairly inexpensive algorithms for estimating this. However, many people just use the ol' brute force method of picking the learning rate - trial and error.
- For linear networks the Hessian is <x x^T> and is independent of the weights. For nonlinear networks (i.e. any network that has an activation function that isn't the identity), the Hessian depends on the value of the weights and so changes every time the weights are updated - arrgh! That is why people love the trial and error approach.
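
For a linear network, the scalar learning rate suggested above can be sketched as follows. Power iteration is used here as one inexpensive way to estimate the largest eigenvalue; it is not necessarily the algorithm these notes have in mind. The helper names and the toy inputs (a bias of 1 plus one feature) are assumptions.

```python
import math

def hessian(inputs):
    """Average outer product <x x^T> over a list of input vectors."""
    d = len(inputs[0])
    H = [[0.0] * d for _ in range(d)]
    for x in inputs:
        for i in range(d):
            for j in range(d):
                H[i][j] += x[i] * x[j] / len(inputs)
    return H

def largest_eigenvalue(H, iters=200):
    """Power iteration: repeatedly apply H to a vector and renormalise it."""
    v = [1.0] * len(H)
    lam = 0.0
    for _ in range(iters):
        Hv = [sum(hij * vj for hij, vj in zip(row, v)) for row in H]
        lam = math.sqrt(sum(c * c for c in Hv))   # length of H v
        v = [c / lam for c in Hv]
    return lam

# Each input is [bias, feature]; the values are illustrative only.
inputs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
H = hessian(inputs)
lam_max = largest_eigenvalue(H)
learning_rate = 1.0 / lam_max
```

Since H does not depend on the weights here, this only has to be computed once before training.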

For regression, we can only fit a straight line through the data points. Many problems are not linear.
For classification, we can only lay down linear boundaries between classes. This is often inadequate for real-world problems.