- Replace the step function in the network with a continuous (differentiable) activation function, e.g. a linear one.
- For classification problems, use the step function only to determine the class and not to update the weights.
- Note: this is the same algorithm we saw for regression. All that really differs is how the classes are determined.

- Define a cost function that measures how well the network fits the training data, e.g. the sum of squared errors

E = ½ Σi (ti − yi)²   (one output node)

where

ti = desired target value associated with the i-th example

yi = output of the network when the i-th input pattern is presented to the network
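As a minimal sketch, the sum-of-squared-errors cost can be computed directly from the targets and the network outputs (the function name `cost` and the sample values are illustrative, not from the notes):

```python
import numpy as np

def cost(t, y):
    # Sum of squared errors over the training set:
    # E = 1/2 * sum_i (t_i - y_i)^2
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return 0.5 * np.sum((t - y) ** 2)

# illustrative values: targets and network outputs for three examples
print(cost([1, -1, 1], [0.8, -0.5, 1.2]))  # ≈ 0.165
```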

- To train the network, we adjust the weights so as to decrease the cost (this is where we require differentiability): each weight is moved a small step in the direction of the negative gradient of the cost. This is called gradient descent.
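The training loop above can be sketched as follows, assuming a single linear output y = w·x + b and batch updates; the names `train`, `classify`, and the learning rate `eta` are illustrative choices, not fixed by the notes:

```python
import numpy as np

def train(X, t, eta=0.01, epochs=100):
    """Gradient descent on E = 1/2 * sum_i (t_i - y_i)^2 with y_i = w.x_i + b."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        y = X @ w + b            # linear (differentiable) activation
        err = t - y
        w += eta * (X.T @ err)   # -dE/dw = sum_i (t_i - y_i) x_i
        b += eta * err.sum()     # -dE/db = sum_i (t_i - y_i)
    return w, b

def classify(X, w, b):
    # The step function is used only here, to determine the class --
    # never inside the weight update.
    return np.where(X @ w + b >= 0, 1, -1)

# toy 1-D problem with targets +1 / -1
X = np.array([[2.0], [1.5], [-1.0], [-2.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train(X, t, eta=0.05, epochs=200)
print(classify(X, w, b))  # all four examples classified correctly
```

Note that this is exactly the regression update; only the final thresholding step makes it a classifier.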

If there are more than 2 classes we could still use the same network, but instead of a binary target we can let the target take on discrete values. For example, if there are 5 classes, we could have t = 1, 2, 3, 4, 5 or t = -2, -1, 0, 1, 2. It turns out, however, that the network has a much easier time if we have one output per class. We can think of each output node as trying to solve a binary problem (the input is either in the given class or it isn't).
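The one-output-per-class encoding can be sketched as follows (the helper names `one_hot` and `predict_class` are illustrative; choosing the class by the largest output is one common convention):

```python
import numpy as np

def one_hot(labels, n_classes):
    # One output node per class: the target row is 1 at the true
    # class's node and 0 everywhere else.
    T = np.zeros((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def predict_class(Y):
    # Each row of Y holds the network's outputs for one example;
    # pick the node with the largest output as the predicted class.
    return np.argmax(Y, axis=1)

print(one_hot([0, 3, 4], 5))
# rows: [1 0 0 0 0], [0 0 0 1 0], [0 0 0 0 1]
```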