When there are more than two classes, we have so far suggested the following:

- Assign one output node to each class.
- Set the target value of each node to be 1 if it is the correct class and 0 otherwise.
- Use a linear network with a mean squared error function.
- Determine the network class prediction by picking the output node with the largest value.

There are problems with this method. First, there is a disconnect between the definition of the error function and the determination of the class: a minimum error does not necessarily produce the network with the largest number of correct predictions.
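The disconnect can be seen in a small made-up example. The sketch below compares two hypothetical networks on three patterns that all belong to class 0; the output values are invented for illustration, and `mse` and `accuracy` are helper names introduced here, not part of the original method.

```python
def mse(targets, outputs):
    """Mean squared error averaged over all patterns and output nodes."""
    total = 0.0
    count = 0
    for t, y in zip(targets, outputs):
        for ti, yi in zip(t, y):
            total += (ti - yi) ** 2
            count += 1
    return total / count


def accuracy(targets, outputs):
    """Fraction of patterns whose largest output is at the target class."""
    correct = 0
    for t, y in zip(targets, outputs):
        if y.index(max(y)) == t.index(max(t)):
            correct += 1
    return correct / len(targets)


# Three patterns, two classes, one-hot targets (illustrative values only).
targets = [[1, 0], [1, 0], [1, 0]]

# Network A: every pattern is classified correctly, but only barely.
net_a = [[0.51, 0.49], [0.51, 0.49], [0.51, 0.49]]

# Network B: two patterns are perfect, the third is just barely wrong.
net_b = [[1.0, 0.0], [1.0, 0.0], [0.49, 0.51]]

print("A: mse =", mse(targets, net_a), "accuracy =", accuracy(targets, net_a))
print("B: mse =", mse(targets, net_b), "accuracy =", accuracy(targets, net_b))
```

Network B has the smaller mean squared error, yet network A classifies more patterns correctly.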

By varying the above method a little bit we can remove this inconsistency. Let us start by changing the interpretation of the output:

**New Interpretation:** The output y_{i} is interpreted as the probability that i is the correct class. This means that:

- The output of each node must be between 0 and 1
- The sum of the outputs over all nodes must be equal to 1.

How do we achieve this? There are several things to vary.

- We can vary the *activation function*, for example, by using a sigmoid. Sigmoids range continuously between 0 and 1. Is a sigmoid a good choice?
- We can vary the *cost function*. We need not use mean squared error (MSE). What are our other options?

To decide, let's start by thinking about what makes sense intuitively. With a linear network using gradient descent on an MSE function, we found that the weight updates were proportional to the error (t-y). This seems to make sense. If we use a sigmoid activation function, we obtain a more complicated formula:

$$\Delta w_{rs} \propto (t_r - y_r)\, y_r (1 - y_r)\, x_s$$

where w_{rs} is the weight from input x_{s} to output node r. The extra factor y_{r}(1-y_{r}) is the derivative of the sigmoid; see derivatives of activation functions to see where this comes from.

This is not quite what we want. It turns out that there is a better error function/activation function combination that gives us what we want.
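To make the problem concrete: with a sigmoid output and MSE, the derivative of the sigmoid contributes a factor y(1-y) to the update, and that factor shrinks toward zero when the node saturates, even if the error (t-y) is still large. The following is a minimal sketch for a single output node; the function names and the sample net inputs are assumptions chosen for illustration, and constant factors such as the learning rate are omitted.

```python
import math


def sigmoid(net):
    """Logistic sigmoid of the net input."""
    return 1.0 / (1.0 + math.exp(-net))


def sigmoid_mse_update(t, net, x):
    """Update direction (up to a constant) for a sigmoid node with MSE."""
    y = sigmoid(net)
    return (t - y) * y * (1.0 - y) * x


t, x = 1.0, 1.0  # target and input, illustrative values
for net in (0.0, -2.0, -10.0):
    y = sigmoid(net)
    print("net =", net, "error =", t - y, "update =", sigmoid_mse_update(t, net, x))
```

As the net input becomes more negative, the error approaches 1 while the update approaches 0, so a badly wrong node learns very slowly.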

**Cross Entropy** is defined as

$$E = -\sum_{i=1}^{c} t_i \log(y_i)$$

where c is the number of classes (i.e. the number of output nodes), t_{i} is the target and y_{i} is the output of node i.

This equation comes from *information theory* and is often applied when the outputs (y) are interpreted as probabilities. We won't worry about where it comes from, but let's see if it makes sense for certain special cases.

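- Suppose the network is trained perfectly so that the network outputs exactly match the targets. Suppose class 3 is the correct class. This means that the output of node 3 is 1 (i.e. the probability is 1 that 3 is correct) and the outputs of the other nodes are 0 (i.e. the probability is 0 that any class other than 3 is correct). In this case the above equation gives E = 0, as desired: the only term with a nonzero target is t_{3} log(y_{3}) = log(1) = 0.
- Suppose the network gives the same output for every node, y_{i} = 1/c (i.e. y = .5 when there are two classes), so that there is complete uncertainty about which is the correct class. Then E = log(c) whichever class is correct, well above the value of 0 for a perfect prediction.
- Thus, the more uncertain the network is about the correct class, the larger the error E. This is as it should be. (E grows larger still, without bound, if the network is confidently wrong, i.e. if the output for the correct class approaches 0.)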
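The special cases can be checked numerically. This is a minimal sketch, assuming one-hot targets and the usual convention that a term with t_i = 0 contributes nothing; `cross_entropy` is a helper name introduced here, and the third case (a confidently wrong output) is added only for comparison.

```python
import math


def cross_entropy(t, y):
    """E = -sum_i t_i * log(y_i); terms with t_i = 0 are skipped."""
    return sum(-ti * math.log(yi) for ti, yi in zip(t, y) if ti > 0)


t = [0, 0, 1]  # class 3 is the correct class

perfect = [0.0, 0.0, 1.0]          # outputs match the targets exactly
uniform = [1 / 3, 1 / 3, 1 / 3]    # complete uncertainty over three classes
wrong = [0.98, 0.01, 0.01]         # confident in the wrong class

print("perfect:", cross_entropy(t, perfect))
print("uniform:", cross_entropy(t, uniform))
print("wrong:  ", cross_entropy(t, wrong))
```

The error is zero for the perfect output, log(3) for the uniform output, and larger again when the network puts almost no probability on the correct class.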

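**Softmax** is defined as

$$f_i = \frac{e^{\mathrm{net}_i}}{\sum_{j=1}^{c} e^{\mathrm{net}_j}}$$

where f_{i} is the activation function of the i^{th} output node, net_{i} is the net (weighted-sum) input to that node, and c is the number of classes.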
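A minimal sketch of softmax in plain Python follows. Subtracting the largest net input before exponentiating is a common numerical safeguard that is not part of the definition: it leaves the result unchanged because the common factor cancels in the ratio. The input values are arbitrary.

```python
import math


def softmax(nets):
    """Return exp(net_i) / sum_j exp(net_j) for each net input."""
    m = max(nets)  # shift by the max so exp() cannot overflow
    exps = [math.exp(n - m) for n in nets]
    total = sum(exps)
    return [e / total for e in exps]


y = softmax([2.0, 1.0, 0.1])
print(y, "sum =", sum(y))
```

The largest net input receives the largest output, and the outputs can be read directly as class probabilities.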

Note that this has the following good properties:

- each output is always a number between 0 and 1
- the outputs of all c nodes sum to 1
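- when combined with the cross entropy error function, it gives a weight update proportional to (t-y).

To see this, differentiate E with respect to the weight w_{rs} from input x_{s} to output node r:

$$-\frac{\partial E}{\partial w_{rs}} = \sum_{i=1}^{c} t_i \left(\delta_{ir} - y_r\right) x_s$$

where δ_{ir} = 1 if i=r and zero otherwise. Note that if r is the correct class then t_{r} = 1 and the RHS of the above equation reduces to (t_{r}-y_{r})x_{s}. If q!=r is the correct class then t_{r} = 0 and the above also reduces to (t_{r}-y_{r})x_{s}. Thus we have

$$\Delta w_{rs} \propto -\frac{\partial E}{\partial w_{rs}} = (t_r - y_r)\, x_s$$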
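A finite-difference check is one way to confirm that softmax outputs with a cross entropy error give a gradient of the form (t_{r}-y_{r})x_{s}. The sketch below assumes a single-layer network with no bias terms; the weights, input, and target are arbitrary illustrative values, and all function names are introduced here.

```python
import math


def softmax(nets):
    m = max(nets)  # shift for numerical stability
    exps = [math.exp(n - m) for n in nets]
    total = sum(exps)
    return [e / total for e in exps]


def forward(w, x):
    """net_r = sum_s w[r][s] * x[s], then y = softmax(net)."""
    nets = [sum(wrs * xs for wrs, xs in zip(row, x)) for row in w]
    return softmax(nets)


def cross_entropy(t, y):
    return sum(-ti * math.log(yi) for ti, yi in zip(t, y) if ti > 0)


def analytic_neg_grad(w, x, t):
    """-dE/dw[r][s] according to the formula (t_r - y_r) * x_s."""
    y = forward(w, x)
    return [[(tr - yr) * xs for xs in x] for tr, yr in zip(t, y)]


def numeric_neg_grad(w, x, t, eps=1e-6):
    """-dE/dw[r][s] estimated by central differences."""
    grad = []
    for r in range(len(w)):
        row = []
        for s in range(len(x)):
            w_plus = [list(wr) for wr in w]
            w_minus = [list(wr) for wr in w]
            w_plus[r][s] += eps
            w_minus[r][s] -= eps
            e_plus = cross_entropy(t, forward(w_plus, x))
            e_minus = cross_entropy(t, forward(w_minus, x))
            row.append(-(e_plus - e_minus) / (2 * eps))
        grad.append(row)
    return grad


w = [[0.2, -0.5], [0.1, 0.4], [-0.3, 0.3]]  # 3 output nodes, 2 inputs
x = [1.0, 2.0]
t = [0, 1, 0]  # class 2 is correct

analytic = analytic_neg_grad(w, x, t)
numeric = numeric_neg_grad(w, x, t)
max_diff = max(abs(a - n)
               for ra, rn in zip(analytic, numeric)
               for a, n in zip(ra, rn))
print("largest difference:", max_diff)
```

The two gradients should agree up to the small error of the finite-difference approximation.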