Doing Classification Correctly
The Old Way
When there are more than 2 classes, we have so far suggested the following approach:
- Assign one output node to each class.
- Set the target value of each node to be 1 if it is the correct
class and 0 otherwise.
- Use a linear network with a mean squared error function.
- Determine the network class prediction by picking the output node
with the largest value.
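To make this concrete, here is a minimal numpy sketch of the recipe above; the toy data, sizes, and variable names are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 4, 3

X = rng.normal(size=(8, n_features))          # 8 toy examples
labels = rng.integers(0, n_classes, size=8)   # integer class labels
T = np.eye(n_classes)[labels]                 # targets: 1 for the correct class, 0 otherwise

W = np.zeros((n_features, n_classes))
lr = 0.1
for _ in range(100):
    Y = X @ W                                 # linear network: one output node per class
    W += lr * X.T @ (T - Y) / len(X)          # gradient descent on MSE; update ~ (t - y)

pred = np.argmax(X @ W, axis=1)               # prediction: output node with the largest value
print("training accuracy:", np.mean(pred == labels))
```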
There are problems with this method. First, there is a disconnect between
the definition of the error function and the determination of the class: a minimum of the error does not necessarily produce
the network with the largest number of correct predictions.
By varying the above method a little bit we can remove this inconsistency.
Let us start by changing the interpretation of the output:
The New Way
New Interpretation: The output yi is interpreted as the probability that i
is the correct class. This means that:
- The output of each node must be between 0 and 1
- The sum of the outputs over all nodes must be equal to 1.
How do we achieve this? There are several things to vary.
- We can vary the activation function, for example, by using
a sigmoid. Sigmoids range continuously between 0 and 1. Is a sigmoid a good choice?
- We can vary the cost function. We need not use mean squared
error (MSE). What are our other options?
To decide, let's start by thinking about what makes sense intuitively.
With a linear network using gradient descent on an MSE function, we found that the weight updates were proportional
to the error (t-y). This seems to make sense. If we use a sigmoid activation function, we obtain a more complicated update, proportional to

(t-y) y (1-y) x

where the extra factor y(1-y) is the derivative of the sigmoid. See derivatives of activation
functions to see where this comes from.
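To see what this extra factor does, here is a quick numerical check, assuming the standard logistic sigmoid (the values of net are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For sigmoid + MSE, the weight update carries the factor (t - y) y (1 - y).
t = 1.0
for net in [0.0, -2.0, -6.0]:
    y = sigmoid(net)
    print(f"net={net:+.1f}  error={t - y:+.4f}  update factor={(t - y) * y * (1 - y):+.5f}")
```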
This is not quite what we want: when y saturates near 0 or 1, the factor y(1-y) is nearly zero, so the update
almost vanishes even when the error (t-y) is large. It turns out that there is a better error function/activation function
combination that gives us what we want.
Cross Entropy is defined as

E = - Σi ti ln(yi)    (sum over i = 1, ..., c)

where c is the number of classes (i.e. the number of output nodes).
This equation comes from information theory and is often applied
when the outputs (y) are interpreted as probabilities. We won't worry about where it comes from but let's see if
it makes sense for certain special cases.
- Suppose the network is trained perfectly so that the targets exactly
match the network output. Suppose class 3 is chosen. This means that the output of node 3 is 1 (i.e. the probability
is 1 that 3 is correct) and the outputs of the other nodes are 0 (i.e. the probability is 0 that class != 3 is
correct). In this case, do you see that the above expression is 0, as desired? The only nonzero target multiplies ln(1) = 0, and every other term has ti = 0.
- Suppose the network gives an output of y = 0.5 for all of the output nodes,
i.e. there is complete uncertainty about which is the correct class. It turns out that E has a maximum value
in this case.
- Thus, the more uncertain the network is, the larger the error
E. This is as it should be.
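Both special cases are easy to check numerically. Here is a small sketch, assuming natural logarithms (the eps guard against log(0) is ours):

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    # E = - sum_i ti ln(yi); eps avoids log(0) when an output is exactly 0
    return -np.sum(t * np.log(y + eps))

t = np.array([0.0, 0.0, 1.0])                        # class 3 is the correct class

print(cross_entropy(t, np.array([0.0, 0.0, 1.0])))   # perfect output: E = 0
print(cross_entropy(t, np.array([0.1, 0.1, 0.8])))   # fairly confident: E ~ 0.22
print(cross_entropy(t, np.array([1/3, 1/3, 1/3])))   # complete uncertainty: E ~ 1.10
```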
Softmax is defined as

yi = fi(net) = exp(neti) / Σj exp(netj)    (sum over j = 1, ..., c)

where fi is the activation function of the ith
output node and c is the number of classes.
Note that this has the following good properties:
- it is always a number between 0 and 1
- the sum of the outputs over all nodes is equal to 1
- when combined with the cross entropy error function, it gives a weight update proportional to

Δwrs ∝ Σi ti (dir - yr) xs

where dij = 1 if i=j and 0 otherwise (the Kronecker delta). Note that if r is the correct class then tr = 1 and the RHS of the above equation reduces to
(tr-yr)xs. If q != r is the correct class then tr = 0 and the above also
reduces to (tr-yr)xs. Thus, for every output node, the weight update is proportional to (tr-yr)xs, exactly the simple error-proportional form we wanted.
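As a sanity check, here is a sketch that implements softmax, forms the claimed update direction (tr-yr)xs analytically, and compares it against a finite-difference estimate of the cross entropy gradient; the names and toy sizes are ours.

```python
import numpy as np

def softmax(net):
    e = np.exp(net - np.max(net))     # subtract the max for numerical stability
    return e / np.sum(e)

def loss(W, x, t):
    y = softmax(W @ x)                # net input: net_r = sum_s w_rs x_s
    return -np.sum(t * np.log(y))     # cross entropy

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))           # 3 classes, 4 inputs
x = rng.normal(size=4)
t = np.array([0.0, 1.0, 0.0])         # class 2 is correct

# Analytic gradient: dE/dw_rs = (y_r - t_r) x_s, i.e. the update is ~ (t_r - y_r) x_s
y = softmax(W @ x)
analytic = np.outer(y - t, x)

# Central finite differences on the loss, one weight at a time
numeric = np.zeros_like(W)
h = 1e-6
for r in range(W.shape[0]):
    for s in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[r, s] += h
        Wm[r, s] -= h
        numeric[r, s] = (loss(Wp, x, t) - loss(Wm, x, t)) / (2 * h)

print("max |analytic - numeric|:", np.max(np.abs(analytic - numeric)))  # tiny (~1e-9)
```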