Understanding Neural Networks and the Training Process
Training a neural network involves a lot of mathematics, including linear algebra and multivariate calculus, and a lot of computation. The purpose of this short article is to illustrate the concepts involved in training without diving deeply into the details of the mathematics. Neural networks—and, more broadly, most predictive models—perform either classification, regression, or some combination of the two.
The primary difference between these types of models is the nature of the output; the output of a classification model is categorical and the output of a regression model is numerical. Since the internal mechanics of these models are nearly identical, we focus our attention in this article on classification models, though we will point out (when practical) a few of the subtle differences along the way.
What Happens Inside a Neural Network?
A neural network is a sequence of one or more functions, often referred to as layers. The functions of a neural network contain learnable parameters—these are values or variables that change as the model is trained to improve the model's output. Broadly speaking, the functions inside a neural network serve one of two purposes:
- Score the data
- Transform the data such that the scoring function(s) perform better
Scoring Data Using Projections
The final layer of a neural network usually contains one or more learned vectors. For a classification model, there will be one vector representing each category into which the model will classify data; the vectors are described internally using learnable parameters.
A vector has both magnitude (length) and direction. Figure 1 shows an example vector representing a "dog" class along with two example data points. To score the data against this dog vector, we project each data point onto the dog vector and measure the length of that projection.
Geometrically, the projection is found by locating the point on the line defined by the dog vector that is closest to the data point. The length of the projection is then the distance between the origin and that point. The numbers 1 and 2 in Figure 1 represent the lengths of the two projections.
Figure 1: These two data points are scored against a candidate "dog" vector by measuring the length of their projection onto the vector. The closer the data points are to the line (direction of the vector) and the farther away from the origin the points are, the larger their "dog" score.
It can happen that the projection of a data point lands on the line defined by a vector, but on the opposite side of the origin from the vector itself. This is illustrated in Figure 2, where the left data point projects onto the "not cat" side of the "cat" vector. When this happens, the "length" computed for the score is assigned a negative value.
Figure 2: The same two data points from Figure 1 are, in this figure, scored against a different vector following the same procedure of measuring a projection. This example illustrates that a projection can have a negative length; this happens when the data point's direction from the origin is closer to the negative (opposite) of the vector being used for scoring.
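The projection scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's implementation: the function name `projection_score` is hypothetical, the point coordinates are assumptions chosen so that the scores match the values quoted in the text, and the class vectors use the example values the article gives for Figure 4.

```python
import numpy as np

def projection_score(point, class_vector):
    """Signed length of the projection of `point` onto the line through `class_vector`."""
    unit = class_vector / np.linalg.norm(class_vector)  # keep only the direction
    return float(np.dot(point, unit))                   # negative on the opposite side

# Hypothetical coordinates, picked so the scores match those quoted in the text.
s = np.sqrt(2.0)
left_point = np.array([-1.5 * s, 0.5 * s])
right_point = np.array([0.5 * s, 1.5 * s])

# Example class vectors (the values suggested for Figure 4).
dog_vector = np.array([-3.0, 3.0])
cat_vector = np.array([3.0, 3.0])

print(projection_score(left_point, dog_vector))   # approximately 2
print(projection_score(right_point, dog_vector))  # approximately 1
print(projection_score(left_point, cat_vector))   # approximately -1
print(projection_score(right_point, cat_vector))  # approximately 2
```

Dividing by the vector's length isolates the geometric projection; a negative result corresponds to a point that projects onto the "not cat" (or "not dog") side.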
In Figures 1 and 2 we illustrated the case of a single-class classification problem. In those cases, the final layer is scoring the data (or transformed data) against a single vector. The user will determine a score threshold that indicates "dog" (as in Figure 1) or "cat" (as in Figure 2), i.e. a score above the threshold means "dog," and a score below the threshold means "not a dog."
In these cases (single-class classification), a larger score should be interpreted to mean a higher confidence in the output; for example, in Figure 1 the "dog" scores of 2 and 1 should be interpreted as the model being more confident that the left data point (score of 2) is a dog than the other point. Note that without knowing the threshold, we don't know whether a score of 2 is high enough (over the threshold) to conclude that the data point is a dog.
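As a small sketch of the thresholding step, the snippet below applies several thresholds to the two dog scores from Figure 1. The function name `classify_dog` and the threshold values are hypothetical; the article does not prescribe a particular threshold.

```python
def classify_dog(score, threshold):
    """Label a point "dog" when its dog score exceeds the chosen threshold."""
    return "dog" if score > threshold else "not dog"

dog_scores = {"left point": 2.0, "right point": 1.0}  # scores quoted in the text

for threshold in (0.5, 1.5, 2.5):                     # hypothetical thresholds
    labels = {name: classify_dog(s, threshold) for name, s in dog_scores.items()}
    print(threshold, labels)
```

The same score of 2 is labeled "dog" under the first two thresholds and "not dog" under the third, which is why the score alone does not settle the classification.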
For multi-class classification (below) the interpretation of those scores will vary somewhat; and, for regression models, this score is simply the numerical output of the model, the interpretation of which is left to the practitioner.
Figure 3 shows an example of multi-class classification. The key difference now is that we have multiple vectors against which to score each data point. Individually the two scores are computed exactly as they would have been in the single-class classification example. Specifically, both points have the same dog score as before and the same cat score as before; the difference is that now, each data point has both scores.
For each data point, the largest score associated with it is interpreted as the most likely class for that item, e.g. for the right data point, we interpret the cat score 2 and dog score 1 to mean that the model believes it is more likely that this point is a cat than a dog. Similarly, for the left data point, we interpret the cat score -1 and dog score 2 to mean that it is more likely that the data point is a dog than a cat.
Figure 3: When determining how to classify data points, we score each data point against each class vector; each data point is classified according to the largest score it receives. In this example, we see that one data point (to the right) would be classified as a cat and the other data point (to the left) would be classified as a dog.
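The "largest score wins" rule can be written as an argmax over each data point's scores. The sketch below uses the cat and dog scores quoted in the text; the array layout and variable names are illustrative assumptions.

```python
import numpy as np

class_names = ["cat", "dog"]

# Rows are data points; columns follow the order of class_names.
scores = np.array([
    [-1.0, 2.0],  # left data point: cat score -1, dog score 2
    [2.0, 1.0],   # right data point: cat score 2, dog score 1
])

# Pick, for each row, the class with the largest score.
predicted = [class_names[i] for i in np.argmax(scores, axis=1)]
print(predicted)  # ['dog', 'cat']
```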
A word of caution: it may be tempting to try to compare the likelihood of the right data point being a cat to the likelihood of the left data point being a dog; after all, the difference in scores is much greater for the point on the left. Unfortunately, such an interpretation (though somewhat common) is problematic from both technical (what functions are actually being applied and what do they mean) and philosophical (what does it mean to be a probability or likelihood) perspectives; that discussion is beyond the scope of this post.
The Anatomy of a Single Layer Neural Network
Figure 4 illustrates the process of a simple, single-layer network. The input features (x, y) are used to determine a cat score and a dog score. Note that the parameters a, b, c, and d are learned parameters—values that vary throughout training. In the illustrations provided in Figures 1, 2, and 3 above, it could be the case that the cat vector is represented by <a=3, b=3> and the dog vector is represented by <c=-3, d=3>.
Figure 4: This is an illustration of a simple linear model. It computes two scores (cat score and dog score) by measuring projections (as described in Figures 1 and 2). When the input data point is (x, y), the value ax + by is the dot product of that point with the vector <a, b>; it equals the signed length of the projection of the point onto <a, b> multiplied by the length of <a, b>, so it matches the projection length exactly when <a, b> has length 1. Note that a, b, c, and d are learned parameters; the purpose of training is to find the best possible (or good enough) values for these parameters.
Note that in this simple example, the dimension of the input is two (x and y) and the dimension of the output is also two (cat score and dog score). Those dimensions are simply for convenience in drawing the example. A real network may have (essentially) any number of input dimensions, each dimension being either a characteristic of the example, e.g. height or weight, or a pixel value if the data is an image, etc. And it could have any number of output dimensions, typically one for each category the model could predict.
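One way to picture the single-layer network of Figure 4 is as a matrix whose rows are the class vectors, so that both scores come from one matrix-vector product. This is a sketch under assumptions: the matrix name `W`, the function name `single_layer`, and the input point are hypothetical, and the parameter values are the example values mentioned in the text.

```python
import numpy as np

# Each row is one class vector: <a, b> for cat and <c, d> for dog.
W = np.array([
    [3.0, 3.0],   # cat: a, b
    [-3.0, 3.0],  # dog: c, d
])

def single_layer(point):
    """Return [cat score, dog score] = [a*x + b*y, c*x + d*y]."""
    return W @ point

point = np.array([1.0, 2.0])         # hypothetical input (x, y)
print(single_layer(point).tolist())  # [9.0, 3.0]
```

Adding more rows to `W` adds more output categories, and adding more columns accepts more input features, which mirrors the remark that the dimensions are not limited to two.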
When Is a Single Layer Enough?
So far, we have illustrated what happens in the last layer of a neural network (or the only layer in a single-layer network). The question remains, when does such a simple model suffice?
Figure 5 shows an example dataset containing examples of dogs and cats (orange and purple). Note that the vertical line of the y-axis perfectly separates the two categories of data (actually, there are many lines which perfectly separate the two categories).
When a single, straight line can separate the examples of two categories in a dataset we call the dataset linearly separable. When data is linearly separable a single-layer network (as illustrated above, possibly with a bias term not shown) is sufficient to accurately differentiate the categories. For datasets with more than two categories, the exact criterion (for success with a single layer network) is different, though the intuition remains the same: when data points from the various categories are internally grouped and separated from the other categories, then single-layer models are sufficient.
Figure 5: This example shows data from two classes with the color corresponding to the human provided label (ground truth). This dataset is called linearly separable because one can find a (straight) line which perfectly separates the two classes of data.
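To make linear separability concrete, the sketch below builds a small hypothetical dataset in which the y-axis (the line x = 0) separates the two classes, and classifies each point by the side of the line it falls on. The coordinates and the assignment of classes to sides are assumptions, not the data from Figure 5.

```python
import numpy as np

# Hypothetical points: cats have negative x, dogs have positive x.
points = np.array([[-2.0, 1.0], [-1.0, -1.5], [-0.5, 2.0],
                   [0.5, -1.0], [1.5, 0.5], [2.0, 2.0]])
labels = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

# The y-axis is the line x = 0, so the side of the line is the sign of x.
predicted = np.where(points[:, 0] > 0, "dog", "cat")

accuracy = float(np.mean(predicted == labels))
print(accuracy)  # 1.0
```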
Training a Single-Layer Network
A single-layer network has one objective: to learn (find) the best (or good enough) vector to represent each category of the dataset. Figure 4 gave an example of a simple network learning two vectors, a cat vector (represented by <a, b>) and a dog vector (represented by <c, d>). When training starts, these four values (a, b, c, and d) are chosen randomly. These random initial vectors are illustrated in Figure 6 as the lightest colored orange and purple vectors.
Each step of training uses information about the dataset and the current vector representation to determine (mathematically) how to adjust the vectors to improve their performance. This iterative process is illustrated in Figure 6 as the vectors move towards the final examples (longest, darkest vectors). Training may either happen for a fixed number of (small) updates to the vectors, or it may happen until the update to the vectors is small enough that we consider them to no longer be updating.
Figure 6: This illustrates the iterative process of learning (finding) better vectors by which to score the data. Training starts with the lightest orange and purple vectors, which are chosen at random. Throughout the training process the direction and length of the vectors are gradually updated, ideally improving the outcome with each step taken.
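The iterative process can be sketched as a short training loop. The article does not specify the update rule, so this example assumes gradient descent on a softmax cross-entropy loss, one common choice; the dataset, learning rate, and number of steps are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearly separable data: class 0 ("cat") left of the y-axis,
# class 1 ("dog") right of it.
X = np.array([[-2.0, 1.0], [-1.0, -1.5], [-0.5, 2.0],
              [0.5, -1.0], [1.5, 0.5], [2.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
one_hot = np.eye(2)[y]

# Rows are the cat vector <a, b> and the dog vector <c, d>, chosen at random.
W = rng.normal(size=(2, 2))
learning_rate = 0.5  # assumed step size

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

for step in range(500):
    scores = X @ W.T                         # cat and dog score for every point
    probs = softmax(scores)
    grad = (probs - one_hot).T @ X / len(X)  # gradient of the mean cross-entropy
    W -= learning_rate * grad                # small update to both vectors

predicted = np.argmax(X @ W.T, axis=1)
accuracy = float((predicted == y).mean())
print(accuracy)
```

This loop runs for a fixed number of updates; the other stopping rule mentioned above would instead exit once the size of `learning_rate * grad` falls below a chosen tolerance.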
What Happens When a Dataset Isn't Linearly Separable?
Most (real) datasets are not as cleanly separable as the dataset in Figure 5. An example of such a dataset is shown in Figure 7, where it is clear that no single line can be drawn to separate all the orange points from the purple ones. A single-layer, linear model (as previously described) will not perform very well on this dataset, e.g. it may be 100% accurate on the purple points but at the cost of being at most 50% accurate on the orange points (other tradeoffs are possible, but not better ones). In order to make a model more expressive (more flexible) we need to add some complexity. For neural networks that complexity comes in two ways.
Figure 7: This is an example of data which is not linearly separable: no single (straight) line drawn could perfectly separate (all) the orange data points from (all) the purple data points. The fact that this dataset is not linearly separable suggests that a simple linear model (e.g. as in Figure 6) will not perform very well on this data.
First, we add layers to the model. Figure 8 illustrates a somewhat more complex neural network. The essential idea of additional layers is that you can use the output (scores) from one layer as inputs to subsequent layers. In the literature, people usually call these intermediate outputs features (or feature scores), so that the vectors the neural network is learning represent features which may be present in the data. The model then learns to detect these features and make determinations about the data based on these derived features, instead of solely on the human generated/measured/engineered features originally used as input. With additional layers in our neural network, two things now happen:
- The initial layers (essentially all layers except for the final layer) transform the raw input data into useful features.
- The final layer learns vector representations of the output based on the transformed data (or, model engineered features).
Figure 8: This illustrates a simple two-layer model with a non-linear activation between the layers. The first layer takes the input data and transforms it; the non-linear activation adds a needed element of non-linearity between layers; and the final layer scores the hidden features in the same manner as the simple, single-layer model scores the raw data.
Second, we must add a non-linear function (layer) to the model. Without diving into mathematical proof, if we stack a sequence of linear layers onto a network (without any non-linear layers in-between), the result is mathematically equivalent to a single linear layer and we are back to where we started (attempting to separate the data with a single line).
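The claim that stacked linear layers collapse into one can be checked numerically: applying two weight matrices in sequence gives the same result as applying their product once. The matrices and input below are random, hypothetical values, and the layers are assumed to have no bias terms.

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(3, 2))  # first linear layer: 2 inputs -> 3 hidden values
W2 = rng.normal(size=(2, 3))  # second linear layer: 3 hidden values -> 2 scores
x = rng.normal(size=2)        # an arbitrary input point

two_layers = W2 @ (W1 @ x)    # pass through both layers in sequence
one_layer = (W2 @ W1) @ x     # a single equivalent 2x2 linear layer

print(np.allclose(two_layers, one_layer))  # True
```

The equality is the associativity of matrix multiplication, so it holds for any choice of weights, not only these.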
Figure 8 contains a non-linear activation between the two layers. An activation layer applies a simple function to each input value independently. For example, a common activation function is ReLU (rectified linear unit). The output of a ReLU activation is 0 if the input is negative or zero; otherwise the output is the same as the input. There are many variations of these activation functions, the details of which are out of scope of this article.
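ReLU is short enough to write out directly. The sketch below applies it elementwise to a few hypothetical values.

```python
import numpy as np

def relu(values):
    """Replace negative entries with 0 and leave the rest unchanged."""
    return np.maximum(values, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])).tolist())  # [0.0, 0.0, 0.0, 1.5]
```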
Using the initial transformations shown in Figure 8 (the first layer with a non-linear activation layer), we can transform the input data shown in Figure 7, which is not linearly separable, into the linearly separable dataset shown below in Figure 9. After being transformed in this manner (so that the results are linearly separable), a single additional layer (final layer) is sufficient to complete the classification accurately.
Figure 9: This dataset is a transformed version of the dataset in Figure 7. It has passed through a single linear layer and a non-linear activation function. These resulting hidden features are clearly linearly separable. Since these hidden features are linearly separable, applying a simple linear model to these hidden features can be expected to perform very well. Note that the two-layer example in Figure 8 includes the simple linear model as the final (second) step immediately following the hidden features.
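The effect of the first layer plus activation can be sketched with a small example. Everything here is an assumption made for illustration: the dataset (purple points near x = 0, orange points on both sides) only resembles the situation described for Figure 7, and the weights are hand-picked rather than learned.

```python
import numpy as np

def relu(values):
    return np.maximum(values, 0.0)

# Hypothetical data: orange points lie on both sides of the purple points,
# so no single straight line separates the two colors in the raw inputs.
X = np.array([[-3.0, 0.5], [-2.5, -1.0], [2.5, 1.0], [3.0, -0.5],  # orange
              [-0.5, 1.0], [0.0, -1.0], [0.5, 0.5]])               # purple
is_orange = np.array([True, True, True, True, False, False, False])

# Hand-picked first layer: the hidden features are relu(x) and relu(-x).
W1 = np.array([[1.0, 0.0],
               [-1.0, 0.0]])
hidden = relu(X @ W1.T)

# In hidden-feature space the straight line h1 + h2 = 1.5 separates the colors,
# so a single final vector (plus a bias term) is enough.
final_vector = np.array([1.0, 1.0])
bias = -1.5
predicted_orange = hidden @ final_vector + bias > 0

accuracy = float((predicted_orange == is_orange).mean())
print(accuracy)  # 1.0
```

The two hidden features sum to the absolute value of x, which folds both orange groups onto the same side of a line; without the ReLU, the hidden features would be x and -x and the fold would not happen.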
Neural networks contain a sequence of one or more functions with learnable parameters. Layers inside the neural network either transform the data into machine interpretable features or use those features to classify the input (or regress it, etc). During training, the learned parameters change in response to the training data to improve these transformations and the subsequent predictions derived from them.
The architecture of a neural network (the number, size, shape, and composition of the layers) will determine the network's expressive power and limitations. Finding a suitable architecture for any given dataset requires some amount of experimentation, evaluation, and an acceptance criterion. We will discuss finding suitable architectures in detail in a future post.