Artificial Neural Network | Statistical Consulting Group

Artificial Neural Networks are methods of classification and/or regression meant to emulate our understanding of how a brain or nervous system functions. The network is made up of nodes, or neurons, each of which performs a calculation on its input values. If the resulting value meets some condition, the neuron fires.

Network topology refers to the structure of the network: how many neurons there are, how they are arranged, and so on. The artificial neural network I will describe here is also called a Multi-Layer Perceptron. It is a layered network of perceptrons.

The Perceptron is a mathematical construct. Within the perceptron, each input value is multiplied by a weight, and the weighted values are summed. If that sum is greater than or equal to a cutoff threshold, the perceptron fires (outputs a 1). Otherwise, the perceptron doesn’t fire (outputs a 0).

$\begin{equation*} Y = \sum_{i \in 1:p}w_iX_i\hspace{1cm}\text{Output } = I(Y\geq c) = \begin{cases} 1 & Y \geq c\\ 0 & Y < c\\ \end{cases} \end{equation*}$
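The formula above translates almost directly into code. Here is a minimal sketch; the function name `perceptron` and the example weights are mine, chosen only for illustration.

```python
def perceptron(inputs, weights, cutoff):
    """Return 1 if the weighted sum of the inputs reaches the cutoff, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= cutoff else 0


# Illustrative weights: with a cutoff of 1.0, this perceptron only fires
# when both inputs are 1, so it behaves like a logical AND.
and_weights = [0.6, 0.6]
for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, perceptron(pair, and_weights, cutoff=1.0))
```

Changing the weights or the cutoff changes which input patterns make the perceptron fire, which is exactly what training will adjust later on.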

For the first layer, the inputs to the perceptrons are the values coming from the input variables. At every layer, there is also a node with no inputs (a bias node) that sends a constant output to each of the nodes in that layer. For the second layer, and every layer thereafter, the neurons in the previous layer that fired will output a 1 to the next layer, while those that didn’t will output a 0.
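To make the layering concrete, here is a small sketch of values passing through a network of perceptrons. It rests on assumptions of mine: the bias node always outputs 1, the cutoff is fixed at 0 so that each node's bias weight plays the role of the threshold, and the weights are hand-picked rather than trained. The names `layer_output`, `network_output`, and `xor_layers` are hypothetical.

```python
def step(weighted_sum):
    """Fire (1) when the weighted sum is at least 0, otherwise stay quiet (0)."""
    return 1 if weighted_sum >= 0 else 0


def layer_output(inputs, layer_weights):
    """Compute the 0/1 output of every node in one layer."""
    # The bias node always outputs 1, so each node's first weight multiplies 1.
    with_bias = [1] + list(inputs)
    return [step(sum(w * v for w, v in zip(node_weights, with_bias)))
            for node_weights in layer_weights]


def network_output(inputs, layers):
    """Feed the inputs through each layer in turn."""
    values = inputs
    for layer_weights in layers:
        values = layer_output(values, layer_weights)
    return values


# Hand-picked weights, one list per node: [bias weight, weight 1, weight 2].
xor_layers = [
    [[-0.5, 1, 1],    # fires if at least one input is 1 (OR)
     [1.5, -1, -1]],  # fires unless both inputs are 1 (NAND)
    [[-1.5, 1, 1]],   # fires only if both nodes above fired (AND)
]

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, network_output(pair, xor_layers))
```

With these weights the network fires only when exactly one input is 1, a pattern that no single perceptron can produce on its own.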

In reality, because training needs an activation function that is differentiable, we don’t use this indicator (step) function but rather a sigmoid function. A sigmoid function closely approximates the step function, but it is a smooth, continuous curve at all points rather than jumping at the activation threshold.
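A quick comparison shows how the two activations differ. This sketch uses the logistic function as the sigmoid; the `steepness` parameter is my own addition, there only to show that a steeper sigmoid looks more and more like the hard threshold.

```python
import math


def step(y, cutoff=0.0):
    """Hard threshold: jumps from 0 to 1 at the cutoff."""
    return 1.0 if y >= cutoff else 0.0


def sigmoid(y, cutoff=0.0, steepness=1.0):
    """Logistic curve centred on the cutoff; smooth everywhere."""
    return 1.0 / (1.0 + math.exp(-steepness * (y - cutoff)))


def sigmoid_slope(y, cutoff=0.0, steepness=1.0):
    """Derivative of the sigmoid, which is what training relies on."""
    s = sigmoid(y, cutoff, steepness)
    return steepness * s * (1.0 - s)


for y in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(y, step(y), round(sigmoid(y), 3), round(sigmoid(y, steepness=10.0), 3))
```

The step function has a slope of zero everywhere except at the jump, where it has no slope at all, so it gives a training algorithm nothing to follow. The sigmoid has a usable slope at every point.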

Training the ANN presents an interesting challenge. You create a cost function based on how you want the network to perform (correct classification), so that the cost increases the further the network's output is from perfect classification. In principle, a single hidden layer with enough neurons can approximate a very wide range of decision boundaries, so in many cases one hidden layer will be sufficient to get a good solution to a classification problem.
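As an illustration of such a cost function, here is a sketch using the mean squared error between the network's outputs and the true 0/1 labels. The choice of squared error is my assumption, since any function that grows with the distance from perfect classification would fit the description, and the name `squared_error_cost` is hypothetical.

```python
def squared_error_cost(predictions, targets):
    """Half the mean squared difference between predictions and true labels."""
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return total / (2 * len(targets))


labels = [1, 0, 1, 0]
print(squared_error_cost([1.0, 0.0, 1.0, 0.0], labels))  # perfect classification
print(squared_error_cost([0.9, 0.2, 0.8, 0.1], labels))  # close to the labels
print(squared_error_cost([0.4, 0.6, 0.5, 0.7], labels))  # far from the labels
```

The cost is zero only when every output matches its label, and it rises as the outputs drift away from the labels.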

Training the network means updating the various input weights to change whether the neurons fire or don’t fire. A common algorithm to do this is called the “Backpropagation” algorithm. For a given set of data, it iteratively updates the weights to minimize the cost function. Each weight is an input variable to that minimization problem. So if we have a problem with 10 variables, 10 input nodes, 5 hidden nodes, and 1 output node, and every node has one weight per incoming value plus one for the bias node, then we’re trying to find the global minimum of a surface in a $(10 \times 11 + 5 \times 11 + 1 \times 6) = 171$ dimensional space.
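To give a feel for what the weight updates look like, here is a minimal sketch of this kind of training for a network with one hidden layer. It assumes sigmoid activations, a squared-error cost, and full-batch gradient descent; none of those choices come from a particular library, and the names `forward`, `cost`, and `train` are mine.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def forward(x, hidden_w, output_w):
    """Return (hidden activations, output). Index 0 of each weight list is the bias weight."""
    hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in hidden_w]
    out = sigmoid(output_w[0] + sum(wi * hi for wi, hi in zip(output_w[1:], hidden)))
    return hidden, out


def cost(data, hidden_w, output_w):
    """Half the mean squared error over the data set."""
    return sum((forward(x, hidden_w, output_w)[1] - t) ** 2 for x, t in data) / (2 * len(data))


def train(data, n_hidden=3, learning_rate=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    hidden_w = [[rng.gauss(0, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output_w = [rng.gauss(0, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        grad_hidden = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]
        grad_output = [0.0] * (n_hidden + 1)
        for x, target in data:
            hidden, out = forward(x, hidden_w, output_w)
            # Error signal at the output node (chain rule through the sigmoid).
            delta_out = (out - target) * out * (1 - out)
            grad_output[0] += delta_out
            for j, h in enumerate(hidden):
                grad_output[j + 1] += delta_out * h
                # Pass the error signal backwards to hidden node j.
                delta_h = delta_out * output_w[j + 1] * h * (1 - h)
                grad_hidden[j][0] += delta_h
                for i, xi in enumerate(x):
                    grad_hidden[j][i + 1] += delta_h * xi
        # Move every weight a small step against its averaged gradient.
        scale = learning_rate / len(data)
        output_w = [w - scale * g for w, g in zip(output_w, grad_output)]
        hidden_w = [[w - scale * g for w, g in zip(ws, gs)]
                    for ws, gs in zip(hidden_w, grad_hidden)]
    return hidden_w, output_w


xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
untrained = train(xor_data, epochs=0)
trained = train(xor_data)
print("cost before training:", cost(xor_data, *untrained))
print("cost after training: ", cost(xor_data, *trained))
```

The error at the output is computed first and then passed backwards to the hidden nodes, which is where the algorithm gets its name. How low the final cost gets depends on the random starting weights, so a different seed can give a noticeably different result.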

The initial weights for the untrained neural network are typically drawn at random from a standard normal distribution. Because the weights all start out on the same scale, it is important to normalize any numerical input variables so that differences in their scales will not affect the output of the network.
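Both steps are short in code. The sketch below assumes that "normalize" means z-score standardization (subtract the mean, divide by the standard deviation), which is one common choice among several; `standardize` and `init_weights` are hypothetical helper names, and the example columns are made up.

```python
import random
import statistics


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / sd for value in column]


def init_weights(n_weights, seed=None):
    """Draw starting weights from a standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_weights)]


# Made-up columns on very different scales.
income = [32000, 45000, 58000, 120000, 76000]
age = [23, 35, 41, 62, 50]
print(standardize(income))
print(standardize(age))
print(init_weights(11, seed=1))
```

After standardizing, a one-unit change means the same thing for income as it does for age, so neither variable swamps the other just because of its units.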

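Learning Rate is a value that determines how large a step the weights take at each update, and therefore how fast the neural network trains. A larger value here will cause the network to learn quickly, but it may bounce around the optimal point without reaching it. A smaller value is more stable but trains more slowly.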
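The tradeoff is easy to see on a toy problem. This sketch runs plain gradient descent on the one-weight cost f(w) = w², whose minimum is at w = 0; the cost function and the three learning rates are picked purely for illustration.

```python
def descend(learning_rate, start=1.0, steps=50):
    """Run gradient descent on f(w) = w**2 and return every value of w visited."""
    w = start
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w = w - learning_rate * gradient
        path.append(w)
    return path


slow = descend(0.01)     # creeps toward the minimum at w = 0
steady = descend(0.1)    # gets there much sooner
bouncing = descend(1.0)  # jumps between w = 1 and w = -1 forever

print("learning rate 0.01:", slow[-1])
print("learning rate 0.1: ", steady[-1])
print("learning rate 1.0: ", bouncing[-5:])
```

With the largest learning rate, every step overshoots the minimum by exactly as much as it started away from it, so the weight never settles.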

Momentum is so named because it is similar to momentum in real life. Imagine, if you will, that the current trained model is a ball rolling down into a valley on that cost-function surface in the high-dimensional weight space. Without momentum, if it reaches a local minimum, it will stay there. With sufficient momentum, if it reaches a local minimum, it can climb back up the hill and hopefully continue on to the global minimum. In reality, we can rarely be sure we have found the global minimum; we just hope the local minimum we do find is comparable to it. Of course, having too much momentum will bounce the ball out of the valley and cause us to miss the global minimum. So, as with the learning rate, this is another tradeoff.
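The ball-in-a-valley picture can be tried out in one dimension. In the sketch below, the toy cost f(x) = x⁴/4 − x²/2 − 0.3x is my own invention: it has a shallow valley near x = −0.79 and a deeper one near x = 1.13. The update rule is the classical "heavy ball" form of momentum, which is an assumption on my part about which variant is meant.

```python
def toy_cost(x):
    """A made-up cost curve with a shallow valley and a deeper one."""
    return x ** 4 / 4 - x ** 2 / 2 - 0.3 * x


def gradient(x):
    """Derivative of toy_cost."""
    return x ** 3 - x - 0.3


def descend(momentum, learning_rate=0.01, start=-2.0, steps=2000):
    """Gradient descent where each step keeps a fraction of the previous step."""
    x, velocity = start, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * gradient(x)
        x += velocity
    return x


without_momentum = descend(momentum=0.0)
with_momentum = descend(momentum=0.9)

print("no momentum:  x =", without_momentum, "cost =", toy_cost(without_momentum))
print("momentum 0.9: x =", with_momentum, "cost =", toy_cost(with_momentum))
```

On this particular curve, the run without momentum should settle in the shallow valley it reaches first, while the run with momentum should carry over the hump and end up in the deeper one. That outcome depends on the starting point and the settings chosen here; other values can just as easily overshoot or stall.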
