Neural networks are a class of simple, yet effective, computing systems with a diverse range of applications. These systems comprise many small, efficient computational units organized into large, interconnected networks capable of carrying out complex calculations. These pages provide an overview of core concepts for neural network design, as well as a brief introduction to the optimization procedures used to train these networks.

Additional resources are available on the Stanford CS231n course website and in the freely available book Deep Learning by Goodfellow, Bengio, and Courville.

Gradient descent provides a simple, iterative algorithm for numerically finding local minima of a real-valued function \(F\). The main idea is relatively straightforward: compute the gradient of the function we want to minimize and take a step in the direction of steepest descent, i.e. \(-\nabla F(x)\). The iteration step of the algorithm is defined in terms of a step size parameter \(\alpha\) (or a decreasing sequence \(\{\alpha_i\}\)) via:

\[ x_{i+1} \, \, = \, \, x_{i} \, - \, \alpha\cdot\nabla F(x_{i}) \]
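The update rule above can be sketched in a few lines of Python (a minimal illustration, assuming the gradient is available as a function; the names `gradient_descent` and `grad_F` are purely illustrative):

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.1, n_steps=100):
    """Iterate x_{i+1} = x_i - alpha * grad_F(x_i) for a fixed number of steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - alpha * grad_F(x)
    return x

# Minimize F(x) = (x - 3)^2, whose gradient is 2*(x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0])
```

Each step contracts the distance to the minimum by a constant factor here, so the iterates converge to \(x = 3\) for any sufficiently small \(\alpha\).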

In the context of neural networks, gradient descent provides a natural approach for tuning network parameters. The initial weights and biases can be interpreted as a single vector \(\theta_0\), and the iteration step above could, in theory, be used to identify the optimal parameters \(\theta^*\) for the model. The issue with this approach is that the function we are actually trying to minimize is defined in terms of the entire dataset \(\mathcal{D}\):

\[ F(\theta) \, \, = \, \, \frac{1}{|\mathcal{D}|} \, \sum_{x\in\mathcal{D}} \, f(x \, | \, \theta) \]

where \(f(x|\theta)\) denotes the loss for a single example \(x\) when using the model parameters \(\theta\). So the standard algorithm would require computing the average loss at each step of the iterative scheme\(\dots\)

Since computing the true gradient \(\nabla F(\theta)\) at every step is impractical for large datasets, we can instead try to approximate this gradient using a smaller, more manageable mini-batch of data:

\[ \nabla F(\theta) \, \, \approx \, \, \nabla\widehat{F}_i(\theta) \, \, = \, \, \frac{1}{|\mathcal{B}_i|} \, \sum_{x\in\mathcal{B}_i} \, \nabla f(x \, | \, \theta) \]

where the batches \(\mathcal{B}_i\) partition the dataset into smaller subsets (typically of equal size).

The iteration step for the mini-batch stochastic gradient descent (SGD) method is then taken to be:

\[ \theta_{i+1} \, \, = \, \, \theta_{i} \, - \, \alpha\cdot\nabla \widehat{F}_i(\theta_{i}) \]
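This mini-batch scheme can be sketched as follows (a minimal illustration: the toy per-example loss \(f(x\,|\,\theta) = (\theta - x)^2\) and the helper names are chosen only to demonstrate the update):

```python
import numpy as np

def sgd(grad_f, data, theta0, alpha=0.05, batch_size=32, n_epochs=20, seed=0):
    """Mini-batch SGD: estimate grad F from one batch B_i at each step."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(data)
    for _ in range(n_epochs):
        perm = rng.permutation(n)  # shuffle, then partition into batches B_i
        for start in range(0, n, batch_size):
            batch = data[perm[start:start + batch_size]]
            # Batch-averaged gradient estimate of the full objective F.
            grad_hat = np.mean([grad_f(x, theta) for x in batch], axis=0)
            theta = theta - alpha * grad_hat
    return theta

# Toy problem: f(x | theta) = (theta - x)^2, so F is minimized at the data mean.
data = np.random.default_rng(1).normal(loc=2.0, scale=0.5, size=256)
theta_star = sgd(lambda x, th: 2 * (th - x), data, theta0=[0.0])
```

Because each batch gradient is only an estimate of \(\nabla F\), the iterates fluctuate around the minimizer rather than converging to it exactly; this is the noise that the methods discussed next attempt to manage.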

There are a few potential obstacles with this approach, however:

- Fixed learning rates typically lead to suboptimal performance
- Defining a learning rate schedule manually does not allow the algorithm to adapt to the particular problem in consideration
- Different parameters often require different learning rates
- Since the directions/magnitudes of previous updates are not taken into consideration, defining optimization policies on small batches of data may lead to a noisy, inefficient training process

In particular, the choice of a suitable learning rate is essential for effective training of a neural network model. A learning rate that is too high can cause the iterates to overshoot or diverge, while one that is too low can lead to extremely slow convergence.

In this section we provide a brief review of the Adam optimization algorithm. The goal of the algorithm is to effectively minimize a real-valued, stochastic objective function \(f(\theta)\) by selecting optimal values for a collection of parameters \(\theta\). We assume the objective function is differentiable w.r.t. these parameters and, for our purposes, that the stochasticity in the objective function is the result of evaluating a fixed loss function on a varying collection of batches/subsamples \(\{\mathcal{D}_t\}\) from a global dataset \(\mathcal{D}\). The brief summary provided in this section is adapted directly from the original paper; readers are encouraged to consult it for a more detailed analysis.

The Adam optimization algorithm was introduced by Kingma and Ba ("Adam: A Method for Stochastic Optimization", 2014) and derives its name from the concept of 'adaptive moment estimation'. As the name suggests, the intuition behind the algorithm is that the efficiency of a gradient descent based optimization scheme can be improved by tracking the moments of the gradient estimates and incorporating this information into the update policy of the numerical scheme.
In particular, the Adam algorithm tracks the first two uncentered moments of the objective function gradient at each time step:
\[
\{m_t, v_t\} \hspace{0.25in} \mbox{with} \hspace{0.25in} \mathbb{E}[m_t] \, \, \approx \, \, \mathbb{E}[\,\nabla_\theta\, f_t(\theta)\,] \hspace{0.15in} \mbox{and} \hspace{0.15in} \mathbb{E}[v_t] \, \, \approx \, \, \mathbb{E}\big[\,(\nabla_\theta\, f_t(\theta))^2\,\big]
\]
where we denote by \(f_t(\theta)\) the evaluation of the fixed loss function on the particular batch of data occurring at timestep \(t\). This information is incorporated into the gradient descent algorithm by rescaling each parameter update by a factor of \(m_t/\sqrt{v_t}\). This is motivated by the intuition that \(m_t/\sqrt{v_t}\) can, loosely speaking, be interpreted as a signal-to-noise ratio for the stochastic gradient estimates. Accordingly, the stepsize of the descent scheme is automatically reduced in regions where the gradient estimates become increasingly noisy. This is of particular importance when the scheme has arrived near a true minimum of the objective function; at such a minimum, the gradient estimates simply reduce to mean-zero noise (caused by evaluation on varying subsamples of the full dataset).
In the context of the Adam optimization algorithm, the *optimal* parameters \(\theta^*\) are taken to be those which minimize the expected value of the objective function; i.e. \(\theta^* \, = \, \operatorname{argmin}_\theta \, \mathbb{E}[f(\theta)]\) where the expectation is taken w.r.t. the batch data \(\{\mathcal{D}_t\}\).
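The update policy described above can be sketched as follows (a minimal single-vector implementation based on the update rules in the original paper; the default hyperparameters follow the paper's suggestions, and the toy objective at the end is purely illustrative):

```python
import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam update: EMAs of the gradient and squared gradient, bias-corrected."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # first-moment EMA (initialized to zero)
    v = np.zeros_like(theta)  # second-moment EMA (initialized to zero)
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # remove initialization bias
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Deterministic sanity check: minimize F(theta) = theta^2 (gradient 2*theta).
theta_star = adam(lambda th: 2 * th, theta0=[1.0], alpha=0.01, n_steps=2000)
```

Note that, far from the minimum, the effective step magnitude is roughly \(\alpha\) per parameter regardless of the raw gradient scale, which is one of the practical appeals of the method.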

An outline of the implementation details of the Adam algorithm is provided above. The main idea is to track the moment estimates by maintaining two exponential moving averages \(\{m_t\}\) and \(\{v_t\}\). Of note, however, is the fact that the moment estimates are biased by their arbitrary initialization to zero. Fortunately, this bias can be easily accounted for using the observation that:

\[ m_t \, = \, (1 \, - \, \beta_1)\, \sum\nolimits_{i=1}^{\,t} \, \beta_1^{t-i} \cdot g_i \hspace{0.2in} \mbox{and} \hspace{0.2in} v_t \, = \, (1 \, - \, \beta_2)\, \sum\nolimits_{i=1}^{\,t} \, \beta_2^{t-i} \cdot g_i^{\,2} \]

where \(g_i\) denotes the gradient estimate \(\nabla_\theta \,f_i(\theta_{i-1})\) at the \(i^{th}\) timestep; these equalities follow from a simple expansion of the recursive definitions of \(m_t\) and \(v_t\). In the case where the true first and second moments are stationary, i.e. \(\mathbb{E}[g_i]\) and \(\mathbb{E}[g^2_i]\) are constant at each step, it follows from the geometric sum \((1-\beta)\sum_{i=1}^t \beta^{t-i} = 1-\beta^t\) that:

\[ \mathbb{E}[m_t] \, \, = \, \, \mathbb{E}\left[ (1-\beta_1)\, \sum_{i=1}^{t} \beta_1^{t-i}\cdot g_i \right] \, \, = \, \, \mathbb{E}[g_t] \cdot (1-\beta_1)\, \sum_{i=1}^t \, \beta_1^{t-i} \, \, = \, \, \mathbb{E}[g_t] \cdot (1-\beta_1^t) \]

Similarly, we have:

\[ \mathbb{E}[v_t] \, \, = \, \, \mathbb{E}\left[ (1-\beta_2)\, \sum_{i=1}^{t} \beta_2^{t-i}\cdot g^{\,2}_i \right] \, \, = \, \, \mathbb{E}[g^{\,2}_t] \cdot (1-\beta_2)\, \sum_{i=1}^t \, \beta_2^{t-i} \, \, = \, \, \mathbb{E}[g^{\,2}_t] \cdot (1-\beta_2^t) \]

Accordingly, the initialization bias can be removed from the moment estimates by simply defining \(\widehat{m}_t \, = \, m_t/(1-\beta_1^t)\) and \(\widehat{v}_t \, = \, v_t/(1-\beta_2^t)\) at each step of the algorithm.
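The equivalence between the recursive moving-average definition and its expanded form can be verified numerically (a small sanity check with arbitrary sample values standing in for the gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, t = 0.9, 25
g = rng.normal(size=t)  # hypothetical gradient samples g_1, ..., g_t

# Recursive EMA: m_i = beta1 * m_{i-1} + (1 - beta1) * g_i, with m_0 = 0.
m = 0.0
for g_i in g:
    m = beta1 * m + (1 - beta1) * g_i

# Closed form from the expansion: m_t = (1 - beta1) * sum_i beta1^(t-i) * g_i.
m_closed = (1 - beta1) * sum(beta1 ** (t - i) * g[i - 1] for i in range(1, t + 1))
```

The two quantities agree up to floating-point roundoff, confirming the expansion used in the bias-correction argument.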

The backpropagation framework was introduced by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 Nature article *Learning representations by back-propagating errors*.

In this section we will follow the notation of the original paper which dealt with sigmoidal activations \(\sigma\) and defined the loss \(E\) in terms of predictions \(y_j\) and true values \(d_j\) via:

\[ E \, \, = \, \, \frac{1}{2} \, \sum_j \, (y_j \, - \, d_j)^2 \]

The term "backpropagation" refers to the notion of *propagating* gradient calculations backward through the network. The associated computational flow can be broken up into a sequence of steps which are illustrated below:

As the diagrams above suggest, we begin by computing the gradient \(\partial E / \partial y_j\) of the loss with respect to one of the final layer's nodes. Repeated applications of the chain rule are then leveraged to extend the gradient calculations backward until we reach \(\partial E / \partial y_i\). This corresponds to the derivative with respect to the penultimate layer's output, and allows us to simply repeat the steps outlined above, now applied to the previous network layer.
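The steps above can be sketched for a small two-layer sigmoid network (a minimal illustration using the paper's squared-error loss; the layer sizes, learning rate, and toy example are arbitrary choices for demonstration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, lr=0.5):
    """One forward/backward pass for a two-layer sigmoid network with
    E = 1/2 * sum_j (y_j - d_j)^2, mirroring the chain-rule steps above."""
    # Forward pass.
    y_i = sigmoid(W1 @ x)    # penultimate (hidden) layer outputs
    y_j = sigmoid(W2 @ y_i)  # final layer outputs
    # Backward pass: start from dE/dy_j, then apply the chain rule repeatedly.
    dE_dyj = y_j - d                     # dE/dy_j for the squared-error loss
    dE_dzj = dE_dyj * y_j * (1.0 - y_j)  # through the sigmoid: sigma' = y(1-y)
    dE_dW2 = np.outer(dE_dzj, y_i)
    dE_dyi = W2.T @ dE_dzj               # dE/dy_i: same steps repeat one layer back
    dE_dzi = dE_dyi * y_i * (1.0 - y_i)
    dE_dW1 = np.outer(dE_dzi, x)
    return W1 - lr * dE_dW1, W2 - lr * dE_dW2

# Toy usage: repeated steps on a single example should reduce the loss E.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))
x, d = np.array([1.0, 0.5]), np.array([0.0, 1.0])
E = lambda W1, W2: 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ x)) - d) ** 2)
E_before = E(W1, W2)
for _ in range(50):
    W1, W2 = backprop_step(x, d, W1, W2)
E_after = E(W1, W2)
```

The key structural point is that `dE_dyi` plays exactly the role for the hidden layer that `dE_dyj` played for the output layer, which is what lets the same computation propagate backward through arbitrarily many layers.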

Two commonly used methods for automating the process of computing derivatives are *symbolic differentiation* and *numeric differentiation*; however, both of these methods have severe practical limitations in the context of training neural networks.

- Symbolic differentiation produces exact derivatives through direct manipulation of the mathematical expressions used to define functions; the resulting expressions can be lengthy and contain unnecessary computations, however, and are inefficient unless additional "expression simplification" steps are included
- Numeric differentiation techniques are widely applicable and efficient; however, the resulting inexact gradient estimates can entirely undermine the training process for large networks

*Automatic differentiation* (also referred to as AD or autodiff) in "reverse mode" generalizes backpropagation and gives us a way to carry out the required gradient calculations *exactly* and *efficiently*.

- Computes derivatives using the underlying computational graph
- Very efficient with respect to evaluation time
- A trace of all elementary operations is stored on an "evaluation tape", or "Wengert list"
- Potentially large storage requirements for storing the operation lists
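A toy version of reverse-mode AD can be built directly on this idea (a minimal sketch: the `Tape` and `Var` classes below are illustrative and support only a few elementary operations):

```python
import math

class Tape:
    """Records elementary operations in order (a 'Wengert list')."""
    def __init__(self):
        self.ops = []  # each entry: (output_var, [(input_var, local_derivative)])

class Var:
    def __init__(self, value, tape):
        self.value, self.tape, self.grad = value, tape, 0.0

    def _record(self, value, parents):
        out = Var(value, self.tape)
        self.tape.ops.append((out, parents))
        return out

    def __add__(self, other):
        return self._record(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])

    def sin(self):
        return self._record(math.sin(self.value), [(self, math.cos(self.value))])

def backward(output):
    """Reverse sweep over the tape, accumulating adjoints via the chain rule."""
    output.grad = 1.0
    for out, parents in reversed(output.tape.ops):
        for parent, local_grad in parents:
            parent.grad += local_grad * out.grad

# f(x, y) = x*y + sin(x); analytically df/dx = y + cos(x) and df/dy = x.
tape = Tape()
x, y = Var(2.0, tape), Var(3.0, tape)
f = x * y + x.sin()
backward(f)
```

The forward pass records every elementary operation and its local derivatives on the tape; a single backward sweep then recovers exact gradients for all inputs, which is precisely why the storage cost of the tape grows with the length of the computation.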

For additional information regarding automatic differentiation, the following references may be instructive:

- Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., *Automatic differentiation in machine learning: a survey*, 2015.
- *Backpropagation and Automatic Differentiation*, University of Washington CSE599W (Spring 2018), slides from Lecture 4.