For a feedforward neural network, the depth of the network is the number of hidden layers plus one (as the output layer is also parameterized). The width of the network is the dimensionality of its hidden layers.

The main architectural considerations for designing a neural network are choosing the depth of the network and the width of each layer.
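As a minimal sketch of these two definitions, the snippet below derives depth and width from a list of layer sizes. The list `layer_sizes` and the helper `depth_and_widths` are hypothetical names chosen for illustration, and a fully connected network is assumed.

```python
# Hypothetical layer sizes: 4 inputs, two hidden layers (8 and 6 units), 3 outputs.
layer_sizes = [4, 8, 6, 3]


def depth_and_widths(layer_sizes):
    """Return (depth, hidden_widths) for a fully connected feedforward network."""
    hidden_widths = layer_sizes[1:-1]  # drop the input and output layers
    depth = len(hidden_widths) + 1     # hidden layers plus the output layer
    return depth, hidden_widths


depth, widths = depth_and_widths(layer_sizes)
print(depth, widths)  # 3 [8, 6]
```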


Depth and Width for Neural Networks

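Dropout is a regularization technique that randomly “kills” (zeroes out) a fraction of neurons during each training step to prevent overfitting. The dropout rate is the hyperparameter that sets what fraction of neurons is dropped. At inference time, dropout is turned off.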
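A minimal sketch of how dropout can be applied to one layer's activations, assuming the common "inverted dropout" variant in which the surviving activations are rescaled and a fresh random mask is drawn on every forward pass. The function name `dropout` and its arguments are illustrative, not taken from any particular library.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` and
    rescale the survivors so the expected activation stays the same."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10                     # hypothetical hidden-layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)                          # each entry is either 0.0 or 2.0
```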

Dropout

The learning rate of a neural network is the step size that gradient descent uses when it updates the weights from the gradients computed by backpropagation. A larger learning rate can make the network train faster, but it may overshoot the minimum of the loss function or even cause training to diverge.
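The effect of the learning rate can be sketched on a toy problem. The example below assumes the one-dimensional loss f(x) = x², whose gradient is 2x; the function name and the three learning rates are illustrative choices, not recommended values.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(x) = x**2 (gradient 2*x) starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


small = gradient_descent(0.01)  # slow: still far from the minimum at 0
good = gradient_descent(0.1)    # much closer to 0 after the same number of steps
large = gradient_descent(1.1)   # too large: every step overshoots and |x| grows
print(small, good, large)
```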

Neural Network Learning Rate

When you train on a dataset that is too large to fit in your computer's memory, you can split the whole training set into several batches that do fit. You then feed these batches to your model one by one. Once every batch has been fed to the model once, you have completed one epoch. Training a model successfully usually takes multiple epochs.
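The relationship between batches and epochs can be sketched as two nested loops. The dataset, `batch_size`, and `num_epochs` values below are placeholders, and the training step itself is left as a comment.

```python
def iterate_batches(dataset, batch_size):
    """Yield consecutive slices of `dataset` with at most `batch_size` items."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


dataset = list(range(10))  # stand-in for 10 training examples
batch_size = 4
num_epochs = 3

batches_seen = 0
for epoch in range(num_epochs):
    # One epoch = one full pass over every batch in the training set.
    for batch in iterate_batches(dataset, batch_size):
        batches_seen += 1  # a real loop would run one training step here

print(batches_seen)  # 9 (3 batches per epoch, 3 epochs)
```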

Epochs in Machine Learning

In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each connection into a neuron has a weight. The neuron multiplies each input by its weight, sums the results (usually adding a bias term), and passes that sum through an activation function to produce its output, which is transferred to the next layer.
The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
Activation functions are mathematical functions that determine the output of each neuron, and therefore of the network. The function is attached to each neuron in the network and determines whether it should be activated (“fired”) or not, based on whether the neuron’s input is relevant for the model’s prediction. Some activation functions also squash the output of each neuron into a fixed range, for example between 0 and 1 (sigmoid) or between -1 and 1 (tanh). Others, such as ReLU, are unbounded above. When used in a multilayer neural network, activation functions can be different for different layers.
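A few common activation functions can be sketched in plain Python. The `neuron` helper, which applies an activation to a weighted sum plus a bias, and the example inputs and weights are assumptions added for illustration.

```python
import math


def step(z, threshold=0.0):
    """Step function: the neuron is either fully on or fully off."""
    return 1.0 if z >= threshold else 0.0


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Passes positive values through unchanged and zeroes out negatives."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


print(neuron([1.0, 2.0], [0.5, -0.25], 0.5, relu))  # 0.5
```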

Activation Functions in Neural Networks

When a neural network trains, it uses an algorithm called an optimizer to search for weights that minimize the loss.
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad
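To make the difference between two of these optimizers concrete, the sketch below compares plain gradient descent with gradient descent with momentum on the toy loss f(w) = w². It assumes the exponentially weighted average form of momentum (other texts use a slightly different but related update), and all names and values are illustrative.

```python
def gradient(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w


def run_plain(w, lr, steps):
    """Plain gradient descent: step directly along the negative gradient."""
    for _ in range(steps):
        w = w - lr * gradient(w)
    return w


def run_momentum(w, lr, steps, beta=0.9):
    """Gradient descent with momentum: step along a running average of gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + (1.0 - beta) * gradient(w)
        w = w - lr * velocity
    return w


w_plain = run_plain(5.0, lr=0.1, steps=300)
w_momentum = run_momentum(5.0, lr=0.1, steps=300)
print(w_plain, w_momentum)  # both end up very close to the minimum at 0
```

On a smooth one-dimensional loss like this one, momentum brings no advantage; it is usually motivated by noisy or poorly conditioned losses, which this sketch does not model.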

Deep Learning Optimizer Algorithms

Every weight in the network needs an initial value before the first forward pass. Two basic options are to set all weights to zero or to randomize them. Setting all weights to zero makes every neuron in a layer compute the same output and receive the same update, so the neurons never learn different features. Random weights break this symmetry, but if their scale is poorly chosen, the gradients can vanish or explode, which makes the model difficult to train. To mitigate this problem, you can use a heuristic (a formula tied to the number of inputs and outputs of each layer) to set the scale of the random weights. A common heuristic used for the Tanh activation is called Xavier initialization.
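A minimal sketch of Xavier (Glorot) initialization, assuming the uniform variant in which each weight is drawn from the interval [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). Here `fan_in` and `fan_out` are the number of inputs and outputs of the layer. The function name and the layer size are illustrative.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)                      # fixed seed for repeatability
weights = xavier_uniform(fan_in=4, fan_out=3, rng=rng)
limit = math.sqrt(6.0 / (4 + 3))            # the bound used for this layer
print(len(weights), len(weights[0]))        # 4 3
```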

On top of that, if the architecture being used has already been trained by someone else (for example, on the ImageNet dataset), the weights can be initialized to the pretrained values. This is known as transfer learning, and it can substantially increase training speed and model accuracy.

Deep Learning Weight Initialization

- Grid search
- Random search
- Coarse to fine
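The three methods above can be sketched on a toy problem. Everything in this example is hypothetical: `validation_score` is a stand-in for training a model and scoring it on validation data, and the search ranges are arbitrary. Random search here samples the learning rate on a log scale, which is a common but not universal choice.

```python
import itertools
import math
import random


def validation_score(learning_rate, hidden_units):
    """Hypothetical score, highest near learning_rate=0.01 and hidden_units=64."""
    lr_term = (math.log10(learning_rate) + 2.0) ** 2
    width_term = ((hidden_units - 64) / 64.0) ** 2
    return -(lr_term + width_term)


def score(candidate):
    return validation_score(*candidate)


# Grid search: evaluate every combination of a fixed set of values.
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]
hidden_unit_options = [16, 32, 64, 128]
best_grid = max(itertools.product(learning_rates, hidden_unit_options), key=score)

# Random search: sample candidates at random from the search ranges.
rng = random.Random(0)


def sample(exp_low, exp_high, units_low, units_high):
    return (10 ** rng.uniform(exp_low, exp_high),
            rng.randint(units_low, units_high))


coarse = [sample(-4, -1, 16, 128) for _ in range(20)]
best_coarse = max(coarse, key=score)

# Coarse to fine: sample again in a narrower region around the best coarse point.
center = math.log10(best_coarse[0])
units = best_coarse[1]
fine = [sample(center - 0.5, center + 0.5,
               max(16, units - 16), min(128, units + 16))
        for _ in range(20)]
best_fine = max(fine + [best_coarse], key=score)

print(best_grid, best_coarse, best_fine)
```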

Hyperparameters Tuning Methods in Deep Learning

A model parameter is a configuration variable that is internal to the model and whose value is estimated, or learned, from data. It is required by the model when making predictions, and it is usually not set manually by the practitioner. Examples include the coefficients in a linear regression or logistic regression and the weights in an artificial neural network.

A model hyperparameter is a configuration that is external to the model and whose value is not learned from the training data by the model itself. It is often used in processes that help estimate model parameters, and it is usually specified by the practitioner. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error. Examples include the learning rate for training a neural network and the K in K-nearest neighbors.
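The distinction can be sketched with a one-weight linear model fit by gradient descent. The data, the hyperparameter values, and the variable names are all made up for illustration: `learning_rate` and `num_steps` are hyperparameters chosen by hand, while `w` is a parameter learned from the data.

```python
# Toy data generated by y = 2 * x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

learning_rate = 0.01  # hyperparameter: set by the practitioner
num_steps = 500       # hyperparameter: set by the practitioner

w = 0.0               # parameter: learned from the data
for _ in range(num_steps):
    # Gradient of the mean squared error of the model y = w * x with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 4))  # 2.0
```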

Difference between Model Parameter and Model Hyperparameter

The trade-off between the standard prediction loss and the additive weight decay penalty is characterized by the regularization constant, $$\lambda$$. This nonnegative hyperparameter is fit using validation data and modifies the objective to $$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2$$. When $$\lambda = 0$$, the original loss function is recovered. For $$\lambda > 0$$, the size of the weights is restricted, with larger values of $$\lambda$$ constraining the weights more considerably. The penalty term is divided by $$2$$ by convention so that the constant cancels out gracefully when the derivative of the quadratic function is taken.
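As a worked step, differentiating the penalized objective shows why the factor of $$2$$ is convenient. The learning rate symbol $$\alpha$$ and the plain gradient descent update are assumptions added here for illustration:

$$\nabla_{\mathbf{w}} \left( L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2 \right) = \nabla_{\mathbf{w}} L(\mathbf{w}, b) + \lambda \mathbf{w}$$

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \left( \nabla_{\mathbf{w}} L(\mathbf{w}, b) + \lambda \mathbf{w} \right) = (1 - \alpha \lambda)\, \mathbf{w} - \alpha \nabla_{\mathbf{w}} L(\mathbf{w}, b)$$

Under this update, each step first shrinks the weights by the factor $$(1 - \alpha \lambda)$$, which is where the name weight decay comes from.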

Regularization Constant

Batch normalization is a technique designed to accelerate and stabilize the training of deep neural networks. Mechanistically, it centers and rescales the intermediate layer activations back to a controlled mean and variance, preventing their distributions from diverging across layers and over time. By keeping these intermediate values on a comparable scale, batch normalization enables the use of more aggressive learning rates. The technique was originally motivated by the concept of covariate shift applied to internal layers, but the hypothesis that it works by reducing this so-called internal covariate shift has since been challenged and does not appear to be a valid explanation for its effectiveness. Although intuitively thought to make the optimization landscape smoother, the precise mechanism by which batch normalization aids training remains an open research question. Despite this theoretical uncertainty, batch normalization has proven indispensable in practice, being applied in nearly all deployed image classifiers and earning the original paper tens of thousands of citations.
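The centering and rescaling step can be sketched for a single feature across one mini-batch. This assumes the usual training-time formulation with a learnable scale `gamma`, a learnable shift `beta`, and a small constant `eps` for numerical stability. Running statistics for inference are omitted, and the input values are made up.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta
            for v in values]


activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations for one feature
normalized = batch_norm(activations)
print(normalized)                   # roughly zero mean and unit variance
```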

Batch Normalization

Hyperparameters related to neural network structure:
- Number of hidden layers (Depth)
- Number of hidden units (Width)
- Dropout rate
- Activation function for each layer
- Weight initialization

Hyperparameters related to training algorithm:
- Learning rate $\alpha$
- Momentum parameter $\beta \sim 0.9$
- Adam parameters $\beta_1 \sim 0.9, \beta_2 \sim 0.999, \epsilon \sim 10^{-8}$
- Number of gradient descent iterations
- Mini-batch size 
- Optimizer algorithm
- Learning rate decay
- Regularization rate $\lambda$ 

University of Michigan - Ann Arbor

A hyperparameter is a setting that affects the structure or operation of the neural network. In real deep learning projects, tuning hyperparameters is the primary way to build a network that provides accurate predictions for a certain problem. 

Hyperparameters of Feedforward Neural Network

A helpful website that introduces neural networks:
https://missinglink.ai/guides/neural-network-concepts/

Neural Network Reference

One reason that people tune hyperparameters is to reduce the bias and variance of a model. Bias is the difference between the average prediction of our model and the correct value that we are trying to predict. Models with high bias pay little attention to the training data and oversimplify the problem, which leads to high error on both the training and test sets (underfitting). Variance is the variability of the model's prediction for a given data point when the model is trained on different samples of the data. High-variance models fit the training data too closely, including its noise, so they may perform well on the training data but generalize poorly to test sets (overfitting). Briefly, the bias-variance trade-off is a scale between underfitting and overfitting. By adjusting various hyperparameters, you can move along this scale to find the setting that gives the best predictions.

Bias and Variance in Deep Learning

Tuning or optimizing hyperparameters involves finding the values of each hyperparameter which will help the model provide the most accurate predictions.
