A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.

As the figure shows, deeper networks generalize better when used to transcribe multidigit numbers from photographs of addresses: the test set accuracy consistently increases with increasing depth. Data from Goodfellow et al. (2014d).


University of Michigan - Ann Arbor

For a feedforward neural network, the depth of the network is the number of hidden layers plus one (as the output layer is also parameterized). The width of the network is the dimensionality of its hidden layers.
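
As a minimal sketch of this counting convention, assume a network is described only by a list of layer sizes running from the input to the output. The function name and the example sizes below are hypothetical and chosen purely for illustration.

```python
def depth_and_widths(layer_sizes):
    """layer_sizes = [input_dim, hidden_1, ..., hidden_k, output_dim]."""
    hidden = layer_sizes[1:-1]      # everything between the input and the output
    depth = len(hidden) + 1         # hidden layers plus the parameterized output layer
    return depth, hidden

# Hypothetical network: 784 inputs, two hidden layers, 10 outputs.
depth, widths = depth_and_widths([784, 128, 64, 10])
print(depth)    # 3
print(widths)   # [128, 64]
```

The input is not counted because it has no parameters of its own.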

The main architectural considerations for designing a neural network are choosing the depth of the network and the width of each layer.


Depth and Width for Neural Networks

The no free lunch theorem implies that no single regularization strategy is best at every task. That does not mean that all strategies are equally useful: there may be no best general strategy, but some strategies are useful more often than others. The following regularization assumptions and strategies have been shown to be generally useful:

- Smoothness
- Linearity
- Multiple explanatory factors
- Causal factors
- Depth/hierarchical factors
- Shared factors across tasks
- Manifolds
- Natural clustering
- Temporal and spatial coherence
- Sparsity
- Simplicity of factor dependencies


Appropriate Regularization/ Representation

Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Retrieved from [www.deeplearningbook.org](https://www.deeplearningbook.org)

Deep Learning

Effect of Depth for Neural Networks

In a neural network with many layers or time steps, the gradient at an early layer is a product of terms contributed by all of the later layers, which makes it inherently unstable. When these terms are small, the product shrinks toward zero, so the weights of the early layers barely update or stop updating altogether: the gradient vanishes. An exploding gradient is the opposite problem. The terms are large, the product grows rapidly, and the resulting gradient descent updates make the weights so large that the whole network becomes unstable, which can lead to numerical overflow.
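
The effect of this repeated multiplication can be sketched numerically. The example below makes a strong simplifying assumption, namely that every layer scales the backpropagated gradient by the same constant factor. Real networks multiply by layer-specific Jacobians, so this is only an illustration, and the function name is hypothetical.

```python
def gradient_scale(per_layer_factor, num_layers):
    """Magnitude of a backpropagated gradient after passing through num_layers
    layers, assuming each layer multiplies it by the same factor."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

for factor in (0.5, 1.0, 1.5):
    print(factor, gradient_scale(factor, 50))
```

With 50 layers, a factor of 0.5 shrinks the gradient below 1e-15 (vanishing), while a factor of 1.5 grows it beyond 1e8 (exploding). Only a factor of exactly 1.0 leaves it unchanged.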

Vanishing/exploding gradient

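The depth of a model can be measured in two main ways:
- By the length of the longest sequential path of operations from the input to the output. The result depends on what is defined as a unit operation. The accompanying figure illustrates how logistic regression can be viewed as a model with a depth of 3 (when addition, multiplication, and the logistic sigmoid are the unit operations) or a depth of 1 (when logistic regression itself is treated as a single operation).
- By the depth of the graph that describes how concepts are related to each other.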
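
The first approach can be sketched as a longest-path computation on a computational graph. The dictionary encoding, the node names, and the two-input logistic regression below are assumptions made for illustration only.

```python
def graph_depth(graph, node):
    """Longest chain of operations from any input to `node`.
    `graph` maps each operation to the nodes it reads; raw inputs and
    parameters do not appear as keys."""
    if node not in graph:   # an input or parameter contributes no operation
        return 0
    return 1 + max(graph_depth(graph, parent) for parent in graph[node])

# Unit operations: multiplication, addition, logistic sigmoid.
fine_grained = {
    "w1*x1": ["w1", "x1"],
    "w2*x2": ["w2", "x2"],
    "sum": ["w1*x1", "w2*x2"],
    "sigmoid": ["sum"],
}
# Unit operation: logistic regression as a whole.
coarse = {"logistic_regression": ["w1", "x1", "w2", "x2"]}

print(graph_depth(fine_grained, "sigmoid"))        # 3
print(graph_depth(coarse, "logistic_regression"))  # 1
```

The same function computes both depths. Only the choice of unit operation changes.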

Measuring the depth of the model

In a feedforward neural network, forward propagation uses matrix multiplication, so each element of a layer's output depends on every element of its input. Now imagine the input is a 500 x 500 image: with the three image channels, the input has 3 x 500 x 500 = 750,000 dimensions. That number is only one side of the parameter matrix in the first layer, so the first layer needs 750,000 weights for every unit it contains.

In contrast, in a convolutional layer each output value depends on only a small number of inputs, which significantly improves efficiency and reduces memory requirements. This makes it practical to use very large input images.
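
To put rough numbers on this comparison, the sketch below counts parameters and the number of inputs each output depends on. The 1,000-unit dense layer and the 64 filters of size 3 x 3 are hypothetical choices, not values from the source.

```python
def dense_layer_parameters(inputs, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return inputs * units + units

def conv_layer_parameters(in_channels, filters, kernel_size):
    # One kernel_size x kernel_size patch per input channel per filter, plus biases.
    return in_channels * kernel_size * kernel_size * filters + filters

inputs = 3 * 500 * 500
print(inputs)                                 # 750000 inputs feed every dense output
print(dense_layer_parameters(inputs, 1000))   # 750001000
print(3 * 3 * 3)                              # 27 inputs feed each convolutional output
print(conv_layer_parameters(3, 64, 3))        # 1792
```

The convolutional parameter count does not depend on the image size at all, which is why large inputs remain affordable.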

Sparsity of Connections in Convolutional Neural Networks

Causal inference is the process of drawing conclusions about the reasons behind a relationship between two or more variables. After reaching some conclusions from statistical data, we ask why such relationships exist between the variables.

Causal Inference

To fit a smooth curve to a set of data, we need to find a function $g(x)$ that fits the observed data well. We want $g$ to keep $RSS = \sum_{i=1}^{n} (y_i - g(x_i))^2$ as small as possible while keeping the curve as smooth as possible.

The smoothing spline is the function $g$ that minimizes a loss term plus a roughness penalty, where $\lambda$ is a nonnegative tuning parameter:
$\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt$
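
The trade-off between the two terms can be sketched numerically. The example below does not fit a spline. It only evaluates the objective for two hand-picked candidate functions, approximating the second derivative with finite differences and the integral with a Riemann sum. The data, the candidates, and all names are hypothetical.

```python
import math

def penalized_rss(xs, ys, g, lam, grid, step):
    """RSS plus lam times a numerical estimate of the integral of g''(t)^2."""
    rss = sum((y - g(x)) ** 2 for x, y in zip(xs, ys))
    roughness = 0.0
    for t in grid:
        # Central finite-difference approximation of the second derivative.
        second = (g(t + step) - 2 * g(t) + g(t - step)) / step ** 2
        roughness += second ** 2 * step
    return rss + lam * roughness

xs, ys = [0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.2, 2.9]
step = 0.01
grid = [i * step for i in range(301)]   # covers the interval [0, 3]

def line(x):
    return x                              # perfectly smooth candidate

def wiggly(x):
    return x + 0.2 * math.sin(10 * x)     # closer to the data, but rough

for lam in (0.0, 1.0):
    print(lam, penalized_rss(xs, ys, line, lam, grid, step),
          penalized_rss(xs, ys, wiggly, lam, grid, step))
```

With $\lambda = 0$ the wiggly candidate wins because only the fit matters. With $\lambda = 1$ its roughness penalty dominates and the straight line, whose second derivative is zero, has the smaller objective.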


Smoothing Splines

Clustering places unlabeled samples into groups based on similar characteristics.
For example, the samples shown below can be separated into 5 categories by some observed characteristic. Suppose we are clustering textbook samples by school subject (math, science, history, English, art).
![The image from the cited reference](https://firebasestorage.googleapis.com/v0/b/onecademy-1.appspot.com/o/UploadedImages%2FMadeline-Rosenberg_Wed%2C%2019%20Feb%202020%2015%3A27%3A54%20GMT.png?alt=media&token=5b890a1d-803b-4ad5-a65f-1ec4043225c2) 
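
As a minimal sketch of how such groups can be found without labels, the example below runs k-means on one-dimensional data. The source does not name a specific algorithm, so k-means is an assumption here, as are the data points and the initial centers.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical unlabeled measurements and starting centers.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8]
centers, groups = kmeans_1d(points, centers=[0.0, 4.0, 10.0])
print(groups)   # [[1.0, 1.2, 0.8], [5.0, 5.3, 4.9], [9.1, 8.8]]
```

No labels are used at any point. The groups emerge only from how close the samples are to one another.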

Clustering, an unsupervised statistical learning method

Manifold learning algorithms are good at finding low-dimensional structure in high-dimensional data.
Methods:
- Multidimensional scaling (MDS): attempts to find a distance-preserving low-dimensional projection, so that information about how close points are to each other in the original data space is preserved.
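
What "distance-preserving" means can be sketched by scoring a candidate projection with the sum of squared differences between original and projected pairwise distances (often called raw stress). This example only evaluates given projections. It does not run MDS itself, and the points and function names are hypothetical.

```python
import math

def pairwise_distances(points):
    return [[math.dist(p, q) for q in points] for p in points]

def stress(high_dim, low_dim):
    """Sum of squared differences between original and projected distances."""
    dh, dl = pairwise_distances(high_dim), pairwise_distances(low_dim)
    n = len(high_dim)
    return sum((dh[i][j] - dl[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))

# Points in 3-D that actually lie along a single line (a 1-D structure).
high_dim = [(t, 2.0 * t, 2.0 * t) for t in (0.0, 1.0, 2.0, 3.0)]
good_1d = [(3.0 * t,) for t in (0.0, 1.0, 2.0, 3.0)]   # keeps every distance
poor_1d = [(t,) for t in (0.0, 1.0, 2.0, 3.0)]         # shrinks every distance

print(stress(high_dim, good_1d))
print(stress(high_dim, poor_1d))
```

The first projection has essentially zero stress, so a single dimension captures all of the distance information. The second distorts the distances and scores worse.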


Manifold learning algorithms

While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints that force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, because we interpret the various models or model components as sharing a unique set of parameters. A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) needs to be stored in memory. In certain models, such as the convolutional neural network, this can lead to a significant reduction in the memory footprint of the model.



- Parameter sharing: forcing sets of parameters to be equal
- Main application: convolutional neural networks (CNNs)
- Advantage: significant reduction in the memory footprint of the model
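
The memory saving can be sketched with a one-dimensional convolution, in which the same small kernel is reused at every output position. The signal, the kernel, and the comparison with an untied (locally connected) layer are hypothetical illustrations.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 0.0, -1.0]

output = conv1d(signal, kernel)
print(output)                        # [-2.0, -2.0, -2.0, -2.0]

shared_weights = len(kernel)                 # one kernel stored once
untied_weights = len(output) * len(kernel)   # a separate kernel per output position
print(shared_weights, untied_weights)        # 3 12
```

Both layers connect each output to the same three inputs. Sharing only changes how many distinct weights have to be stored.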




Parameter Sharing

Temporal and spatial coherence is the idea that the exact values of the underlying factors that generated a given input may change over time or distance. Any difference in time or location between when the input was captured and when the training output was recorded may therefore introduce noise that reflects that space-time discrepancy. As a result, the underlying factors that caused the input may be easier to determine than the precise output value that was recorded.

Temporal and Spatial Coherence

Simplicity of factor dependencies is the assumption that the relationships between underlying factors are simple to express. For example, it is common to assume that these factors are independent of each other, so that the probability of any factor set $h$ is simply the product of the probabilities of the individual factor states $h_i$:
$$P(h) = \prod_i P(h_i)$$
Many machine learning algorithms assume either this independence or some other relationship that is simple to calculate.
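
Under the independence assumption, the joint probability is cheap to compute from per-factor tables. The sketch below uses three hypothetical binary factors with made-up marginal probabilities.

```python
import math

def joint_probability(marginals, state):
    """P(h) as the product of P(h_i), assuming the factors are independent.
    marginals[i] maps each value of factor i to its probability."""
    return math.prod(table[value] for table, value in zip(marginals, state))

# Hypothetical marginals for three binary factors.
marginals = [
    {0: 0.9, 1: 0.1},
    {0: 0.5, 1: 0.5},
    {0: 0.2, 1: 0.8},
]

p = joint_probability(marginals, (1, 0, 1))
print(p)   # approximately 0.04, that is 0.1 * 0.5 * 0.8
```

Three small tables (six numbers) are enough to describe all eight joint states. Without the independence assumption, a full joint table would need one entry per state.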
