Concept

Universal Approximation Theorem

The universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well.

This means that, regardless of what function we are trying to learn, a large enough feedforward network will be able to represent it.
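One classic constructive intuition for the theorem is that pairs of steep sigmoids form "bump" functions over small intervals, and a linear output layer can weight those bumps by the target function's local values. The sketch below (my own illustration, not from the source; the function names and constants are arbitrary) builds such a one-hidden-layer sigmoid network to approximate sin(2πx) on [0, 1]:

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow warnings for very steep units
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

def approximate(f, x, n_units=300, k=3000.0):
    """Approximate f on [0, 1] with one hidden layer of sigmoid units.

    Each pair of steep sigmoids forms a 'bump' over one sub-interval;
    the linear output layer weights each bump by f at the interval midpoint.
    Uses 2 * n_units hidden units in total.
    """
    edges = np.linspace(0.0, 1.0, n_units + 1)
    left, right = edges[:-1], edges[1:]
    mids = 0.5 * (left + right)
    # hidden layer activations: bump_i(x) = sigmoid(k(x - l_i)) - sigmoid(k(x - r_i))
    bumps = sigmoid(k * (x[:, None] - left)) - sigmoid(k * (x[:, None] - right))
    # linear output layer: weighted sum of bumps
    return bumps @ f(mids)

target = lambda t: np.sin(2.0 * np.pi * t)
x = np.linspace(0.05, 0.95, 50)
y_hat = approximate(target, x)
print(np.max(np.abs(y_hat - target(x))))  # small worst-case error
```

Increasing `n_units` shrinks the sub-intervals and drives the error toward zero, which is the "enough hidden units" clause of the theorem; the construction says nothing about whether gradient descent would *find* these weights.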


Updated 2020-11-02

References

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.

Tags

Data Science