In the late 1990s, researchers Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM), which enables Recurrent Neural Networks (RNNs) to retain information over extended sequences rather than merely between consecutive time steps. Originally published in 1997, LSTMs gained significant recognition through victories in prediction competitions during the mid-2000s and became the dominant architecture for sequence learning from 2011 until the rise of Transformer models beginning in 2017. Even Transformers owe some of their key ideas to architectural design innovations first introduced by the LSTM. An LSTM-based RNN shares the same high-level architecture as a basic RNN (whether simple, bidirectional, or deep), but replaces standard activation functions with specialized LSTM cells.

RNNs come in many variants, including:
 - Bidirectional RNNs
 - Long Short-Term Memory (LSTM) networks
 - Fully recurrent networks
 - Elman/Jordan networks, or Simple Recurrent Networks (SRN)
 - Recursive neural networks
 - Hopfield networks
 - Echo State Network (ESN)
 - Stacked RNNs
 - Hierarchical RNNs
 - Neural Turing machines (NTMs)
 - Differentiable Neural Computer (DNC) 
 - Recurrent Multi-Layer Perceptron (RMLP)
 - Independent RNN (IndRNN)
 - Neural history compressor 
 - Second order RNNs 
 - Gated recurrent unit (GRU)
 - Continuous-Time Recurrent Neural Network (CTRNN)
 - Multiple Timescales Recurrent Neural Network (MTRNN) 
 - Neural Network Pushdown Automata (NNPDA)

RNN Extensions and Types

The long short-term memory (LSTM) network is an extension of the RNN that addresses problems such as vanishing gradients and the inability of simple RNNs to carry critical information forward over many time steps.

LSTMs divide the context management problem into two sub-problems: 
- Removing information no longer needed from the context
- Adding information likely to be needed for later decision making

Long Short-Term Memory (LSTM) Network

A helpful website that introduces neural networks:
https://missinglink.ai/guides/neural-network-concepts/

Neural Network Reference

Dive into Deep Learning

A bidirectional RNN assumes that the correct output depends not only on the previous inputs in the time series but also on future inputs. For example, in translation models, it is often necessary to use a word from the end of the source sentence in order to predict a word early in the target sentence.

To make this possible, two RNNs process the same sequence, one going from beginning to end and the other from end to beginning, and the output at each step is computed from the hidden states of both networks.
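
The two-direction idea can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the simple tanh RNN, the helper names (`rnn_pass`, `bidirectional_rnn`, `init_params`), the random weights, and the choice to combine the two hidden states by concatenation are all assumptions made for the example.

```python
import numpy as np

def rnn_pass(xs, W_x, W_h, b):
    """Run a simple tanh RNN over xs (steps x input_dim); return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    # Forward RNN reads the sequence from beginning to end.
    h_fwd = rnn_pass(xs, *fwd_params)
    # Backward RNN reads it from end to beginning; its states are then
    # flipped back so that row t lines up with time step t.
    h_bwd = rnn_pass(xs[::-1], *bwd_params)[::-1]
    # Assumption: combine both directions by concatenation.
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5

def init_params():
    # Hypothetical random initialisation, for illustration only.
    return (rng.normal(size=(hidden_dim, input_dim)) * 0.5,
            rng.normal(size=(hidden_dim, hidden_dim)) * 0.5,
            np.zeros(hidden_dim))

fwd_params, bwd_params = init_params(), init_params()
xs = rng.normal(size=(steps, input_dim))
out = bidirectional_rnn(xs, fwd_params, bwd_params)  # shape: (steps, 2 * hidden_dim)
```

Each row of `out` holds the forward state (a summary of the past) next to the backward state (a summary of the future) for that time step.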

Bidirectional RNNs

Stacked RNNs consist of multiple RNN layers, where the output sequence of one layer serves as the input to the next. The lower layers can induce representations that serve as useful abstractions for the layers above, representations that might be difficult to induce in a single RNN. As a result, stacked RNNs can outperform single-layer networks.
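
As a rough sketch of the layering described above, the following NumPy code feeds the full hidden-state sequence of one simple tanh RNN layer into the next. The layer sizes, random weights, and helper names (`rnn_layer`, `stacked_rnn`) are assumptions chosen for illustration.

```python
import numpy as np

def rnn_layer(xs, W_x, W_h, b):
    """One simple tanh RNN layer: maps a sequence of inputs to a sequence of hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def stacked_rnn(xs, layers):
    seq = xs
    for W_x, W_h, b in layers:
        # The output sequence of this layer is the input sequence of the next.
        seq = rnn_layer(seq, W_x, W_h, b)
    return seq

rng = np.random.default_rng(0)
steps, input_dim = 6, 3
hidden_dims = [5, 4]  # hypothetical sizes for a two-layer stack

layers = []
in_dim = input_dim
for hid in hidden_dims:
    layers.append((rng.normal(size=(hid, in_dim)) * 0.5,
                   rng.normal(size=(hid, hid)) * 0.5,
                   np.zeros(hid)))
    in_dim = hid

top = stacked_rnn(rng.normal(size=(steps, input_dim)), layers)  # shape: (steps, hidden_dims[-1])
```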

Stacked RNNs

Gated Recurrent Units (GRUs) mitigate the vanishing gradient problem of simple RNNs while avoiding much of the parameter overhead of LSTMs. They do so by dispensing with the separate context vector and by reducing the number of gates to two: a reset (relevance) gate and an update gate.
The purpose of the reset gate is to decide which aspects of the previous hidden state are relevant to the current context and which can be ignored. The reset-gated previous state is combined with the current input to compute an intermediate (candidate) representation for the new hidden state.
The purpose of the update gate is to determine which aspects of this candidate state will be used directly in the new hidden state and which aspects of the previous state need to be preserved for future use.
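
A single GRU step might be sketched as follows. The parameter names (`W_*`, `U_*`, `b_*`) and the random initialisation are assumptions for illustration. The sketch also assumes the convention in which the update gate `z` weights the candidate state and `1 - z` weights the previous state; some references swap the two roles.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    # Reset gate: which parts of the previous hidden state are relevant now.
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])
    # Update gate: how much of the candidate replaces the previous state.
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])
    # Candidate state, computed from the input and the reset-gated previous state.
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Interpolate between the previous state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {}
for gate in ("r", "z", "h"):
    params[f"W_{gate}"] = rng.normal(size=(hidden_dim, input_dim)) * 0.5
    params[f"U_{gate}"] = rng.normal(size=(hidden_dim, hidden_dim)) * 0.5
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h = gru_step(x, h, params)
```

Note that there is no separate cell state: the hidden state `h` is the only vector carried from one step to the next.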

Gated recurrent unit (GRU)

Neural Turing Machines (NTMs) are neural network architectures that use external memory. These models can learn simple algorithmic tasks such as copying, sorting, and associative recall. According to the inventors of the architecture, NTMs "extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes".
NTMs were introduced in 2014, which makes them relatively new compared to simple RNNs and LSTMs.
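
To give a feel for what interacting with memory "by attentional processes" can mean, here is a minimal sketch of a content-based read: a key vector is compared with every memory row by cosine similarity, the similarities are sharpened by a strength `beta` and normalised with a softmax, and the read result is the weighted sum of the rows. It covers only this one read mechanism. A full NTM also has a controller network, write heads, and location-based addressing, none of which appear here. The memory contents, key, and function name are made up for the example.

```python
import numpy as np

def content_read(memory, key, beta):
    """Soft, content-based read over the rows of `memory`."""
    # Cosine similarity between the key and each memory row.
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    # Softmax over the sharpened similarities gives attention weights.
    scores = beta * sim
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The read vector is a weighted blend of all memory rows.
    return weights @ memory, weights

# Hypothetical 3-slot memory and a key that most resembles the second slot.
memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
key = np.array([0.1, 0.9, 0.0])
read_vector, weights = content_read(memory, key, beta=10.0)
```

Because every step is differentiable, the addressing can be trained with gradient descent along with the rest of the network.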

Neural Turing Machines (NTM)

https://arxiv.org/pdf/1410.5401.pdf

Neural Turing Machines - Original Paper Reference

LSTM-Based RNN Architecture

An LSTM cell allows the network to pick up pertinent information and save it, injecting it back into the model when necessary.

While a basic RNN unit has only a single activation function, an LSTM cell has four. Three of them are sigmoid activation functions that output numbers between 0 and 1, and each feeds a pointwise multiplication gate. A gate determines how much information passes through: 0 means no information passes, and 1 means all of it passes. The three gates (forget, input, and output) are used to save pertinent information for later time steps. The fourth is a tanh activation that proposes candidate values to add to the cell's memory.

During training, the network learns suitable values for the gate parameters, that is, how much of the information should be retained to help the network make the most accurate prediction.
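
The gate mechanics can be sketched as a single LSTM step in NumPy. This is an illustrative sketch under stated assumptions: all four transformations share one stacked weight matrix `W` (a common layout, not one prescribed by the text), and the weights are random, where a real network would learn them.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, input + hidden); b has shape (4 * hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[:hidden])                 # forget gate: what to remove from the cell state
    i = sigmoid(z[hidden:2 * hidden])       # input gate: how much new information to add
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * hidden:])             # candidate values (the fourth activation)
    c = f * c_prev + i * g                  # pointwise gating of old and new information
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(size=(4 * hidden_dim, input_dim + hidden_dim)) * 0.5
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x, h, c, W, b)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state `c` is carried forward almost unchanged, which is how information can survive across many time steps.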

LSTM Cell

LSTMs can be applied to a variety of deep learning tasks, most of which involve prediction based on previous information. Three noteworthy examples include:
 - Text prediction: The long-term memory capabilities of LSTMs mean they excel at predicting text sequences. In order to predict the next word in a sentence, the network has to retain information about the words that preceded it. One of the most common applications of text prediction is in chatbots used by eCommerce sites.
 - Stock prediction: Simple machine learning models can predict stock prices from inputs such as the opening value and the trading volume of the stock. While these values do play a part in stock prediction, they miss a key component: the trend of the stock. To capture it, the model needs to identify the trend from the values recorded over the preceding days, a task suited to an LSTM network.
 - Music composition: LSTMs can be applied here because music is built from long sequences of notes, much like text is built from long sequences of words.
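
For the stock example, "values recorded over the preceding days" usually translates into a windowing step before any model is trained: each training input is a run of consecutive past values and the target is the value that follows. The sketch below shows only that data preparation. The prices are made up, and the window length of 3 and the function name are arbitrary choices for illustration.

```python
def make_windows(prices, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(prices) - window):
        inputs.append(prices[start:start + window])
        targets.append(prices[start + window])
    return inputs, targets

# Hypothetical closing prices, not real market data.
closing_prices = [10.0, 10.5, 11.0, 10.8, 11.2, 11.6]
inputs, targets = make_windows(closing_prices, window=3)
# inputs[0] is [10.0, 10.5, 11.0] and targets[0] is 10.8
```

Each input window would then be fed to the LSTM one value per time step, with the matching target as the value to predict.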

Applications of Long Short-Term Memory Networks (LSTMs)

Training sequence models like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) is computationally costly. The expense comes largely from long-range dependencies within the sequence: each hidden state depends on the previous one, so the time steps must be processed in order and cannot be computed in parallel. Because of this bottleneck, alternative architectures such as Transformers are often preferred for modeling long or complex sequences.

Computational Cost of Training Sequence Models

Similar to vanilla Recurrent Neural Networks (RNNs), a Long Short-Term Memory (LSTM) model can be implemented concisely by directly instantiating high-level API modules in modern deep learning frameworks. This approach encapsulates all the low-level configuration details, such as explicitly defining the input, forget, and output gates or manually initializing their weights and biases. Using high-level APIs allows the model to execute significantly faster, as the operations are performed using highly optimized, compiled backend operators rather than iterating through standard Python loops.
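
As one possible illustration, the sketch below uses PyTorch's `nn.LSTM` module. The text does not name a framework, so the choice of PyTorch is an assumption, and the tensor sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, steps, input_dim, hidden_dim = 2, 5, 3, 4

# The module creates and initialises all gate weights and biases internally.
lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, batch_first=True)

x = torch.randn(batch, steps, input_dim)
# outputs: hidden state at every time step, shape (batch, steps, hidden_dim)
# h_n, c_n: final hidden and cell states, shape (num_layers, batch, hidden_dim)
outputs, (h_n, c_n) = lstm(x)
```

No gate is defined by hand here, and the loop over time steps runs inside the framework's backend instead of in Python.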
