Training recurrent sequence models such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) is computationally costly, primarily because they must process long-range dependencies step by step along the sequence. Because of this bottleneck, alternative architectures such as Transformers are often preferred for modeling complex sequences.

The long short-term memory (LSTM) network is an extension of RNNs that addresses problems such as the vanishing gradient problem and the inability of RNNs to carry forward critical information.

LSTMs divide the context management problem into two sub-problems: 
- Removing information no longer needed from the context
- Adding information likely to be needed for later decision making
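
The two sub-problems above line up with the two terms of the commonly used LSTM cell-state update. The symbols are assumptions borrowed from standard notation rather than from the text above: C_{t-1} is the previous cell state (the context), F_t is the forget gate, I_t is the input gate, \tilde{C}_t is the candidate content, and \odot is elementwise multiplication.

```latex
C_t \;=\; \underbrace{F_t \odot C_{t-1}}_{\text{remove what is no longer needed}}
\;+\; \underbrace{I_t \odot \tilde{C}_t}_{\text{add what may be needed later}}
```

With F_t near 0 an entry of the old context is erased, and with I_t near 1 the corresponding new candidate value is written in.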

Long Short-Term Memory (LSTM) Network

Dive into Deep Learning

An LSTM cell allows the network to pick up such pertinent information and save it, injecting it back into the model when necessary.

While a basic RNN has only a single simple activation function, an LSTM cell has four. Three of them are sigmoid activation functions that output numbers between 0 and 1, and each feeds a pointwise multiplication gate (the fourth is a tanh). A gate determines how much information gets through: 0 means no information passes, and 1 means all of the information passes. These three gates are used to save pertinent information for later stages of the learning process.

During training, the network learns the optimal values for the gates, that is, how much of the information should be retained to help the network make the most accurate prediction.
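
The gating described above can be sketched as a single time step in NumPy. This is a minimal illustration, not a reference implementation: the function name `lstm_cell_step`, the stacked weight layout, the gate order (input, forget, output, candidate), and the random untrained weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    """Squash values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """Run one LSTM time step (hypothetical helper for illustration).

    Assumed shapes: x (D,), h_prev and c_prev (H,),
    W (4H, D), U (4H, H), b (4H,).
    Assumed gate order in the stacked weights: input, forget, output, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all four pre-activations at once
    i = sigmoid(z[0:H])                   # input gate: how much new content to add
    f = sigmoid(z[H:2 * H])               # forget gate: how much old context to keep
    o = sigmoid(z[2 * H:3 * H])           # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])           # candidate content (the fourth activation)
    c = f * c_prev + i * g                # pointwise multiplications gate the cell state
    h = o * np.tanh(c)                    # hidden state passed to the next step
    return h, c, (i, f, o)

# Untrained random weights, just to exercise the cell on a short sequence.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):         # five time steps of made-up input
    h, c, gates = lstm_cell_step(x, h, c, W, U, b)
```

In a trained network, `W`, `U`, and `b` would be fitted by gradient descent, which is how the gate values come to reflect what is worth retaining.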

LSTM Cell

LSTMs can be applied to a variety of deep learning tasks, most of which involve prediction based on previous information. Three noteworthy examples include:
- Text prediction: The long-term memory capabilities of LSTMs mean they excel at predicting text sequences. To predict the next word in a sentence, the network has to retain information about the words that preceded it. One of the most common applications of text prediction is in chatbots used by eCommerce sites.
- Stock prediction: Simple Machine Learning (SML) models are able to predict stock values and prices based on inputs such as the opening value and the volume of the stock. While these values do play a part in stock prediction, they lack a key component. To predict a stock value with high accuracy, the model needs to take into account one of the biggest factors: the trend of the stock. To do so, the model needs to identify the trend from the values recorded over the preceding days, a task suited to an LSTM network.
- Music composition: LSTMs can be applied here because music is built from long sequences of notes, much like text is built from long sequences of words.
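
For the stock example in the list above, "values recorded over the preceding days" are usually handed to a sequence model as sliding windows. The sketch below shows one way to build them; the helper name `make_windows`, the `lookback` parameter, and the price values are made up for illustration and do not come from any real data set.

```python
import numpy as np

def make_windows(prices, lookback):
    """Turn a 1-D price series into (window, next value) training pairs.

    Hypothetical helper: each input window holds `lookback` consecutive
    prices, and the target is the price that immediately follows it.
    """
    prices = np.asarray(prices, dtype=float)
    if lookback < 1 or lookback >= len(prices):
        raise ValueError("lookback must be between 1 and len(prices) - 1")
    n_windows = len(prices) - lookback
    X = np.stack([prices[i:i + lookback] for i in range(n_windows)])
    y = prices[lookback:]
    # Add a trailing feature axis: (n_windows, lookback, 1) is the
    # (batch, time, features) layout many sequence models expect.
    return X[..., np.newaxis], y

# Made-up closing prices for six consecutive days.
closing_prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
X, y = make_windows(closing_prices, lookback=3)
print(X.shape)  # (3, 3, 1)
print(y)        # [13. 14. 15.]
```

Each row of `X` is one short history the network reads step by step, so the trend is available to it as a sequence rather than as a single day's snapshot.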

Applications of Long Short-Term Memory Networks (LSTMs)

In 1997, researchers Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM), which enables Recurrent Neural Networks (RNNs) to retain information over extended sequences rather than merely between consecutive time steps. LSTMs gained significant recognition through victories in prediction competitions during the mid-2000s and became the dominant architecture for sequence learning from 2011 until the rise of Transformer models beginning in 2017. Even Transformers owe some of their key ideas to architectural design innovations first introduced by the LSTM. An LSTM-based RNN shares the same high-level architecture as a basic RNN (whether simple, bidirectional, or deep), but replaces the simple activation function in each recurrent unit with a specialized LSTM cell.

LSTM-Based RNN Architecture

Computational Cost of Training Sequence Models

Similar to vanilla Recurrent Neural Networks (RNNs), a Long Short-Term Memory (LSTM) model can be implemented concisely by directly instantiating high-level API modules in modern deep learning frameworks. This approach encapsulates all the low-level configuration details, such as explicitly defining the input, forget, and output gates or manually initializing their weights and biases. Using high-level APIs allows the model to execute significantly faster, as the operations are performed using highly optimized, compiled backend operators rather than iterating through standard Python loops.
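
As one possible illustration of the high-level approach, the sketch below uses PyTorch's `torch.nn.LSTM`. The choice of PyTorch and all tensor sizes are assumptions for the example; the text above does not name a specific framework.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Arbitrary sizes chosen only for the example.
batch_size, seq_len, input_size, hidden_size = 2, 5, 8, 16

# One call creates all gate weights and biases and initializes them.
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)

x = torch.randn(batch_size, seq_len, input_size)   # made-up input batch
output, (h_n, c_n) = lstm(x)                       # no explicit Python loop over time

print(output.shape)  # torch.Size([2, 5, 16]): hidden state at every time step
print(h_n.shape)     # torch.Size([1, 2, 16]): final hidden state per layer
print(c_n.shape)     # torch.Size([1, 2, 16]): final cell state per layer
```

The loop over time steps runs inside the framework's compiled operators, which is where the speed advantage over a hand-written Python loop comes from.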
