Concept
Concept for Training a Neural LM
When training a neural language model, minimizing the loss yields both an algorithm for language modeling and a new set of embeddings that can be used as word representations for other tasks. With that in mind, training proceeds by taking a very long text as input, concatenating all the sentences, starting with random weights, and iteratively moving through the text predicting each word. The architecture sets the parameters θ = E, W, U, b and uses gradient descent, with error backpropagation on the computation graph, to compute the gradient. Training thus effectively sets the weights (W, U) and learns the embeddings (E) that predict upcoming words.
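A minimal sketch of this training loop, assuming a feedforward architecture in PyTorch: the names `FeedforwardLM`, the toy random token stream, and the embedding/hidden dimensions and sigmoid activation are illustrative assumptions, not part of the original card. It shows the parameters E, W, U, b initialized randomly and updated by gradient descent as the model slides through the text predicting each word.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedforwardLM(nn.Module):
    """Hypothetical feedforward LM with parameters theta = (E, W, U, b)."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, context=3):
        super().__init__()
        self.E = nn.Embedding(vocab_size, embed_dim)             # embeddings E
        self.W = nn.Linear(context * embed_dim, hidden_dim)      # weights W, bias b
        self.U = nn.Linear(hidden_dim, vocab_size, bias=False)   # output weights U

    def forward(self, context_ids):                 # context_ids: (batch, context)
        e = self.E(context_ids).flatten(1)          # concatenate context embeddings
        h = torch.sigmoid(self.W(e))                # hidden layer h = sigma(W e + b)
        return self.U(h)                            # logits z = U h

# Assumed toy corpus: one long concatenated stream of token ids.
vocab_size, context = 100, 3
tokens = torch.randint(0, vocab_size, (1000,))

model = FeedforwardLM(vocab_size, context=context)   # random initial weights
opt = torch.optim.SGD(model.parameters(), lr=0.1)    # gradient descent

# Move through the text, predicting each word from its preceding context.
for i in range(context, len(tokens)):
    ctx = tokens[i - context:i].unsqueeze(0)    # (1, context)
    target = tokens[i].unsqueeze(0)             # the upcoming word to predict
    loss = F.cross_entropy(model(ctx), target)  # negative log-likelihood loss
    opt.zero_grad()
    loss.backward()    # error backpropagation on the computation graph
    opt.step()         # update E, W, U, b
```

After training, `model.E.weight` holds the learned embeddings, which is what lets them be reused as word representations for other tasks.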
Updated 2021-11-07
Tags
Data Science