The input to a Listen, Attend, and Spell (LAS) model is a sequence of $$t$$ acoustic feature vectors $$F = f_1, f_2, \dots, f_t$$, one vector per 10-millisecond frame. Assuming the output consists of letters, the output sequence is $$Y = (\langle SOS \rangle, y_1, \dots, y_m, \langle EOS \rangle)$$, where $$\langle SOS \rangle$$ is a special start-of-sequence token and $$\langle EOS \rangle$$ is a special end-of-sequence token. The associated image shows a character set often chosen for English.
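As a rough illustration of how a transcript becomes the output sequence $$Y$$, here is a minimal sketch. The character inventory `CHARSET` and the helper `to_target_sequence` are hypothetical names chosen for this example; the actual character set is the one shown in the image.

```python
# Hypothetical character inventory (assumption): lowercase letters, space, apostrophe.
SOS, EOS = "<SOS>", "<EOS>"
CHARSET = set("abcdefghijklmnopqrstuvwxyz '")

def to_target_sequence(transcript: str) -> list[str]:
    """Wrap the transcript's characters between the <SOS> and <EOS> tokens."""
    # Characters outside the assumed inventory are simply dropped.
    chars = [c for c in transcript.lower() if c in CHARSET]
    return [SOS] + chars + [EOS]

Y = to_target_sequence("it's time")
print(Y)
```

Here $$m$$ is the number of characters kept, so the sequence has $$m + 2$$ entries once the two special tokens are included.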

Columbia University

Google

The standard encoder-decoder architecture for ASR is generally called the attention-based encoder-decoder (AED), or Listen, Attend, and Spell (LAS). The following image sketches the schematic architecture of an encoder-decoder speech recognizer.

Listen, Attend, and Spell (LAS)


The first step in ASR is to transform the input waveform into a sequence of acoustic feature vectors, where each vector represents the information in a small time window of the signal.
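The windowing step can be sketched as follows. This is only an illustration under stated assumptions: a 16 kHz sample rate, a 25 ms window, and a 10 ms stride are typical values but are not given in the text, and `frame_signal` is a hypothetical helper name.

```python
def frame_signal(samples, sample_rate=16000, window_ms=25, stride_ms=10):
    """Cut a waveform into overlapping windows (all parameter values are assumptions)."""
    window = sample_rate * window_ms // 1000   # samples per window
    stride = sample_rate * stride_ms // 1000   # samples between window starts
    frames = []
    start = 0
    # Keep only windows that fit entirely inside the signal.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += stride
    return frames

one_second = [0.0] * 16000          # one second of silence as dummy input
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))
```

Each window would then be converted into one acoustic feature vector, which yields roughly one vector per 10 ms of audio.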

Feature Vector

A helpful book on NLP, still in progress as an online draft
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

The encoder-decoder architecture is particularly appropriate when the input and output sequences differ significantly in length, as they do for speech: long acoustic feature sequences map to much shorter sequences of letters or words. For example, a single word might be 5 letters long, but if it is spoken over 2 seconds, it spans 200 acoustic frames of 10 milliseconds each. Because of this extreme length difference, encoder-decoder architectures for speech need a special compression stage that shortens the acoustic feature sequence before the encoder. An alternative is to use a loss function designed to handle the length mismatch, such as the connectionist temporal classification (CTC) loss.
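To make the length mismatch concrete, the sketch below reproduces the 200-frame arithmetic and then applies one possible compression scheme: stacking groups of consecutive frames into a single longer vector. The text does not say which compression method is meant, so the stacking approach, the factor of 3, the 40-dimensional features, and the name `stack_and_subsample` are all assumptions made for illustration.

```python
def stack_and_subsample(features, factor=3):
    """Concatenate each group of `factor` consecutive vectors into one longer vector."""
    compressed = []
    # Leftover frames that do not fill a whole group are dropped.
    for start in range(0, len(features) - factor + 1, factor):
        stacked = []
        for vec in features[start:start + factor]:
            stacked.extend(vec)
        compressed.append(stacked)
    return compressed

num_frames = 2000 // 10                               # 2 seconds at 10 ms per frame
features = [[0.0] * 40 for _ in range(num_frames)]    # assumed 40-dim feature vectors
compressed = stack_and_subsample(features)
print(num_frames, len(compressed), len(compressed[0]))
```

With these assumed values the sequence shrinks to about a third of its length while each vector becomes three times as wide, so the encoder sees far fewer timesteps for the same 5-letter word.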

Compression in LAS

An encoder-decoder model is basically a conditional language model, so encoder-decoders implicitly must learn a language model for the output domain of letters from training data. However, the training data, which is generally speech paired with text transcriptions, may not include as much text as you need to train a good language model, since it’s easier to find large amounts of pure text training data than it is to find text paired with speech. Therefore, a model for ASR can usually be improved by incorporating a very large language model. 
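One common way to incorporate an external language model is to rescore an n-best list of hypotheses by interpolating the recognizer's score with the language model's score. The text does not specify a method, so this is a sketch of that one option. The log probabilities are made-up toy values, and `rescore`, `toy_lm`, and the weight `lam` are hypothetical.

```python
def rescore(nbest, lm_log_prob, lam=0.3):
    """Rank hypotheses by ASR log prob plus lam times the LM log prob (best first)."""
    scored = [(asr_lp + lam * lm_log_prob(text), text) for text, asr_lp in nbest]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]

# Toy n-best list: (hypothesis, log probability from the encoder-decoder). Invented values.
nbest = [("wreck a nice beach", -3.8), ("recognize speech", -4.0)]
# Toy language model scores, also invented for illustration.
toy_lm = {"wreck a nice beach": -9.0, "recognize speech": -2.0}

best = rescore(nbest, toy_lm.get)[0]
print(best)
```

In this toy setup the recognizer alone slightly prefers the first hypothesis, and the language model term flips the ranking. In practice the weight would be tuned on held-out data.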

Adding a language model

After the compression stage, encoder-decoders for speech use either RNNs (LSTMs) or Transformers.

LAS Output

LAS Input and Output Sequences

Listen, Attend, and Spell (LAS) encoder-decoder models for speech are trained using standard cross-entropy loss, similar to conditional language models. At decoding timestep $$i$$, the loss is evaluated as the negative log probability of the correct target token (which is typically a letter), denoted as $$y_i$$.
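The per-timestep loss can be sketched as below, with the loss for a whole sequence taken as the sum over decoding steps. The probability values are invented for illustration, and `cross_entropy_loss` is a hypothetical helper, not part of any particular LAS implementation.

```python
import math

def cross_entropy_loss(step_distributions, target):
    """Sum over decoding steps of -log p(correct token y_i)."""
    total = 0.0
    for probs, y_i in zip(step_distributions, target):
        total += -math.log(probs[y_i])
    return total

# Invented decoder output distributions for three decoding steps.
dists = [
    {"c": 0.7, "a": 0.2, "t": 0.1},
    {"c": 0.1, "a": 0.8, "t": 0.1},
    {"c": 0.1, "a": 0.1, "t": 0.8},
]
loss = cross_entropy_loss(dists, ["c", "a", "t"])
print(round(loss, 4))
```

The loss shrinks toward zero as the model places more probability on the correct letter at every step.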

Training Loss in Listen, Attend, and Spell (LAS)

The log mel spectrum is the feature representation most commonly used when converting a raw wave file into a sequence of feature vectors.

Log Mel Spectrum

In feature extraction for ASR, we must extract spectral information from the windowed signal, that is, determine how much energy the signal contains in different frequency bands. This is done with the discrete Fourier transform (DFT).
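As a small illustration of what the DFT recovers, the sketch below computes the energy per frequency bin for one frame using a direct, unoptimized DFT. Real systems use an FFT. The 8 kHz sample rate, the 64-sample frame, and the 1000 Hz test tone are assumptions chosen so the tone falls exactly on one bin, and the frame is treated as already windowed.

```python
import cmath
import math

def dft_power(frame):
    """Energy |X[k]|^2 for bins k = 0 .. N/2 of a real-valued frame (naive O(N^2) DFT)."""
    N = len(frame)
    power = []
    for k in range(N // 2 + 1):
        x_k = sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        power.append(abs(x_k) ** 2)
    return power

sample_rate = 8000   # assumed sample rate in Hz
N = 64               # assumed frame length in samples
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(N)]

power = dft_power(frame)
peak_bin = max(range(len(power)), key=power.__getitem__)
peak_hz = peak_bin * sample_rate / N   # each bin is sample_rate / N Hz wide
print(peak_bin, peak_hz)
```

Nearly all of the energy lands in the bin whose center frequency matches the tone, which is the information the later filter bank stages build on.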
