Learn Before
Concept

Compression in LAS

The encoder-decoder architecture is well suited to tasks where input and output sequences differ greatly in length, and speech is such a task: long acoustic feature sequences map to much shorter sequences of letters or words. For example, a word might be only 5 letters long, yet if it is spoken over 2 seconds it corresponds to 200 acoustic frames of 10 milliseconds each. Because of this extreme length mismatch, encoder-decoder architectures for speech require a special compression stage that shortens the acoustic feature sequence before encoding. An alternative is to use a loss function that handles the length mismatch directly, such as connectionist temporal classification (CTC).
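One common form of this compression stage, used in the pyramidal encoder of LAS, halves the time dimension at each layer by concatenating pairs of adjacent frames. Below is a minimal NumPy sketch of that frame-stacking idea; the function name and the 40-dimensional feature size are illustrative assumptions, not part of the original text.

```python
import numpy as np

def pyramid_subsample(frames: np.ndarray) -> np.ndarray:
    """Halve the time dimension by concatenating each pair of adjacent
    frames along the feature axis: (T, D) -> (T // 2, 2 * D).
    This mirrors the frame stacking in a pyramidal encoder layer."""
    T, D = frames.shape
    if T % 2:            # drop a trailing odd frame so T is even
        frames = frames[:-1]
    return frames.reshape(-1, 2 * D)

# 2 seconds of speech at 10 ms per frame: 200 frames of
# (hypothetical) 40-dimensional acoustic features
acoustic = np.random.randn(200, 40)

layer1 = pyramid_subsample(acoustic)   # shape (100, 80)
layer2 = pyramid_subsample(layer1)     # shape (50, 160)
layer3 = pyramid_subsample(layer2)     # shape (25, 320)
```

After three such layers the encoder processes 25 time steps instead of 200, an 8x reduction, which brings the acoustic sequence length much closer to the length of the letter sequence the decoder must produce.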


Updated 2022-05-08

Tags

Deep Learning (in Machine learning)

Data Science

Related
Learn After