The standard encoder-decoder architecture for ASR is generally called the attention-based encoder-decoder (AED), or listen, attend and spell (LAS). The following image sketches the schematic architecture of an encoder-decoder speech recognizer.

Listen attend and spell (LAS)


The architecture for ASR is the encoder-decoder, implemented with recurrent neural networks (LSTMs) or transformers. It is common to begin with log mel spectral features and map to letters. It is also possible to map to induced morpheme-like chunks, such as word pieces or BPE (byte-pair encoding) units.

Columbia University

Automatic speech recognition is a sequence-to-sequence learning task where the input sequence is an audio recording of a speaker and the output is a text transcript of the spoken words. A significant challenge in this domain is that there is no one-to-one correspondence between audio frames and text, as thousands of audio samples may correspond to a single word, making the input sequence much longer than the output sequence.

Automatic Speech Recognition

A helpful book on NLP, still in progress:
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

Vocabulary size, who the speaker is talking to, channel and noise, and speaker-class characteristics.

Dimensions of ASR Variation

LibriSpeech is a large open-source read-speech 16 kHz dataset with over 1000 hours of audio books from the LibriVox project.

LibriSpeech

The Switchboard corpus of prompted telephone conversations between strangers was collected in the early 1990s.

The Switchboard corpus

The CALLHOME corpus was collected in the late 1990s and consists of 120 unscripted 30-minute telephone conversations between native speakers of English who were usually close friends or family.

The CALLHOME corpus

CORAAL is a collection of over 150 sociolinguistic interviews with African American speakers, with the goal of studying African American Language (AAL).

CORAAL

The CHiME Challenge is a series of difficult shared tasks with corpora that deal with robustness in ASR.

CHiME

The HKUST Mandarin Telephone Speech corpus has 1206 ten-minute telephone conversations between speakers of Mandarin across China, including transcripts of the conversations, which are between either friends or strangers.

The HKUST Mandarin Telephone Speech corpus

The AISHELL-1 corpus contains 170 hours of Mandarin read speech of sentences taken from various domains, read by different speakers mainly from northern China.

The AISHELL-1 corpus

The first step in ASR is to transform the input waveform into a sequence of acoustic feature vectors, where each vector represents the information in a small time window of the signal.

Feature Vector
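
As a rough illustration of the "small time window" idea, the sketch below slices a signal into overlapping frames. The 25 ms window and 10 ms stride are assumed values chosen for the example, and `frame_signal` is a hypothetical helper name. A real front end would go on to compute spectral features (for example log mel) from each frame, which this sketch does not do.

```python
def frame_signal(samples, sample_rate, window_ms=25.0, stride_ms=10.0):
    """Split a list of samples into overlapping fixed-length windows.

    window_ms and stride_ms are assumed defaults for illustration only.
    """
    window_len = int(round(sample_rate * window_ms / 1000.0))
    stride_len = int(round(sample_rate * stride_ms / 1000.0))
    frames = []
    start = 0
    # Keep only windows that fit entirely inside the signal.
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += stride_len
    return frames


# One second of silence at 16 kHz, used only to count frames.
one_second = [0.0] * 16000
frames = frame_signal(one_second, sample_rate=16000)
print(len(frames), len(frames[0]))
```

Each element of `frames` is the raw material for one feature vector.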

Analog-to-digital conversion turns the analog representations of sound (first air pressure, then analog electric signals in a microphone) into a digital signal. It has two steps: sampling and quantization.

Analog-to-digital Conversion
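
A minimal sketch of the idea, assuming a 16 kHz sampling rate and 16-bit signed integers (both are assumptions for the example, as is the helper name `sample_and_quantize`). A pure sine wave stands in for the analog signal: it is measured at evenly spaced times, and each measurement is rounded to an integer.

```python
import math


def sample_and_quantize(freq_hz, duration_s, sample_rate=16000, bits=16):
    """Sample a unit-amplitude sine wave and round each sample to a signed integer."""
    max_int = 2 ** (bits - 1) - 1  # 32767 when bits == 16
    num_samples = int(round(sample_rate * duration_s))
    digital = []
    for i in range(num_samples):
        t = i / sample_rate  # time of the i-th sample in seconds
        value = math.sin(2 * math.pi * freq_hz * t)  # "analog" value in [-1, 1]
        digital.append(int(round(value * max_int)))  # quantized integer value
    return digital


digital = sample_and_quantize(freq_hz=440.0, duration_s=0.01)
print(len(digital), min(digital), max(digital))
```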

Automatic Speech Recognition Architecture

Word error rate (WER) is the standard evaluation metric for speech recognition systems. It is based on how much the word string returned by the recognizer (the hypothesized word string) differs from a reference transcription.

Word Error Rate
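
WER is commonly computed as 100 × (insertions + substitutions + deletions) / (number of words in the reference), where the error counts come from the minimum edit distance between the two word strings. The sketch below follows that definition. The function name `word_error_rate` and the whitespace tokenization are assumptions for the example, and it reports only the total error count, not the separate error types.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level minimum edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 2))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.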

The MAPSSWE test is a parametric test that looks at the difference between the number of word errors the two systems produce, averaged across a number of segments.

The Matched-Pair Sentence Segment Word Error (MAPSSWE) test
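
One way to sketch the test statistic, assuming per-segment error counts are already available for both systems: take the per-segment differences, then divide their mean by its estimated standard error. The function name `mapsswe_statistic` is hypothetical, and the p-value uses a normal approximation that is only reasonable when the number of segments is large. The four-segment input below is there only to make the arithmetic easy to follow.

```python
import math
import statistics


def mapsswe_statistic(errors_a, errors_b):
    """Return (w, two_tailed_p) from per-segment word error counts of two systems."""
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same segments")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    w = mean_diff / (sd_diff / math.sqrt(n))
    # Two-tailed p-value under a standard normal approximation.
    p_value = math.erfc(abs(w) / math.sqrt(2))
    return w, p_value


# Toy data: differences are [1, 0, 2, 1].
w, p_value = mapsswe_statistic([2, 1, 3, 2], [1, 1, 1, 1])
print(w, p_value)
```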

In statistics, McNemar's test is a statistical test used on paired nominal data. It is applied to $2 \times 2$ contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity").

McNemar's test
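
A small sketch of the usual statistic, which depends only on the two discordant cells of the table: $\chi^2 = (b - c)^2 / (b + c)$ with one degree of freedom. Here `b` and `c` are illustrative counts, for instance items one system got right and the other got wrong, and the reverse. The helper name `mcnemar` is hypothetical, and no continuity correction is applied.

```python
import math


def mcnemar(b, c):
    """Return (chi_square, p_value) for discordant pair counts b and c."""
    if b + c == 0:
        raise ValueError("need at least one discordant pair")
    chi_square = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


chi_square, p_value = mcnemar(b=10, c=2)
print(chi_square, p_value)
```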

McNemar’s test is only applicable when the errors made by the system are independent, which is not true in continuous speech recognition, where errors made on a word are extremely dependent on errors made on neighboring words.

Disadvantage of McNemar's test

Intelligent assistants—such as Siri, Alexa, and Google Assistant—utilize artificial intelligence to accurately respond to spoken requests. A key driver behind their widespread adoption is advanced automatic speech recognition technology, which has improved sufficiently to achieve human parity in certain applications, enabling these systems to manage everything from simple smart home controls to complex conversational support.

Intelligent Assistants

Acoustic waves can be mathematically represented using sine and cosine functions:

$$y = A \sin(2\pi ft)$$

Two key characteristics of an acoustic wave are its amplitude ($A$) and frequency ($f$).
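
To make the roles of the two parameters concrete, the sketch below evaluates the formula directly. The function name `sine_wave` and the example values (amplitude 0.5, frequency 100 Hz) are assumptions for illustration. The wave peaks at the amplitude a quarter of a period after $t = 0$ and repeats every $1/f$ seconds.

```python
import math


def sine_wave(amplitude, freq_hz, t):
    """Evaluate y = A * sin(2 * pi * f * t) at time t (in seconds)."""
    return amplitude * math.sin(2 * math.pi * freq_hz * t)


period = 1 / 100.0  # seconds per cycle at 100 Hz
peak = sine_wave(0.5, 100.0, period / 4)  # a quarter period in: the maximum
later = sine_wave(0.5, 100.0, 0.001 + period)  # one full period after t = 0.001
print(peak, sine_wave(0.5, 100.0, 0.001), later)
```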
