Learn Before
Concept

Compression in LAS

The encoder-decoder architecture is well suited to tasks where input and output sequences differ greatly in length, and speech is such a task: long acoustic feature sequences map to much shorter sequences of letters or words. For example, a word might be only 5 letters long, yet if it is spoken over 2 seconds it corresponds to 200 acoustic frames of 10 milliseconds each. Because of this extreme length mismatch, encoder-decoder architectures for speech require a special compression stage that shortens the acoustic feature sequence before encoding. An alternative is to use a loss function that handles the length mismatch directly, such as connectionist temporal classification (CTC).
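One common form of this compression stage, used in the pyramidal encoder of LAS, halves the time dimension at each layer by concatenating pairs of adjacent frames. Below is a minimal NumPy sketch of that frame-stacking idea; the function name and the 40-dimensional feature size are illustrative assumptions, not part of the original text.

```python
import numpy as np

def pyramid_subsample(frames: np.ndarray) -> np.ndarray:
    """Halve the time dimension by concatenating each pair of adjacent
    frames along the feature axis: (T, D) -> (T // 2, 2 * D).
    This mirrors the frame stacking in a pyramidal encoder layer."""
    T, D = frames.shape
    if T % 2:            # drop a trailing odd frame so T is even
        frames = frames[:-1]
    return frames.reshape(-1, 2 * D)

# 2 seconds of speech at 10 ms per frame: 200 frames of
# (hypothetical) 40-dimensional acoustic features
acoustic = np.random.randn(200, 40)

layer1 = pyramid_subsample(acoustic)   # shape (100, 80)
layer2 = pyramid_subsample(layer1)     # shape (50, 160)
layer3 = pyramid_subsample(layer2)     # shape (25, 320)
```

After three such layers the encoder processes 25 time steps instead of 200, an 8x reduction, which brings the acoustic sequence length much closer to the length of the letter sequence the decoder must produce.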


Updated 2022-05-08

Tags

Deep Learning (in Machine learning)

Data Science

Related
Learn After