1Cademy - Upgrading a Speech Recognition Architecture

Learn Before

Large End-to-End Neural Networks Can Avoid Representation Limits

Case Study

Upgrading a Speech Recognition Architecture

Case context: You are leading a team building a speech recognition system. Your current system relies heavily on MFCC features and phoneme-based representations, and its error rate has plateaued well above the optimal error rate. You recently acquired a massive new dataset of audio recordings with perfectly matched transcripts.

Question: Based on end-to-end learning principles, what structural changes should you make to your system's architecture to utilize this new dataset and overcome the current plateau?

Sample answer: The team should move away from the pipeline built on MFCCs and phonemes and implement an end-to-end architecture instead. This means designing a sufficiently large neural network that learns the mapping from raw audio input directly to text output, using the massive new dataset to bypass the limits imposed by the hand-engineered representations.
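
To make the "audio in, text out" idea concrete, here is a minimal sketch of how per-frame character scores from an end-to-end network could be turned into a transcript with no phoneme layer in between. It assumes CTC-style greedy decoding, which the case does not specify; the alphabet, the blank symbol, the function name greedy_ctc_decode, and the hand-written scores are all hypothetical stand-ins for real network output.

```python
from itertools import groupby

# Hypothetical toy alphabet; "_" is an assumed CTC-style blank symbol.
BLANK = "_"
ALPHABET = [BLANK, " ", "a", "c", "t"]


def greedy_ctc_decode(frame_scores, alphabet=ALPHABET, blank=BLANK):
    """Turn per-frame character scores into text.

    frame_scores: one list of scores per audio frame, one score per symbol
    in `alphabet`. In a real system these would come from the network.
    """
    # 1. Pick the highest-scoring symbol for every frame.
    best = [
        alphabet[max(range(len(row)), key=row.__getitem__)]
        for row in frame_scores
    ]
    # 2. Collapse consecutive repeats, then 3. drop the blank symbol.
    collapsed = [symbol for symbol, _ in groupby(best)]
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Hand-written scores standing in for the network's output on six frames.
frame_scores = [
    [0.1, 0.0, 0.1, 0.7, 0.1],  # "c"
    [0.1, 0.0, 0.1, 0.7, 0.1],  # "c" (repeat, collapsed)
    [0.8, 0.0, 0.1, 0.1, 0.0],  # blank
    [0.1, 0.0, 0.8, 0.1, 0.0],  # "a"
    [0.1, 0.0, 0.1, 0.1, 0.7],  # "t"
    [0.1, 0.0, 0.1, 0.1, 0.7],  # "t" (repeat, collapsed)
]
transcript = greedy_ctc_decode(frame_scores)
print(transcript)
```

The point of the sketch is structural: the output symbols are characters of the transcript itself, so the matched audio and transcript pairs in the new dataset can supervise the network directly, without MFCC or phoneme annotations.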

Key points: