Case Study

Upgrading a Speech Recognition Architecture

Case context: You are leading a team building a speech recognition system. The current system relies heavily on MFCC features and phoneme-based representations, and its error rate has plateaued well above the optimal error rate. You recently acquired a massive new dataset of audio recordings and perfectly matched transcripts.
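
To make the "error rate" in the case context concrete, here is a minimal sketch of one common way to score a speech recognizer: word error rate (WER), the word-level edit distance between a reference transcript and the system's output, divided by the reference length. The case does not say which metric the team uses, so WER and the function name `word_error_rate` are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Tracking a metric like this on a held-out set is one way to see whether a system has plateaued and how far it sits from the best achievable error rate.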

Question: Based on end-to-end learning principles, what structural changes should you make to your system's architecture to utilize this new dataset and overcome the current plateau?

Sample answer: The team should move away from the pipeline built on MFCCs and phonemes and implement an end-to-end architecture instead. This means designing a sufficiently large neural network that learns the mapping from raw audio input directly to text output, using the massive new dataset so that the hand-designed representations no longer limit performance.
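
The structural change in the sample answer can be sketched as follows. This is only an illustration of the two shapes, not a working recognizer: every stage below (`extract_mfcc`, `recognize_phonemes`, `decode_words`, `end_to_end_model`) is a hypothetical stand-in that returns fixed dummy values, and `make_pipeline` is a helper invented for this example.

```python
from typing import Callable, List, Sequence

Audio = Sequence[float]  # raw waveform samples


def make_pipeline(stages: List[Callable]) -> Callable[[Audio], str]:
    """Chain stages so that each stage consumes the previous stage's output."""
    def transcribe(audio: Audio) -> str:
        value = audio
        for stage in stages:
            value = stage(value)
        return value
    return transcribe


# --- Current design: hand-designed intermediate representations ---
def extract_mfcc(audio):
    # Stand-in: a real extractor would compute MFCC frames from the waveform.
    return [[sample] for sample in audio]


def recognize_phonemes(frames):
    # Stand-in: a real recognizer would be a model trained to emit phonemes.
    return ["k", "ae", "t"] if frames else []


def decode_words(phonemes):
    # Stand-in: a real decoder would use a lexicon and a language model.
    lexicon = {("k", "ae", "t"): "cat"}
    return lexicon.get(tuple(phonemes), "")


# --- Proposed design: one learned mapping from audio to text ---
def end_to_end_model(audio):
    # Stand-in for a large neural network trained on (audio, transcript) pairs.
    return "cat" if audio else ""


current_system = make_pipeline([extract_mfcc, recognize_phonemes, decode_words])
proposed_system = make_pipeline([end_to_end_model])
```

Both systems expose the same audio-to-text interface. The difference is that the current system forces information through MFCC and phoneme representations fixed by hand, while the proposed system leaves the intermediate representation to be learned from the data.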

Key points:

  • Adopt an end-to-end learning approach
  • Use a sufficiently large neural network
  • Train directly on the massive new dataset of audio and transcripts
  • Remove the dependency on hand-designed MFCC features and phoneme representations

Rubric: The response must advise replacing the hand-designed feature pipeline with a large end-to-end neural network trained on the new data.

Updated 2026-06-13

Tags

Python Programming Language

Data Science

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Machine Learning Strategy

Machine Learning Yearning @ DeepLearning.AI