Essay

Designing a Mobile-Deployable BERT Encoder Under Tight Memory and Latency Constraints

Your company wants to deploy an on-device text understanding feature (intent classification + entity extraction) in a mobile app. The current server-side solution uses a standard BERT-style encoder pre-trained with masked language modeling (and optionally next sentence prediction) and fine-tuned for the tasks, but it is too large and slow for the phone. You are given a hard constraint of 60 MB total model storage and a strict latency budget, and you must propose a plan to produce a smaller BERT-like model while preserving as much accuracy as possible.
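For scale, here is a back-of-envelope storage estimate, assuming the published BERT-base dimensions (30,522-token WordPiece vocabulary, hidden size 768, 12 layers); it shows why the 60 MB budget rules out simply shipping the server model, even with 8-bit quantization:

```python
# Rough parameter count for a standard BERT-base encoder.
# Pooler and embedding LayerNorm are omitted (< 1 M params combined).

V, H, L = 30_522, 768, 12      # vocab size, hidden size, encoder layers
FF = 4 * H                     # feed-forward inner dimension (3072)

embeddings = V * H + 512 * H + 2 * H       # token + position + segment
per_layer = ((4 * H * H + 4 * H)           # attention: Q/K/V/output proj
             + (2 * H * FF + H + FF)       # feed-forward weights + biases
             + 4 * H)                      # two LayerNorms (weight + bias)
total = embeddings + L * per_layer

print(f"total params: {total / 1e6:.1f} M")       # ~108.9 M
print(f"fp32 storage: {total * 4 / 1e6:.0f} MB")  # ~436 MB
print(f"int8 storage: {total * 1 / 1e6:.0f} MB")  # ~109 MB, still > 60 MB
```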

Write a recommendation memo that:

(1) proposes a concrete compression strategy combining at least three of the following levers: vocabulary size, embedding size, cross-layer parameter sharing, and knowledge distillation;
(2) explains how changing the vocabulary size and embedding size affects the size of the embedding matrix and the model's downstream representational capacity (a worked arithmetic sketch follows below);
(3) explains how cross-layer parameter sharing changes the parameter count and what accuracy/expressivity risk it introduces in a deep encoder (see the sharing sketch below);
(4) explains how you would use a large teacher BERT to distill knowledge into your smaller student, and why distillation can partially offset the accuracy losses introduced by the other size-reduction choices (see the distillation sketch below).

Your answer should make the trade-offs explicit (what you gain and what you lose) and justify why the combined design is appropriate for on-device deployment.
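On point (2): the embedding matrix holds exactly vocab_size x embed_dim parameters, so the two levers multiply each other. A short illustrative sketch; the specific sizes, and the ALBERT-style factorization through a small embedding dimension E, are assumptions for illustration, not prescriptions:

```python
# Embedding storage scales as vocab_size * embed_dim.

def embed_mb(vocab, dim, bytes_per_param=4):
    """Size of the embedding matrix in MB (fp32 by default)."""
    return vocab * dim * bytes_per_param / 1e6

print(embed_mb(30_522, 768))   # ~93.8 MB: stock BERT-base vocabulary, fp32
print(embed_mb(16_000, 768))   # ~49.2 MB: shrinking the vocabulary alone
print(embed_mb(30_522, 128))   # ~15.6 MB: shrinking the embedding size alone

# ALBERT-style factorization: a V x E lookup plus an E x H projection,
# so the vocabulary size no longer multiplies the full hidden size H.
V, E, H = 30_522, 128, 768
factorized = V * E + E * H
print(f"{factorized * 4 / 1e6:.1f} MB")   # ~16.0 MB vs ~93.8 MB unfactorized
```

The trade-off is representational: a smaller vocabulary produces longer subword sequences (which costs latency), and a smaller embedding size caps how much lexical information each token vector can carry.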
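On point (3): cross-layer sharing stores one transformer block and applies it at every depth, so the encoder's parameter count shrinks from L copies of a layer to one, while compute per token is unchanged. A minimal PyTorch sketch; the class name and sizes are illustrative assumptions:

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style encoder: one transformer block reused at every depth."""
    def __init__(self, hidden=768, heads=12, num_layers=12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads,
            dim_feedforward=4 * hidden, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        # Same weights applied at every depth: params are those of ONE layer.
        for _ in range(self.num_layers):
            x = self.block(x)
        return x

# Unshared baseline: nn.TransformerEncoder deep-copies the layer 12 times.
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12,
                               dim_feedforward=3072, batch_first=True),
    num_layers=12)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"shared:   {count(SharedLayerEncoder()) / 1e6:.1f} M")  # ~7.1 M
print(f"unshared: {count(unshared) / 1e6:.1f} M")              # ~85.1 M
```

The risk is expressivity: every depth is forced to compute with identical weights, so a deep shared stack cannot specialize its layers the way an unshared one can, and it may need longer training or distillation to close the accuracy gap.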
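On point (4): the common recipe trains the student on a weighted blend of the hard-label task loss and a temperature-softened KL divergence to the teacher's logits (hidden-state or attention matching, as in TinyBERT-style approaches, can be layered on top). A minimal sketch of the soft-label objective; alpha and T here are tunable assumptions, not canonical values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, T=2.0):
    """Blend hard-label cross-entropy with soft-label KL to the teacher.

    T > 1 softens both distributions so the student sees the teacher's
    relative preferences among wrong classes; the standard T**2 factor
    keeps the soft-loss gradients on the same scale as the hard loss.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T ** 2)
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage: batch of 4 examples, 5 intent classes.
s = torch.randn(4, 5, requires_grad=True)   # student logits
t = torch.randn(4, 5)                       # teacher logits, no grad needed
y = torch.tensor([0, 2, 1, 4])
loss = distillation_loss(s, t, y)
loss.backward()
print(loss.item())
```

Because the teacher's soft targets carry inter-class similarity information that one-hot labels do not, this extra signal can recover part of the accuracy sacrificed to the smaller vocabulary, reduced embedding size, and shared layers.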
