Essay

Selecting a BERT Variant for a Regulated, On-Device Email Classification Feature

Your company is adding an on-device email classification feature (e.g., routing messages into “HR”, “Legal”, “Finance”, “Other”) for a regulated enterprise client. Constraints: (1) the model must run fully offline on employee laptops; (2) the client’s security team requires that the model not store a large, easily-extractable list of sensitive domain terms (e.g., internal project codenames) in a way that increases leakage risk if the model file is copied; (3) latency must be under 50 ms per email on typical hardware; (4) accuracy must remain within 1–2% of your current server-hosted BERT-base classifier.

Write a recommendation memo that proposes a concrete approach to produce an efficient BERT-based encoder for this setting. In your answer, explicitly connect and justify how you would (a) choose or modify the tokenizer/vocabulary size, (b) choose an embedding size, (c) decide whether to use cross-layer parameter sharing, and (d) apply knowledge distillation from a larger teacher model. Your memo must explain the trade-offs among these choices (e.g., how vocabulary size and embedding size affect the embedding matrix footprint and what that implies for both performance and leakage risk; how parameter sharing interacts with student capacity and distillation), and it must end with a clear final design choice and why it best satisfies all four constraints.
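As a reference point for parts (a) and (b), the footprint trade-off can be made concrete with back-of-the-envelope parameter counts. The sketch below (illustrative, assuming a ~30k-token vocabulary and BERT-base's 768-dim hidden size) compares a standard V×H embedding matrix with an ALBERT-style factorized embedding (V×E lookup plus E×H projection):

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """Standard BERT-style embedding matrix: one row per vocabulary token."""
    return vocab_size * embed_dim

def factorized_embedding_params(vocab_size: int, embed_dim: int,
                                hidden_dim: int) -> int:
    """ALBERT-style factorization: a V x E lookup plus an E x H projection."""
    return vocab_size * embed_dim + embed_dim * hidden_dim

V = 30_000   # roughly BERT-base's WordPiece vocabulary size
H = 768      # BERT-base hidden size
E = 128      # small factorized embedding size (the ALBERT default)

full = embedding_params(V, H)
fact = factorized_embedding_params(V, E, H)
print(f"full V*H embedding:   {full:,} params")   # 23,040,000
print(f"factorized V*E + E*H: {fact:,} params")   # 3,938,304
```

Note that the embedding matrix is also where per-token information (including any sensitive domain terms added to the vocabulary) lives one row per token, so shrinking V and E reduces both model size and the surface area relevant to constraint (2).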
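For part (d), a standard formulation of the distillation objective blends cross-entropy on the hard label with a temperature-softened KL term against the teacher. The dependency-free sketch below is a minimal illustration of that loss (the function names and the T=2, alpha=0.5 defaults are illustrative choices, not prescribed values):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=2.0, alpha=0.5):
    """alpha * CE(student, hard label) + (1-alpha) * T^2 * KL(teacher || student).

    The T^2 factor keeps the soft-target gradient magnitude comparable
    across temperatures, as in standard knowledge distillation.
    """
    ce = -math.log(softmax(student_logits)[hard_label])
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * ce + (1 - alpha) * (T * T) * kl
```

When the student already matches the teacher's logits, the KL term vanishes and only the hard-label cross-entropy remains, which is one way to see why distillation helps a low-capacity (e.g., parameter-shared) student: the soft targets carry the teacher's inter-class structure that one-hot labels discard.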


Updated 2026-02-06


Tags

Data Science

Foundations of Large Language Models Course

Computing Sciences
