Essay

Choosing a BERT Compression Strategy for an On-Prem Document Triage System

Your company is deploying an on-prem document triage system that uses a BERT-style encoder (trained with masked language modeling and next sentence prediction) to classify and route internal documents. The system must run on a fixed CPU-only server with a strict RAM cap, but latency is less critical than maintaining classification quality on domain-specific terminology (product codenames, acronyms, and part numbers). You are allowed to change (a) the vocabulary size, (b) the embedding size, (c) whether Transformer layers share parameters across the stack, and (d) whether to train a smaller student model via knowledge distillation from a large in-house teacher.

Write a recommendation memo that proposes a coherent design (not just independent tweaks) and defends it. Your answer must explicitly explain how your choices interact—for example, how vocabulary size and embedding size jointly affect the embedding matrix memory footprint and representation capacity, how cross-layer parameter sharing changes parameter count and may affect expressiveness, and how knowledge distillation can (or cannot) compensate for capacity reductions. Conclude with the key risks you would monitor in evaluation (e.g., failure modes on rare domain terms) and why those risks follow from your design.
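Before drafting the memo, it can help to make the memory trade-offs concrete. The sketch below is a rough, illustrative parameter-count calculator (all formulas and sizes are simplifying assumptions, not measurements): it shows how vocabulary size `V` and embedding size `E` jointly set the embedding matrix footprint, how an ALBERT-style factorized embedding (`E < H` plus an `E`-to-`H` projection) decouples vocabulary size from hidden size, and how cross-layer parameter sharing collapses the per-layer cost to a single layer's worth of weights.

```python
# Illustrative sketch (assumed formulas, not exact BERT bookkeeping):
# how V (vocab size), E (embedding size), H (hidden size), and
# cross-layer sharing interact to set total parameter count.

def embedding_params(V, E, H):
    """Token embedding matrix V*E, plus an ALBERT-style E->H
    projection when the embedding is factorized (E != H)."""
    return V * E + (E * H if E != H else 0)

def layer_params(H, ffn_mult=4):
    """Rough per-layer count: self-attention projections (4*H*H)
    plus the feed-forward block (2 * ffn_mult * H*H); biases and
    layer norms are ignored for simplicity."""
    return 4 * H * H + 2 * ffn_mult * H * H

def total_params(V, E, H, n_layers, share_layers):
    """Cross-layer sharing reuses one layer's weights across the stack,
    so the layer cost no longer scales with depth."""
    n_effective = 1 if share_layers else n_layers
    return embedding_params(V, E, H) + layer_params(H) * n_effective

# BERT-base-like baseline: V=30k, E=H=768, 12 unshared layers
base = total_params(30_000, 768, 768, 12, share_layers=False)
# Compressed design: factorized embedding (E=128) + shared layers
small = total_params(30_000, 128, 768, 12, share_layers=True)
print(f"baseline   ≈ {base / 1e6:.1f}M params (~{base * 4 / 1e6:.0f} MB fp32)")
print(f"compressed ≈ {small / 1e6:.1f}M params (~{small * 4 / 1e6:.0f} MB fp32)")
```

Note how the two knobs interact: with an unfactorized embedding, shrinking `V` or `E` trades memory directly against representation capacity for rare domain terms, whereas factorization plus sharing cuts most of the parameter count while leaving the hidden size (and thus per-layer expressiveness per forward pass) intact, which is exactly the interaction the memo should reason about.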

Updated 2026-02-06

