Compressing a BERT-Based Search Re-Ranker for Edge Deployment Without Losing Domain Coverage
You lead an applied NLP team deploying a BERT-based cross-encoder re-ranker for an internal enterprise search product that must run on an edge appliance with strict limits: the model artifact (weights) must be ≤ 120 MB and p95 latency must be ≤ 40 ms. The current teacher model is a standard BERT-style encoder trained with masked language modeling and next sentence prediction, then fine-tuned for query–document relevance. It performs well, but the artifact is ~420 MB and latency is too high. Your domain includes many rare product codes and abbreviations (e.g., "ZX-13Q", "A9R-RevB"), and stakeholders report that smaller-vocabulary prototypes sometimes fail to match these terms.
You are considering three student-model design proposals:
A) Keep the same vocabulary as the teacher, reduce embedding size from 768 to 256, and train the student via knowledge distillation from the teacher.
B) Cut the vocabulary size by 60% (more aggressive subword merging), keep embedding size at 768, and train the student via knowledge distillation from the teacher.
C) Keep the same vocabulary and embedding size as the teacher, but use cross-layer parameter sharing so all Transformer layers reuse one set of parameters; then fine-tune directly on relevance labels (no distillation).
Assume the token embedding matrix size scales approximately with |V| × d_e and is a major contributor to total model size, and that distillation is available because you can run the teacher offline during training but not at inference.
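To make the trade-off concrete, here is a minimal back-of-the-envelope sketch of the embedding-matrix footprint under each proposal. It assumes a BERT-base-style WordPiece vocabulary of 30,522 tokens and float32 weights; these are illustrative assumptions, not figures from the scenario.

```python
# Rough token-embedding-matrix sizes for the teacher and the three proposals.
# Assumptions (not given in the scenario): |V| = 30,522 (BERT-base WordPiece),
# float32 weights (4 bytes per parameter), MB = 10^6 bytes.
def embedding_bytes(vocab_size: int, d_e: int, bytes_per_param: int = 4) -> int:
    """Size in bytes of a |V| x d_e token embedding matrix."""
    return vocab_size * d_e * bytes_per_param

V, D = 30_522, 768
teacher = embedding_bytes(V, D)             # same vocab, d_e = 768
prop_a  = embedding_bytes(V, 256)           # same vocab, d_e reduced to 256
prop_b  = embedding_bytes(int(V * 0.4), D)  # vocab cut by 60%, d_e = 768
prop_c  = teacher                           # cross-layer sharing leaves embeddings untouched

for name, size in [("teacher", teacher), ("A", prop_a), ("B", prop_b), ("C", prop_c)]:
    print(f"{name}: {size / 1e6:.1f} MB")
```

Note that proposal A shrinks the embedding table more than proposal B's 60% vocabulary cut while preserving every subword (and thus coverage of rare product codes), and proposal C does not shrink the embedding table at all.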
Which proposal (A, B, or C) is the best overall choice to meet the deployment constraints while minimizing the risk of losing relevance on rare product codes, and why? Your answer must explicitly connect (1) vocabulary size vs. domain coverage, (2) embedding size effects on parameter/memory footprint, (3) cross-layer parameter sharing effects on capacity/efficiency, and (4) why knowledge distillation changes the expected accuracy of a smaller BERT-style student.
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Related
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
What is BERT?
BERT's Core Architecture
Embedding Size in Transformer Models
BERT Model Sizes and Hyperparameters
Strategies for Improving BERT: Model Scaling
Approaches to Extending BERT for Multilingual Support
Using BERT as an Encoder in Sequence-to-Sequence Models
Considerations in BERT Model Development
Analysis of Bidirectional Context in Language Models
A language model is pre-trained using a method where it is given a sentence with a randomly hidden word, for example: 'The quick brown [HIDDEN] jumps over the lazy dog.' The model's goal is to predict the hidden word by examining all the other visible words in the sentence. What is the primary advantage of this specific training approach for understanding language?
Evaluating Pre-training Task Relevance
Designing a Mobile-Deployable BERT Encoder Under Tight Memory and Latency Constraints
Choosing a BERT Compression Strategy for an On-Prem Document Triage System
Selecting a BERT Variant for a Regulated, On-Device Email Classification Feature
Right-Sizing a BERT Encoder for a Multilingual Support-Ticket Router Without Breaking the Memory Budget
Selecting an Efficient BERT Variant for a Domain-Specific Contract Clause Classifier
Compressing a BERT-Based Search Re-Ranker for Edge Deployment Without Losing Domain Coverage
Your team is adapting a pre-trained BERT encoder (...
Your team is reviewing a design doc for an efficie...
You’re leading an internal rollout of a BERT-based...
Your team is compressing an internal BERT-based en...
Vocabulary Size in Transformers
BERT Output Adapter
An NLP engineer is developing a new language model for a specialized domain with a limited amount of training data. They are deciding on the dimensionality of the vectors used to represent tokens. What is the most critical trade-off they must consider when choosing between a higher-dimensional vector (e.g., 1024) versus a lower-dimensional one (e.g., 128)?
Input Embedding Formula in BERT-like Models
A data scientist is configuring a new transformer-based model for a sentence-pair classification task. They have defined the dimensions for the different input vector components as follows:
{'token_embedding_dim': 768, 'positional_embedding_dim': 768, 'segment_embedding_dim': 2}. Based on the standard architecture for such models, what is the fundamental error in this configuration?
Diagnosing an Input Vector Mismatch
An engineer is designing a 24-layer deep neural network for language understanding. They are evaluating two design options. Option 1 uses 24 distinct sets of parameters, one for each layer. Option 2 uses a single set of parameters that is repeated for all 24 layers. What is the most significant trade-off the engineer must consider when choosing Option 2 over Option 1?
Optimizing a Language Model for Mobile Deployment
Implementing a design where a single set of transformation parameters is used repeatedly for all 12 layers of a language model will primarily increase the model's predictive accuracy compared to a model with 12 unique sets of parameters.
Multi-level Knowledge Distillation in BERT
A development team has created a very large, state-of-the-art language model that achieves high accuracy on a text summarization task. However, they need to deploy this capability on a mobile device with limited memory and processing power. The team decides to build a new, much smaller model for the mobile app. Considering the goal is to make the small model as accurate as possible, which of the following training strategies is the most sound and effective?
Rationale for Model Compression Technique
In the process of training a compact language model by learning from a larger, more complex one, match each component to its specific role.
Vocabulary Design for a Specialized Language Model
Evaluating Vocabulary Size Choices in Language Models
A team of engineers is tasked with creating a language model for deployment on mobile devices, where storage capacity is a primary constraint. They are debating the size of the model's vocabulary. Which of the following approaches best addresses the core trade-off they face in this specific scenario?