Case Study

Selecting an Efficient BERT Variant for a Domain-Specific Contract Clause Classifier

You lead an NLP team building an internal contract-clause classifier (12 labels) for a legal department. The model must run in a CPU-only batch pipeline that processes 2 million clauses overnight. You have a hard cap of 450 MB for the model artifact in the container image, and inference throughput is currently the bottleneck. You can pre-train/fine-tune on your company’s corpus, but labeled data is limited (about 30k clauses). The baseline is a standard BERT encoder fine-tuned for classification.

You are considering three redesign options:

A) Keep the same number of Transformer layers, but increase the WordPiece vocabulary substantially to better cover legal terms.
B) Keep the vocabulary the same, but reduce embedding size and use cross-layer parameter sharing (reuse one Transformer layer’s parameters across the full stack).
C) Train a smaller student encoder using knowledge distillation from your current fine-tuned BERT teacher, while also modestly reducing embedding size (vocabulary unchanged).
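
To help frame point (1) of the question below, the sketch that follows estimates how each option changes the parameter and memory footprint. It is a rough calculation only: it assumes BERT-base-like dimensions (12 layers, hidden size 768, ~30k WordPiece vocabulary, 512 positions), ignores biases, LayerNorm, and the classifier head, and the 60k vocabulary for option A and the 6-layer/512-hidden student for option C are illustrative numbers not fixed by the scenario. Option B is modeled as an ALBERT-style factorized embedding plus a single shared layer copy.

```python
# Rough parameter/footprint estimator for BERT-style encoders.
# Sketch only: BERT-base-like dimensions are assumed, biases/LayerNorm are
# ignored, and the option-A vocabulary and option-C student dims are illustrative.

def encoder_params(vocab, hidden, layers, embed=None, shared_layers=False,
                   max_pos=512, ffn_mult=4):
    """Approximate parameter count of a Transformer encoder."""
    embed = hidden if embed is None else embed
    embedding = vocab * embed + max_pos * embed      # token + position tables
    if embed != hidden:
        embedding += embed * hidden                  # factorized projection up to hidden size
    per_layer = 4 * hidden * hidden                  # Q, K, V, output projections
    per_layer += 2 * hidden * ffn_mult * hidden      # feed-forward up/down projections
    unique_layers = 1 if shared_layers else layers   # cross-layer sharing stores one layer's weights
    return embedding + unique_layers * per_layer

def fp32_mb(params):
    return params * 4 / 1e6                          # 4 bytes per fp32 weight

estimates = {
    "baseline BERT-base-like":      encoder_params(30_522, 768, 12),
    "A: larger legal vocab (60k)":  encoder_params(60_000, 768, 12),
    "B: embed 128 + shared layers": encoder_params(30_522, 768, 12, embed=128, shared_layers=True),
    "C: distilled 6-layer student": encoder_params(30_522, 512, 6, embed=384),
}

for name, p in estimates.items():
    print(f"{name:30s} ~{p/1e6:6.1f}M params  ~{fp32_mb(p):5.0f} MB fp32")
```

Under these assumptions the baseline lands near 109M parameters (~435 MB in fp32), already close to the 450 MB cap, so the estimator mainly makes it easy to compare how the vocabulary-times-embedding-size term and the number of unique layer copies drive each option's footprint.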

Which option would you recommend and why? In your answer, explicitly connect (1) how vocabulary size and embedding size affect the parameter/memory footprint, (2) how cross-layer parameter sharing changes model size and potential accuracy, and (3) why knowledge distillation is or is not the best fit given limited labeled data and the need to preserve BERT-like bidirectional understanding for clause classification.
