Case Study

Choosing Between Prompt Tuning and Prefix Fine-Tuning for a Latency-Critical, Multi-Task LLM Service

You run an internal LLM platform that serves 30 department-specific tasks (e.g., HR policy Q&A, procurement clause extraction, IT ticket triage) from a single frozen base model. Each task must be deployable as a small “adapter artifact” (you cannot store 30 full model copies). Two additional constraints have emerged:

  1. The inference stack is standardized and cannot be modified to add new per-layer inputs or change Transformer internals; only the request payload (tokens/embeddings at the model input) can vary by task.
  2. A new product requirement demands that the adapter artifact be as small as possible and that per-request latency overhead be minimal (see the sizing sketch after this list).
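
To make constraint 2 concrete, here is a rough back-of-envelope sizing sketch. Every dimension in it (hidden size 4096, 20 trainable vectors, 32 layers) is an assumption chosen for illustration, not a figure from the case; practical prefix fine-tuning implementations often store even more per layer (e.g., separate key/value prefixes).

```python
# Illustrative adapter sizing; every dimension here is an assumption,
# not taken from the case description.
d_model = 4096    # hidden size of the frozen base model (assumed)
n_prompt = 20     # number of trainable prompt/prefix vectors (assumed)
n_layers = 32     # Transformer depth (assumed)
bytes_fp16 = 2    # bytes per parameter at fp16

prompt_tuning = n_prompt * d_model             # vectors at the input only
prefix_tuning = n_prompt * d_model * n_layers  # one vector set per layer

print(f"prompt tuning: {prompt_tuning:,} params, "
      f"~{prompt_tuning * bytes_fp16 / 1024:.0f} KB fp16")
print(f"prefix tuning: {prefix_tuning:,} params, "
      f"~{prefix_tuning * bytes_fp16 / 1024**2:.1f} MB fp16")
# -> prompt tuning: 81,920 params, ~160 KB fp16
# -> prefix tuning: 2,621,440 params, ~5.0 MB fp16
```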

A senior engineer proposes prefix fine-tuning on the grounds that it is parameter-efficient and often performs well; another engineer proposes prompt tuning with continuous (soft) prompts.

As the technical decision-maker, which approach should you choose, and why? In your answer, explicitly connect (a) what “parameter-efficient fine-tuning” means in this context, (b) how continuous (soft) prompts are represented and trained, and (c) where the trainable vectors are composed into the model’s computation (e.g., at the embedding input vs. concatenated into each layer’s input matrix as \( \mathbf{H}^l = [\mathbf{p}_0^l \dots \mathbf{p}_n^l ; \mathbf{h}_0^l \dots \mathbf{h}_m^l] \)).
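
As a reference for part (c), below is a minimal PyTorch-style sketch of the two composition points. The class names, shapes, and initialization are illustrative assumptions, not any library's actual API; the base model itself is treated as frozen throughout.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt tuning: trainable vectors composed once, at the embedding input."""
    def __init__(self, n_prompt: int, d_model: int):
        super().__init__()
        # The entire per-task adapter artifact: an [n_prompt, d_model] matrix.
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq, d_model], from the frozen embedding table.
        batch = input_embeds.size(0)
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # The frozen model just sees a longer input sequence, so only the
        # request payload changes, consistent with constraint 1.
        return torch.cat([p, input_embeds], dim=1)

class PerLayerPrefix(nn.Module):
    """Prefix fine-tuning: trainable vectors concatenated into every layer's
    input, H^l = [p_0^l ... p_n^l ; h_0^l ... h_m^l]."""
    def __init__(self, n_layers: int, n_prompt: int, d_model: int):
        super().__init__()
        self.prefixes = nn.Parameter(torch.randn(n_layers, n_prompt, d_model) * 0.02)

    def compose(self, layer_idx: int, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, d_model] entering Transformer layer `layer_idx`.
        batch = hidden.size(0)
        p = self.prefixes[layer_idx].unsqueeze(0).expand(batch, -1, -1)
        # This concatenation must be wired into each layer's forward pass,
        # i.e. it changes Transformer internals, which constraint 1 forbids.
        return torch.cat([p, hidden], dim=1)

# In either case, "parameter-efficient" means only the adapter is trained:
# the base model's weights stay frozen, and gradients flow only into the
# soft prompt (or prefixes) through the usual task loss.
adapter = SoftPrompt(n_prompt=20, d_model=4096)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
```

Note the structural difference the sketch exposes: under prompt tuning the per-task artifact is just the prompt matrix, prepended at request time with the serving stack untouched, whereas the per-layer variant requires hooks inside every Transformer layer.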

Tags: Ch.3 Prompting - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences; Ch.4 Alignment - Foundations of Large Language Models; Data Science
