Essay

Selecting Prompt Tuning vs Prefix Fine-Tuning by Reasoning from Where Soft Prompts Enter the Transformer

You are advising an internal platform team that must support 30 task-specific adaptations of the same frozen LLM (e.g., policy QA, ticket triage, contract clause extraction). The serving stack is standardized: it can easily prepend extra vectors to the input embedding sequence before layer 1, but it is costly to modify the model graph to inject additional vectors into every Transformer layer. The team is considering two PEFT options that both use continuous (soft) prompts: (A) prompt tuning, where a learned sequence of soft prompt vectors is prepended only at the embedding layer; (B) prefix fine-tuning, where each layer l receives an input matrix H^l formed by concatenating trainable prefix vectors p_0^l..p_n^l with the previous layer’s hidden states h_0^l..h_m^l (i.e., H^l = [p^l ; h^l]).
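
For concreteness, here is a minimal sketch contrasting the two injection points, assuming a generic frozen Transformer exposed as an embedding function and a list of layer callables (all names are illustrative, not a specific library's API):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for a frozen backbone; names are illustrative.
d_model, n_layers, n_prefix = 768, 12, 20

# (A) Prompt tuning: the only trainable weights are soft prompt vectors,
# prepended ONCE at the embedding layer, before layer 1.
soft_prompt = nn.Parameter(torch.randn(n_prefix, d_model))

def prompt_tuning_forward(embed, layers, input_ids):
    h = embed(input_ids)                                    # (batch, m, d_model)
    p = soft_prompt.unsqueeze(0).expand(h.size(0), -1, -1)
    h = torch.cat([p, h], dim=1)                            # [soft prompt ; embeddings]
    for layer in layers:                                    # model graph untouched
        h = layer(h)
    return h

# (B) Prefix fine-tuning: every layer l consumes H^l = [p^l ; h^l], so the
# serving graph must inject trainable vectors at ALL n_layers layers.
prefixes = nn.ParameterList(
    [nn.Parameter(torch.randn(n_prefix, d_model)) for _ in range(n_layers)]
)

def prefix_tuning_forward(embed, layers, input_ids):
    h = embed(input_ids)
    for p_l, layer in zip(prefixes, layers):
        p = p_l.unsqueeze(0).expand(h.size(0), -1, -1)
        out = layer(torch.cat([p, h], dim=1))               # H^l = [p^l ; h^l]
        h = out[:, n_prefix:, :]                            # keep only hidden states;
                                                            # layer l+1 prepends p^{l+1}
    return h
```

Note the structural difference the sketch makes visible: option (A) is pure input manipulation, while option (B) requires touching the forward pass of every layer.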

Write a recommendation memo that (1) explains, using the H^l composition above, what architectural and serving changes prefix fine-tuning implies compared with prompt tuning, (2) analyzes how those changes affect per-request latency and operational complexity when hosting many tasks, and (3) justifies which approach you would choose if the primary business goal is to minimize deployment friction while staying within the PEFT philosophy of updating only a small number of parameters. Your answer should explicitly connect where the soft prompts live (embedding-only vs. every layer) to the trade-offs you claim.
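
A useful framing for point (2): with prompt tuning, hosting all 30 tasks reduces to a per-task lookup of small soft prompt tensors in front of one unmodified model graph. A hedged sketch of that serving path follows (task names, file paths, and the frozen_model methods are assumptions for illustration):

```python
import torch

# One frozen model, 30 tiny artifacts: each task contributes only an
# (n_prefix, d_model) soft prompt tensor. All names/paths are hypothetical.
task_prompts = {
    "policy_qa": torch.load("prompts/policy_qa.pt"),
    "ticket_triage": torch.load("prompts/ticket_triage.pt"),
    # ... remaining tasks
}

def serve(frozen_model, task, input_ids):
    p = task_prompts[task]                                  # pick the task's prompt
    h = frozen_model.embed(input_ids)
    h = torch.cat([p.unsqueeze(0).expand(h.size(0), -1, -1), h], dim=1)
    return frozen_model.run_layers(h)                       # same graph for every task
```

A strong memo should argue whether the prefix fine-tuning equivalent (per-layer prefix tables plus injection hooks inside the graph) is worth its extra operational surface.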

Updated 2026-02-06

Tags: Ch.3 Prompting - Foundations of Large Language Models; Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences; Ch.4 Alignment - Foundations of Large Language Models; Data Science
