Case Study

Post-Deployment PEFT Choice and Prefix Input Composition for a Multi-Tenant LLM Service

You run a multi-tenant internal LLM platform in which one frozen base model serves 30 business units. Each unit needs a task-specific adaptation that can be swapped in at request time with minimal latency and minimal per-task storage. Your inference stack is standardized, and you are allowed to (a) prepend trainable continuous vectors only at the input embedding layer, or (b) modify the model so that each Transformer layer receives additional trainable vectors prepended to that layer’s input. A new requirement arrives: several units want to update their adaptation weekly from fresh labeled data. At the same time, the platform team needs a clear way to reason about what is being trained and where it is injected, so it can debug occasional regressions.
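The "swapped in at request time with minimal per-task storage" constraint can be made concrete with a small sketch. The tenant names, shapes, and the `build_input` helper below are illustrative assumptions, not part of the platform described above; the point is that each unit owns only a small trainable matrix while the base model weights stay frozen and shared.

```python
import numpy as np

# Hypothetical sizes for illustration only.
d_model, n_prefix = 8, 4
rng = np.random.default_rng(1)

# One small trainable tensor per business unit; the frozen base model
# is shared by all tenants and never duplicated.
tenant_prompts = {
    "finance": rng.normal(size=(n_prefix, d_model)),
    "hr": rng.normal(size=(n_prefix, d_model)),
}

def build_input(tenant: str, token_embeds: np.ndarray) -> np.ndarray:
    """Prepend the tenant's trainable soft prompt to the frozen token embeddings."""
    return np.concatenate([tenant_prompts[tenant], token_embeds], axis=0)

embeds = rng.normal(size=(5, d_model))        # frozen embedding lookup of the request
x = build_input("finance", embeds)            # per-request swap: just a dict lookup
assert x.shape == (n_prefix + 5, d_model)

# A weekly retrain replaces only tenant_prompts["finance"] — a few KB of
# parameters per unit — with no change to the shared base weights.
```

This also makes the debugging story simple: the only trainable state per tenant is one small, named tensor with a known injection point.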

Given these constraints, choose which approach you would deploy (input-embedding-only prompt tuning vs. per-layer prefix fine-tuning) and justify your choice by explicitly explaining (1) how the trainable continuous prompts relate to PEFT goals, and (2) how the per-layer input matrix is composed in the per-layer approach; that is, what is concatenated with what, and which part is trainable versus frozen.


Updated 2026-02-06


Tags: Ch.3 Prompting - Foundations of Large Language Models; Ch.4 Alignment - Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences; Data Science
