Formula

Log-Likelihood Objective for Distilling Context into Soft Prompts

When applying knowledge distillation to compress context into soft prompts, a simple training objective is to maximize the log-likelihood of the teacher model's prediction given the compressed representation. This is formalized as $\hat{\sigma} = \arg\max_{\sigma} \log \Pr(\hat{\mathbf{y}} \mid \sigma, \mathrm{z})$, where $\hat{\mathbf{y}}$ is the prediction produced by the teacher model using the full context, $\sigma$ denotes the continuous (soft) prompt embeddings, and $\mathrm{z}$ is the user input.
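Below is a minimal sketch of this objective in PyTorch. It uses a toy embedding-level model; the names `embed` and `lm`, the shapes, and the training loop are illustrative assumptions, not from the source. The key point it shows is that only the soft prompt $\sigma$ is optimized, by minimizing the negative log-likelihood of the teacher prediction $\hat{\mathbf{y}}$ given the sequence $[\sigma; \mathrm{z}]$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model, prompt_len = 100, 32, 4

embed = torch.nn.Embedding(vocab_size, d_model)  # token embedding table
lm = torch.nn.Linear(d_model, vocab_size)        # toy stand-in for the frozen LM

# The language model stays frozen; only the soft prompt is trained.
for p in list(embed.parameters()) + list(lm.parameters()):
    p.requires_grad_(False)

# sigma: trainable continuous prompt embeddings (the compressed context)
sigma = torch.nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

z = torch.randint(vocab_size, (6,))      # user input token ids
y_hat = torch.randint(vocab_size, (3,))  # teacher's prediction made with the full context

optimizer = torch.optim.Adam([sigma], lr=1e-2)

for step in range(100):
    # Build the input sequence [sigma; z; y_hat[:-1]] at the embedding level,
    # teacher-forcing the target tokens as in standard causal LM training.
    inputs = torch.cat([sigma, embed(z), embed(y_hat[:-1])], dim=0)
    logits = lm(inputs)  # (seq_len, vocab_size)

    # Positions whose next-token predictions should match y_hat: the last
    # token of z and each teacher-forced target token except the last.
    start = prompt_len + z.numel() - 1
    pred_logits = logits[start : start + y_hat.numel()]

    # Objective: maximize log Pr(y_hat | sigma, z), i.e. minimize the NLL.
    loss = F.cross_entropy(pred_logits, y_hat)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the gradient flows only into `sigma`, the distilled soft prompt learns to stand in for the full context that produced $\hat{\mathbf{y}}$, while the underlying model is left untouched.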
