Formula for Optimizing Soft Prompts via Context Compression
The optimal soft prompt, denoted as hat(σ), is determined by minimizing a function that compares the prediction from the full context, hat(y), with the prediction from the compressed context, hat(y)_σ. This function typically represents a loss or similarity measure. The optimization problem is formally expressed as:

hat(σ) = argmin_σ s(hat(y), hat(y)_σ)
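As a minimal sketch of this objective, the example below treats s(·,·) as a KL divergence between next-token distributions and selects, from a few candidate soft prompts, the one whose induced prediction best matches the full-context prediction. The distributions and candidate names are illustrative assumptions, not values from the text.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; one possible choice of s(., .)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distribution hat(y), conditioned on the full context c.
y_hat_full = [0.70, 0.20, 0.10]

# Hypothetical candidate soft prompts, each inducing a prediction hat(y)_sigma.
candidates = {
    "sigma_a": [0.65, 0.25, 0.10],
    "sigma_b": [0.10, 0.60, 0.30],
}

# hat(sigma) = argmin_sigma s(hat(y), hat(y)_sigma)
best = min(candidates, key=lambda k: kl_divergence(y_hat_full, candidates[k]))
print(best)  # sigma_a induces the distribution closest to the full-context one
```

In practice σ is a continuous embedding optimized by gradient descent rather than chosen from a discrete set; the discrete argmin here only illustrates the shape of the objective.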
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Related
Soft Prompt Learning as Context Compression via Knowledge Distillation
Formula for Optimizing Soft Prompts via Context Compression
Alternative Methods for Soft Prompt Optimization
A developer is tasked with creating a compact, learned 'soft prompt' that can effectively replace a very long and detailed set of instructions (the 'full context') for a language model. The objective is to ensure that for any given user query, the model's final output is nearly identical whether it's conditioned on the long instructions or the new compact prompt. Which of the following optimization strategies directly targets this specific objective?
When training a soft prompt to act as a compressed version of a longer context, the primary optimization objective is to ensure the learned soft prompt's vector representation is as close as possible to the vector representation of the original context.
Debugging Soft Prompt Optimization
Interpreting the Soft Prompt Optimization Formula
Formula for Optimizing Soft Prompts via Context Compression
Formula for Soft Prompt Optimization via Log-Likelihood Maximization
Formula for Soft Prompt Optimization by Minimizing KL Divergence
An inference engine using a continuous batching strategy is currently processing a set of text generation requests that fully utilizes its processing capacity. At this point, a new, additional request arrives. What is the most likely immediate action the system's scheduler will take regarding this new request?
A language model is provided with a context c ('Translate the following sentence for a medical professional') and an input z ('Le patient présente une pyrexie'). The model computes the conditional probabilities for several potential English translations (y). Based on the principle of selecting the output that maximizes the conditional probability given the full context and input, which translation should the model choose as its prediction?
Analyzing Contextual Influence on LLM Predictions
Formula for Optimizing Soft Prompts via Context Compression
Formula for Soft Prompt Optimization by Minimizing KL Divergence
An LLM is provided with a compressed representation of context, denoted as σ, and an input z. The model's goal is to predict the most likely output y. After processing σ and z, the model computes the following conditional probabilities for four possible outputs:
- Pr(y='mat' | σ, z) = 0.65
- Pr(y='roof' | σ, z) = 0.25
- Pr(y='sky' | σ, z) = 0.05
- Pr(y='idea' | σ, z) = 0.05
Based on the principle of selecting the output that maximizes the conditional probability, what will the model's final prediction, ŷ_σ, be?
Deconstructing the LLM Prediction Formula
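The selection rule in the question above (take the argmax of the conditional probabilities) can be sketched in a few lines; the probability table is copied from the card:

```python
# Conditional probabilities Pr(y | sigma, z) from the question above.
probs = {
    "mat": 0.65,
    "roof": 0.25,
    "sky": 0.05,
    "idea": 0.05,
}

# hat(y)_sigma = argmax_y Pr(y | sigma, z)
y_hat_sigma = max(probs, key=probs.get)
print(y_hat_sigma)  # mat
```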
Analyzing an LLM's Incorrect Prediction
Learn After
A machine learning engineer is training a soft prompt, σ, to replace a lengthy context, c. They use the following optimization formula, where s(·,·) is a function measuring the difference between two predictions:
hat(σ) = argmin_σ s(hat(y), hat(y)_σ)
Here, hat(y) is the model's prediction with the full context c, and hat(y)_σ is the prediction with the soft prompt σ. After training, the engineer observes that for many inputs, the value of s(hat(y), hat(y)_σ) is consistently high. What does this observation most directly imply about the outcome of the training process?
Impact of the Similarity Function in Soft Prompt Optimization
In the context of learning a compressed representation of a long text, consider the optimization formula:
hat(σ) = argmin_σ s(hat(y), hat(y)_σ), where hat(y) is the prediction from the full text and hat(y)_σ is the prediction from the compressed representation σ. If the function s(·,·) were changed from a dissimilarity measure (e.g., a loss function) to a similarity measure (e.g., a cosine similarity score), the argmin operator should be replaced with argmax to correctly identify the optimal compressed representation hat(σ).
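The argmin/argmax claim above can be checked numerically. The sketch below uses hypothetical 2-D prediction vectors (assumptions for illustration, not values from the text) and shows that minimizing a distance and maximizing a cosine similarity select the same candidate:

```python
import math

def cosine_similarity(u, v):
    """Similarity measure: higher means the vectors are more alike."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean_distance(u, v):
    """Dissimilarity measure: lower means the vectors are more alike."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical prediction hat(y) from the full text, and predictions
# induced by two candidate compressed representations.
y_hat = [1.0, 0.0]
candidates = {"sigma_1": [0.9, 0.1], "sigma_2": [0.1, 0.9]}

# With a dissimilarity measure s, the optimum is found by argmin ...
best_by_argmin = min(candidates, key=lambda k: euclidean_distance(y_hat, candidates[k]))
# ... with a similarity measure, the same optimum is found by argmax.
best_by_argmax = max(candidates, key=lambda k: cosine_similarity(y_hat, candidates[k]))
print(best_by_argmin == best_by_argmax)  # both select sigma_1
```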