1Cademy - Length-Aware Context Selection and Compression for RAG

Learn Before

Original RAG Sequence-Generation Framework (Lewis et al., 2020)

Concept

Length-Aware Context Selection and Compression for RAG

Length-aware context selection and compression is the line of retrieval-augmented generation (RAG) work that treats the generator's input token budget as a first-class constraint: retrieved context is selected, truncated, or compressed to fit it. Two complementary traditions are typically cited together. Knapsack-style selection, with roots in extractive summarization (e.g., Riedhammer et al., Interspeech 2008), formulates the choice of which candidate units to include under a length cap as a 0/1 knapsack problem: maximize a utility (such as expected ROUGE or relevance) subject to a token-budget constraint. Prompt compression methods such as LLMLingua (Jiang et al., EMNLP 2023) and LongLLMLingua (Jiang et al., ACL 2024) instead compress the assembled prompt at the token level under an explicit budget controller; LongLLMLingua adds question-aware compression and dynamic per-document compression ratios for RAG. The common motivation is that retrieving a fixed number of passages does not fix the number of tokens passed to the generator, so practical RAG deployments must apply a separate length-aware policy on top of top-$k$ retrieval. Papers cite this body of work when they argue that token-cap effects are a distinct axis from retrieval ranking and should be analyzed separately from ordering and serialization effects.
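To make the knapsack view concrete, the following is a minimal sketch of budgeted passage selection, not the formulation of any of the cited papers. It assumes each retrieved passage already carries a token count and a scalar utility; the `Passage` class, the `select_passages` function, and the example numbers are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    text: str
    tokens: int     # token count under the generator's tokenizer (assumed precomputed)
    utility: float  # relevance or other utility score (assumed given by the retriever)


def select_passages(passages: List[Passage], budget: int) -> List[int]:
    """Return indices of the passages that maximize total utility
    while keeping the total token count within `budget` (0/1 knapsack DP)."""
    n = len(passages)
    # best[i][b] = best utility achievable with the first i passages and at most b tokens
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, p in enumerate(passages, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip passage i
            if p.tokens <= b:
                candidate = best[i - 1][b - p.tokens] + p.utility  # option 2: take it
                if candidate > best[i][b]:
                    best[i][b] = candidate
    # Walk back through the table to recover which passages were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= passages[i - 1].tokens
    chosen.reverse()
    return chosen


# Illustrative data only: four retrieved passages with made-up sizes and scores.
example = [
    Passage("A", tokens=300, utility=0.9),
    Passage("B", tokens=120, utility=0.5),
    Passage("C", tokens=150, utility=0.6),
    Passage("D", tokens=80, utility=0.2),
]
selected = select_passages(example, budget=300)
```

With a 300-token budget, taking passages in ranking order would fill the budget with passage A alone, whereas the knapsack selection keeps B and C, which together fit the cap and carry more total utility. This is one way to see why a length-aware policy is a separate decision from the retrieval ranking itself.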

Updated 2026-05-18


References

Reference: LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Learn After

Token-Cap Diagnostics Separated from Headline Comparisons
