1Cademy - Length-Aware Selection and Truncation for Token-Budgeted RAG

Learn Before

Budgeted Context Construction And Reproducibility (Related Work) in Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls

Concept

Length-Aware Selection and Truncation for Token-Budgeted RAG

Length-aware selection and truncation for token-budgeted RAG is a strand of prior work that treats the construction of an LLM's input context under a hard length or token budget as an explicit constrained selection or compression problem, rather than as fixed-$k$ retrieval.

Three representative formulations define this strand:

Knapsack-style extractive selection (Riedhammer et al., 2008). Length-constrained summarization is cast as a knapsack-packing problem: choose a subset of candidate units (e.g., sentences or utterances) that maximizes a utility score subject to a hard length cap. The problem is NP-hard but admits near-optimal heuristic and global-optimization solvers. This is the canonical formulation of budgeted selection under a length constraint.
Prompt compression for LLM inference (LLMLingua; Jiang et al., 2023). Given a target token budget, a budget controller allocates compression ratios across prompt components, and an iterative token-level compression procedure shortens the prompt while preserving the information most relevant to the downstream task. This treats budget satisfaction as a compression problem layered on top of selection.
List-adaptive retrieval truncation for RAG (Xu et al., 2024). A joint reranking-and-truncation model dynamically chooses the cut-off point of the retrieved list per query, trading off coverage of relevant material against the risk of feeding misinformation or noise into the generator. This treats budget satisfaction as a learned, query-conditioned truncation policy on the retrieved list.
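
To make the first formulation concrete, here is a minimal sketch of budgeted selection as a 0/1 knapsack, solved exactly by dynamic programming over integer token lengths. It is an illustration under stated assumptions, not the solver used by Riedhammer et al.: the function name `knapsack_select`, the per-unit utility scores, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def knapsack_select(
    utilities: Sequence[float],
    lengths: Sequence[int],
    budget: int,
) -> Tuple[float, List[int]]:
    """Pick the subset of units with maximum total utility within `budget` tokens.

    Assumes lengths are non-negative integers (e.g. token counts) and that
    utilities are additive across units, which is the plain knapsack setting.
    """
    n = len(utilities)
    # best[i][b] = best total utility using only the first i units within b tokens
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        u, w = utilities[i - 1], lengths[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip unit i-1
            if w <= b and best[i - 1][b - w] + u > best[i][b]:
                best[i][b] = best[i - 1][b - w] + u  # option 2: take unit i-1

    # Walk back through the table to recover which units were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


# Hypothetical toy input: three candidate units and a 6-token budget.
total, chosen = knapsack_select([6.0, 5.0, 4.0], [5, 3, 3], budget=6)
print(total, chosen)  # 9.0 [1, 2]
```

In this toy input the highest-scoring unit (index 0) is not selected: taking it would leave only one token of budget, whereas the two shorter units together fit exactly and score higher. That gap between "top-scored first" and "best under the cap" is what the knapsack framing makes explicit.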

The shared commitment across these formulations is that the budget is a first-class constraint: the system explicitly decides what to keep, what to drop, and what to compress in order to respect it. A new method in this strand would normally be evaluated by how well its selection or compression policy preserves downstream task quality under matched token budgets, against knapsack, compression, and learned-truncation baselines.
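
The contrast between fixed-$k$ retrieval and a budget-respecting policy can be sketched as follows. This is a deliberately simple illustration, assuming a ranked list of passages with known token lengths; the helper names (`fixed_k`, `prefix_under_budget`, `tokens_used`) and the numbers are hypothetical, and the prefix rule shown is a hand-written cut-off, not the learned truncation model of Xu et al.

```python
from typing import List, Sequence


def fixed_k(lengths: Sequence[int], k: int) -> List[int]:
    """Keep the top-k ranked passages regardless of how many tokens they use."""
    return list(range(min(k, len(lengths))))


def prefix_under_budget(lengths: Sequence[int], budget: int) -> List[int]:
    """Keep the longest ranked prefix whose total length fits the token budget."""
    kept: List[int] = []
    used = 0
    for i, n_tokens in enumerate(lengths):
        if used + n_tokens > budget:
            break  # cut the list here; everything ranked lower is dropped
        kept.append(i)
        used += n_tokens
    return kept


def tokens_used(lengths: Sequence[int], kept: Sequence[int]) -> int:
    """Total token count of the kept passages."""
    return sum(lengths[i] for i in kept)


# Hypothetical ranked list: token lengths of passages in rank order.
lengths = [120, 300, 80, 50]
budget = 450

top3 = fixed_k(lengths, k=3)
capped = prefix_under_budget(lengths, budget)
print(top3, tokens_used(lengths, top3))      # [0, 1, 2] 500
print(capped, tokens_used(lengths, capped))  # [0, 1] 420
```

Here fixed-$k$ with $k=3$ overshoots the 450-token cap, while the budgeted prefix stays under it. A matched-budget comparison would hold `budget` constant across all policies being compared, so that differences in downstream quality reflect the selection policy rather than the amount of context supplied.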

Updated 2026-05-18

References

Reference: LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Learn After

Token-Cap Analysis Isolates Ordering and Serialization Effects (Auditable Strict-Parity Graph-RAG Paper)
