
Cascading Models at Inference Time

Cascading is an inference-time strategy that routes each input through a sequence of models of increasing size and computational cost. The process begins with a small, cheap model. If this model produces a satisfactory result (e.g., a prediction with high confidence), its output is accepted. Otherwise, the input is escalated to a larger, more expensive model for a more accurate prediction. This conditional, multi-step approach significantly reduces the average cost per input, because the large model runs only on the inputs the small model cannot handle confidently.
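The routing logic above can be sketched in a few lines. This is a minimal illustration, not a real LLM pipeline: `small_model`, `large_model`, and the confidence threshold of 0.8 are all hypothetical stand-ins, with each "model" returning a prediction together with a confidence score in [0, 1].

```python
def small_model(x):
    # Hypothetical cheap model: confident on short inputs, unsure otherwise.
    return ("short", 0.95) if len(x) < 10 else ("long?", 0.40)

def large_model(x):
    # Hypothetical expensive model: assumed accurate and always confident.
    return ("short" if len(x) < 10 else "long", 0.99)

def cascade(x, threshold=0.8):
    """Accept the small model's answer if its confidence clears the
    threshold; otherwise escalate the input to the large model."""
    pred, conf = small_model(x)
    if conf >= threshold:
        return pred, "small"   # cheap path: large model never runs
    pred, conf = large_model(x)
    return pred, "large"       # expensive fallback path

print(cascade("hi"))                 # handled by the small model
print(cascade("a very long input"))  # escalated to the large model
```

The key design choice is the threshold: raising it sends more inputs to the large model (higher accuracy, higher average cost), while lowering it keeps more traffic on the cheap model.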


Updated 2026-05-01

Tags: Ch.4 Alignment - Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences