Case Study

Root-Cause Analysis of Long-Context Degradation After a Positional-Encoding Retrofit

You are on-call for an LLM platform team. A decoder-only model originally trained with RoPE for a 4k context window was retrofitted to support 32k tokens without full retraining. The team tried two different retrofits in separate builds:

Build A: Kept RoPE but extended context by scaling the RoPE base ("base scaling"), relying on the idea that context extension can be implemented by transforming the effective rotation angles (changing the base rescales the spectrum of per-dimension rotation frequencies).

Build B: Removed RoPE entirely and instead added a relative position bias term to attention scores.
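
To make the two retrofits concrete, the following is a minimal NumPy sketch, not the team's actual code: the head dimension, the enlarged base of 500,000, and the helper names (rope_angles, biased_scores) are illustrative assumptions. It shows how Build A changes the angles RoPE rotates query/key pairs by, how position interpolation (the alternative the task mentions) changes them differently, and how Build B replaces rotation with an additive bias on the attention logits.

```python
import numpy as np

def rope_angles(positions, dim=64, base=10_000.0):
    """Angles RoPE rotates each query/key pair by: theta_{p,i} = p / base**(2i/dim).

    Raising the base lowers most per-dimension rotation frequencies (longer
    periods), which stretches local phase differences while keeping very
    distant positions distinguishable.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # shape (dim/2,)
    return np.outer(positions, inv_freq)                      # shape (len(positions), dim/2)

positions_32k = np.arange(32_768)

# Build A ("base scaling"): keep raw positions, enlarge the base so the slower
# dimensions sweep a similar angular range over 32k positions as they did over
# 4k during training. The 500_000 value is an assumption, not the build's setting.
angles_build_a = rope_angles(positions_32k, base=500_000.0)

# Position interpolation (the alternative the task asks about): keep the trained
# base but rescale positions back into the original 0..4095 range.
angles_interp = rope_angles(positions_32k * (4_096 / 32_768), base=10_000.0)

# Build B: no rotation at all; a relative-position bias is added to the
# attention logits instead. The choice of bias_fn is what separates ALiBi
# from a T5-style bucketed bias (see the second sketch further down).
def biased_scores(q, k, bias_fn):
    """scores[i, j] = q_i . k_j / sqrt(d) + bias_fn(i - j)."""
    d = q.shape[-1]
    rel = np.arange(q.shape[0])[:, None] - np.arange(k.shape[0])[None, :]
    return (q @ k.T) / np.sqrt(d) + bias_fn(rel)
```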

Observed behavior on internal workloads:

  • Workload 1 (long legal documents): At 20k–32k tokens, Build A preserves cross-references (e.g., "see Section 2.3" correctly resolves) but becomes noticeably worse at very local syntax/formatting (e.g., JSON and citation punctuation) compared to the 4k baseline.
  • Workload 2 (chat with tool calls): Build B keeps local formatting stable at 32k, but the model increasingly ignores early instructions and over-attends to recent turns.

Your task:

  • Identify which relative-bias design (ALiBi vs T5-style bucketed relative bias) is more consistent with Build B's observed failure mode.
  • Recommend a single change to Build A's RoPE retrofit, expressed in terms of how positions/angles are mapped (e.g., interpolation via base scaling / angle transformation), that would most plausibly reduce the local-syntax regression while keeping the long-range cross-reference strength.

Justify both parts by explicitly linking (1) how RoPE's rotational mechanism encodes relative distance and how scaling/base changes alter frequency/period behavior, and (2) how ALiBi vs T5 bucketed bias shapes attention as distance grows.
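
To ground part (2), here is a minimal sketch of how the two bias families treat growing distance, following their commonly published formulations (ALiBi's head-specific linear penalty; T5's exact-then-logarithmic distance buckets, each mapped to a learned scalar). The slope, bucket count, and max_distance values are illustrative assumptions, not settings from either build.

```python
import numpy as np

def alibi_bias(distance, slope=0.0625):
    """ALiBi: a fixed per-head linear penalty that keeps growing with distance.
    The 1/16 slope stands in for one head of the usual geometric slope set."""
    return -slope * np.asarray(distance, dtype=float)

def t5_bucket(distance, num_buckets=32, max_distance=128):
    """T5-style causal bucketing: exact buckets for small distances, log-spaced
    buckets up to max_distance, and one shared bucket for everything beyond it.
    A learned scalar per bucket (not shown) is what gets added to the logits."""
    d = np.asarray(distance)
    exact = num_buckets // 2
    large = exact + (
        np.log(np.maximum(d, exact) / exact)
        / np.log(max_distance / exact)
        * (num_buckets - exact)
    ).astype(np.int64)
    return np.where(d < exact, d, np.minimum(large, num_buckets - 1))

for dist in (8, 512, 8_192, 30_000):
    print(f"distance {dist:>6}: ALiBi bias {alibi_bias(dist):9.1f}, "
          f"T5 bucket {int(t5_bucket(dist))}")
```

The printout makes the contrast visible: the linear penalty keeps growing all the way out to 30k tokens, while the bucket index stops changing once the distance passes max_distance. Mapping those two profiles onto Build B's behavior is the first half of the task.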
