Case Study

Choosing a Positional Embedding Generalization Strategy

A team is adapting a language model, originally trained on sequences of up to 2048 tokens, for a new legal document analysis task where documents can run to 4096 tokens. They are debating two general approaches to handling positional information for these longer sequences:

  • Approach A (interpolation): Rescale the positional indices of the 4096-token sequence so they fit within the original 0-2047 range.
  • Approach B (extrapolation): Use the pattern learned over the 0-2047 range to mathematically generate new positional values for the 2048-4095 range (see the sketch after this list).
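
The following is a minimal sketch of the two index schemes, assuming sinusoidal positional embeddings (Vaswani et al., 2017) purely for illustration; the names TRAIN_MAX, NEW_MAX, and sinusoidal_embedding are invented for this example and are not part of the case study:

```python
import numpy as np

TRAIN_MAX = 2048   # original training context length (illustrative constant)
NEW_MAX = 4096     # target context length (illustrative constant)

positions = np.arange(NEW_MAX)

# Approach A (interpolation): rescale indices 0..4095 into 0..2047.
# Every index stays inside the range seen during training, but the
# effective step between adjacent tokens shrinks from 1.0 to about 0.5.
interp_positions = positions * (TRAIN_MAX - 1) / (NEW_MAX - 1)

# Approach B (extrapolation): keep unit spacing between tokens, but feed
# the model indices 2048..4095 that it never saw during training.
extrap_positions = positions.astype(float)

def sinusoidal_embedding(pos, d_model=8):
    """Simplified sinusoidal embedding: sin/cos features concatenated.

    The sinusoidal formula is defined for any real-valued position, so it
    can be evaluated for both the rescaled and the extended indices.
    """
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    angles = np.outer(pos, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

emb_interp = sinusoidal_embedding(interp_positions)
emb_extrap = sinusoidal_embedding(extrap_positions)
print(emb_interp.shape, emb_extrap.shape)  # (4096, 8) (4096, 8)
```

The sketch makes the raw trade-off concrete: interpolation compresses the spacing between adjacent indices to roughly half a unit, while extrapolation preserves unit spacing at the cost of producing index values outside the trained range.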

Evaluate the primary trade-off between these two approaches regarding the model's ability to perceive the relative distance between tokens. Which approach is more likely to preserve high-resolution detail about the proximity of adjacent tokens, and why?
