Case Study

Model Selection for Long-Sequence Tasks

An engineering team has trained two language models, Model A and Model B, on a dataset whose maximum text length is 512 tokens. The models differ only in how their attention mechanism handles the relative distance between tokens:

  • Model A: Learns a unique, independent bias parameter for every relative distance seen during training (i.e., a separate parameter for distance 1, distance 2, ..., up to distance 511).
  • Model B: Learns unique parameters for small, common distances but groups larger distances into 'buckets'. All distances within a single bucket (e.g., all distances from 64 to 95) share a single, common bias parameter (see the sketch after this list).

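To make the difference concrete, here is a minimal sketch of the kind of distance bucketing Model B might use, loosely following a log-spaced scheme of the sort popularized by T5. The function name relative_position_bucket and the parameter values (num_buckets=32, max_distance=128) are illustrative assumptions, not details given in the case study.

    import math

    def relative_position_bucket(distance, num_buckets=32, max_distance=128):
        """Map a non-negative relative distance to a bucket index (illustrative sketch).

        Small distances each get their own bucket; larger distances share
        logarithmically sized buckets; anything beyond max_distance falls
        into the last bucket.
        """
        exact = num_buckets // 2  # distances 0 .. exact-1 each get a unique bucket
        if distance < exact:
            return distance
        # Log-spaced shared buckets for larger distances.
        bucket = exact + int(
            math.log(distance / exact) / math.log(max_distance / exact) * (num_buckets - exact)
        )
        return min(bucket, num_buckets - 1)

    # A distance of 1500 never appears in 512-token training data, yet it still
    # maps to a valid shared bucket, so a bucketed model can reuse a learned bias.
    print(relative_position_bucket(10))    # -> 10  (unique bucket for a small distance)
    print(relative_position_bucket(80))    # -> 28  (shared bucket for a mid-range distance)
    print(relative_position_bucket(1500))  # -> 31  (clamped to the last bucket)

By contrast, a per-distance lookup table like Model A's simply has no learned entry for any distance greater than 511.
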
The team now needs to deploy one of these models for a task that involves processing documents up to 2000 tokens long. Which model should they choose, and why?

