Multiple Choice

A language model was trained exclusively on text segments of at most 512 tokens. During inference it must process a 1000-token document, encountering a relative query-key offset of 700 for the first time. Why is an architecture that groups offsets into 'buckets' and shares a single learnable parameter per bucket better equipped to handle this novel offset than a hypothetical model that learns a separate parameter for every individual offset?
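The intuition can be made concrete with a minimal sketch loosely modeled on T5-style relative position buckets (the function name, bucket count, and log-spacing constants below are illustrative assumptions, not the exact scheme of any particular model): small offsets each get a fine-grained bucket of their own, larger offsets share progressively wider log-spaced buckets, and anything beyond a cutoff is clamped into the last bucket. A per-offset lookup table would have no entry for 700, but bucketing maps 700 into the same bucket as offsets the model did see during training, so it reuses an already-trained parameter.

```python
import math

def relative_position_bucket(offset: int, num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a query-key offset to a bucket index in [0, num_buckets).

    Illustrative constants: half the buckets are exact, the rest are
    log-spaced up to max_distance, beyond which everything is clamped.
    """
    offset = abs(offset)
    max_exact = num_buckets // 2
    if offset < max_exact:
        # Small offsets are fine-grained: one bucket each.
        return offset
    # Larger offsets share logarithmically widening buckets.
    bucket = max_exact + int(
        math.log(offset / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    # Anything at or beyond max_distance lands in the final bucket.
    return min(bucket, num_buckets - 1)

print(relative_position_bucket(5))    # → 5  (its own fine-grained bucket)
print(relative_position_bucket(400))  # → 31 (seen during 512-token training)
print(relative_position_bucket(700))  # → 31 (novel offset, same trained bucket)
```

Because the bucket index, not the raw offset, selects the parameter, the model's behavior at offset 700 is defined by construction; the per-offset alternative would need a parameter it never had the chance to learn.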

Updated 2025-10-01

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science