Short Answer

Generalization of Relative Positional Bias

A transformer-based model is trained exclusively on text sequences with a maximum length of 512 tokens. This model uses a relative positional encoding scheme where different query-key offsets are grouped into a limited number of 'buckets', and each bucket shares a single learnable bias parameter. During inference, the model is tasked with processing a document that is 1000 tokens long. Explain how this bucketing strategy enables the model to compute meaningful attention scores for token pairs with relative distances (e.g., -600, 750) that were never encountered during the training phase.
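One way to see why out-of-range offsets still receive a usable bias is to look at how such bucketing is typically implemented. The sketch below follows a T5-style scheme; the function name and the num_buckets / max_distance values are illustrative assumptions, not details given in the question. Small offsets each get their own bucket, larger offsets share logarithmically wider buckets, and any offset beyond max_distance is clamped into the last bucket, so an unseen distance such as -600 or 750 simply reuses a bias parameter that was already trained on the longest in-range offsets.

```python
import math

def relative_position_bucket(relative_position: int,
                             num_buckets: int = 32,
                             max_distance: int = 128,
                             bidirectional: bool = True) -> int:
    """Map a signed query-key offset to a bucket index.

    Sketch modeled on a T5-style scheme: exact buckets for small offsets,
    log-spaced buckets for larger ones, and clamping beyond max_distance.
    """
    bucket = 0
    n = relative_position
    if bidirectional:
        # Half the buckets for each direction (key before / after the query).
        num_buckets //= 2
        if n > 0:
            bucket += num_buckets
        n = abs(n)
    else:
        n = max(-n, 0)

    max_exact = num_buckets // 2
    if n < max_exact:
        # Small offsets: one bucket per exact distance.
        return bucket + n

    # Larger offsets: logarithmically wider buckets up to max_distance,
    # then clamp everything beyond that into the final bucket.
    log_bucket = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return bucket + min(log_bucket, num_buckets - 1)


if __name__ == "__main__":
    # Offsets seen in training (|offset| <= 511) and offsets only seen at
    # inference (e.g. -600, 750) fall into the same few long-range buckets,
    # so the same learned bias parameters are reused for the longer document.
    for offset in [1, 16, 100, 511, -600, 750]:
        print(offset, "->", relative_position_bucket(offset))
```

With these illustrative settings, offsets 511 (seen in training) and 750 (never seen) land in the same bucket, which is why the model can still produce a meaningful, if coarse, attention bias for distances beyond its training length.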


Updated 2025-10-02


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
