Case Study

Analyzing Parameter Impact on Logarithmic Bucketing

Two language models are configured with a mechanism that groups large relative position offsets into a limited number of 'buckets'. Both models use a total of 32 buckets (n_b = 32). For any offset d greater than 16, the bucket index is calculated using the following formula:

b(d) = 16 + floor( (log(d) - log(16)) / (log(dist_max) - log(16)) * 16 )

  • Model A sets its maximum expected offset (dist_max) to 128.
  • Model B sets its maximum expected offset (dist_max) to 512.

Which model provides finer-grained distinctions for offsets between 60 and 120? Explain your reasoning by referencing how the dist_max parameter influences the formula's output.
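
One way to explore the question is to evaluate the formula over the range of interest and count how many distinct bucket indices each model assigns. The following is a minimal sketch, not a reference implementation: the function names `bucket` and `distinct_buckets` are hypothetical, and the code assumes the natural logarithm (the base cancels in the ratio, so any base gives the same result).

```python
import math

def bucket(d, dist_max, n_b=32):
    """Bucket index for an offset d > n_b / 2, following the case-study formula."""
    half = n_b // 2  # 16 when n_b = 32
    if d <= half:
        raise ValueError("the logarithmic formula only applies to d > n_b / 2")
    # Fraction of the log-scale distance from `half` to `dist_max` covered by d.
    frac = (math.log(d) - math.log(half)) / (math.log(dist_max) - math.log(half))
    return half + math.floor(frac * half)

def distinct_buckets(lo, hi, dist_max):
    """Sorted list of the distinct bucket indices used for offsets lo..hi inclusive."""
    return sorted({bucket(d, dist_max) for d in range(lo, hi + 1)})

# Compare the two configurations over the offsets named in the question.
for name, dist_max in (("Model A", 128), ("Model B", 512)):
    buckets = distinct_buckets(60, 120, dist_max)
    print(name, "dist_max =", dist_max, "buckets:", buckets, "count:", len(buckets))
```

The model whose range maps to more distinct indices separates those offsets more finely. Note that the formula as written yields n_b at d = dist_max, one past the last valid index, so a practical implementation would likely clamp the result. The sketch omits the clamp because offsets 60 to 120 stay below both dist_max values.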

Updated 2025-10-03
