Multiple Choice

In a causal self-attention mechanism, a linear relative position bias is added to the attention scores before the softmax. For a query at position i attending to a key at position j, the bias is B(i, j) = -β * (i - j) for j ≤ i, where β is a positive scalar. How would the attention behavior of a model using a large positive β value (e.g., β = 1.0) compare to that of a model using a small positive β value (e.g., β = 0.1)?
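To make the formula concrete, here is a minimal sketch of how the bias reshapes the softmax weights for a single query. It is an illustration under stated assumptions, not a reference implementation: the function name `biased_attention_weights` is hypothetical, the query sits at position i = 9, and all content-based scores are set to zero so that only the position bias affects the result.

```python
import math

def biased_attention_weights(i, beta, scores=None):
    """Softmax weights for query position i over keys 0..i (causal),
    after adding the linear bias -beta * (i - j) to each raw score."""
    if scores is None:
        # Assumption for illustration: identical (zero) content scores,
        # so any difference between weights comes from the bias alone.
        scores = [0.0] * (i + 1)
    logits = [s - beta * (i - j) for j, s in enumerate(scores)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

large_beta = biased_attention_weights(i=9, beta=1.0)
small_beta = biased_attention_weights(i=9, beta=0.1)

print(f"beta=1.0: nearest key {large_beta[-1]:.3f}, farthest key {large_beta[0]:.5f}")
print(f"beta=0.1: nearest key {small_beta[-1]:.3f}, farthest key {small_beta[0]:.5f}")
```

Under these assumptions the penalty grows linearly with the distance i - j, so each additional step back multiplies a key's unnormalized weight by exp(-β). Comparing the two printed rows shows how quickly the weight falls off with distance for each β.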

Updated 2025-09-26

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science