Learn Before
A developer is implementing an attention layer for a model that incorporates positional information by adding a learnable scalar bias based on the relative distance between tokens. Given a query vector q_i for a token at position i, a key vector k_j for a token at position j, a key dimension d_k, and the specific learnable bias u_{b(i-j)} for their relative position, which of the following expressions correctly computes the unnormalized attention score (the value passed into the softmax function) for this architectural design?
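To make the design concrete, here is a minimal sketch of an unnormalized attention score with an additive learnable relative-position bias. The bucketing function `b(i - j)` is assumed here to be simple clipping to a fixed window (real implementations such as T5's use log-spaced buckets), and the optional `scale` argument marks the design choice the question probes: scaled dot-product variants divide by sqrt(d_k), while T5-style attention omits that scaling.

```python
import numpy as np

def relative_bias_score(q_i, k_j, i, j, bias_table, max_dist=8, scale=None):
    """Unnormalized attention score with a learnable relative-position bias.

    bias_table: 1-D array of learnable scalars indexed by a bucketed
        relative distance b(i - j); here b() is plain clipping to
        [-max_dist, max_dist], an illustrative stand-in for T5's buckets.
    scale: pass sqrt(d_k) for a scaled dot-product variant; leave as
        None for T5-style unscaled attention.
    """
    # b(i - j): clip the relative offset and shift it to a valid index.
    bucket = int(np.clip(i - j, -max_dist, max_dist)) + max_dist
    score = float(np.dot(q_i, k_j))          # q_i . k_j
    if scale is not None:
        score /= scale                       # optional 1/sqrt(d_k) scaling
    return score + float(bias_table[bucket])  # add u_{b(i-j)}

# Usage: d_k = 4, token positions i = 5, j = 2.
rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
bias_table = np.zeros(2 * 8 + 1)  # learnable parameters in practice
s = relative_bias_score(q, k, 5, 2, bias_table, scale=np.sqrt(4))
```

The value `s` is what would be passed into the softmax over all key positions j for a fixed query position i.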
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analysis of T5 Attention Formula Modifications
Evaluating the Design of T5's Unscaled Attention Mechanism