Learn Before
Calculating a Scaled Attention Score
In a transformer's self-attention layer, a query vector q = [2, 0, 1, 3] is being compared to a key vector k = [1, 2, 2, 1]. The dimensionality of these vectors (d_k) is 4. Based on the standard scaled dot-product attention mechanism, what is the resulting unnormalized attention score? Provide the final numerical answer.
3.5 (the dot product q · k = 2·1 + 0·2 + 1·2 + 3·1 = 7, which is then scaled by √d_k = √4 = 2, giving 7/2 = 3.5).
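The arithmetic in the question can be checked with a short sketch, using only the query vector, key vector, and d_k given above:

```python
import math

# Query and key vectors from the question; d_k = 4.
q = [2, 0, 1, 3]
k = [1, 2, 2, 1]
d_k = len(q)

# Raw dot product: 2*1 + 0*2 + 1*2 + 3*1 = 7
dot = sum(qi * ki for qi, ki in zip(q, k))

# Scaled attention score: 7 / sqrt(4) = 3.5
score = dot / math.sqrt(d_k)
print(score)
```

This prints the unnormalized (pre-softmax) attention score for this query-key pair.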
Tags
Data Science
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is designing a self-attention layer for a text processing model. They notice that as they increase the dimensionality (d_k) of the query and key vectors, the training process becomes unstable and the gradients used for learning become extremely small. Which of the following best explains this phenomenon and the standard solution implemented within the attention mechanism?
A transformer's self-attention layer calculates an output vector for each input token. Arrange the following computational steps in the correct sequence to produce a single output vector, based on its query vector and the full set of key and value vectors for the input sequence.
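The step sequence described above can be sketched for a single query as follows; the function name and the toy inputs in the test are illustrative, not part of the question:

```python
import math

def attention_output(q, keys, values):
    """Scaled dot-product attention for one query vector.

    Steps, in order:
      1. Dot the query with every key vector.
      2. Scale each score by sqrt(d_k).
      3. Softmax the scaled scores into attention weights.
      4. Return the weighted sum of the value vectors.
    """
    d_k = len(q)
    # 1. Dot product of the query with each key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # 2. Scale by sqrt(d_k).
    scaled = [s / math.sqrt(d_k) for s in scores]
    # 3. Numerically stable softmax over the scaled scores.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 4. Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

Tokens whose keys align more closely with the query receive larger weights, so their value vectors dominate the output.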
Attention Score in Transformers