Multiple Choice

A transformer model computes the output of a single attention head $j$ at token position $i$ using the formula $\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$, where $g(j)$ maps query head $j$ to a key-value group. What is the primary consequence of using the group-indexed key $\mathbf{K}_{\le i}^{[g(j)]}$ instead of a head-specific key $\mathbf{K}_{\le i}^{[j]}$?
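
To make the notation concrete, here is a minimal NumPy sketch of the formula. It rests on several assumptions that are not in the question: `group_of` is one hypothetical choice of $g(j)$ (consecutive query heads share a group), `att_qkv` is plain scaled dot-product attention, positions are 0-indexed, and the sizes and the names `K_cache`, `V_cache`, and `grouped_heads` are illustrative only.

```python
import numpy as np

def group_of(j, n_heads, n_groups):
    """Hypothetical g(j): consecutive query heads map to the same group."""
    return j // (n_heads // n_groups)

def att_qkv(q, K, V):
    """Scaled dot-product attention for one query vector.
    q: (d,), K: (t, d), V: (t, d) -> (d,)"""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

def grouped_heads(q_i, K_cache, V_cache, i):
    """Compute head_j for every query head j at position i.
    q_i: (n_heads, d); K_cache, V_cache: (n_groups, seq_len, d)."""
    n_heads = q_i.shape[0]
    n_groups = K_cache.shape[0]
    out = []
    for j in range(n_heads):
        g = group_of(j, n_heads, n_groups)
        # Keys and values are indexed by group g(j), not by head j,
        # and restricted to positions <= i.
        out.append(att_qkv(q_i[j], K_cache[g, : i + 1], V_cache[g, : i + 1]))
    return np.stack(out)

# Illustrative sizes (assumed): 8 query heads, 2 groups.
rng = np.random.default_rng(0)
n_heads, n_groups, seq_len, d = 8, 2, 5, 4
q_i = rng.normal(size=(n_heads, d))
K_cache = rng.normal(size=(n_groups, seq_len, d))
V_cache = rng.normal(size=(n_groups, seq_len, d))
heads = grouped_heads(q_i, K_cache, V_cache, i=seq_len - 1)
```

In this sketch `K_cache` and `V_cache` have a leading dimension of `n_groups` rather than `n_heads`, while each query head still keeps its own query vector `q_i[j]`.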

Updated 2025-10-02

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science