Short Answer

Interpreting Attention Matrix Structures

An engineer is debugging two attention-based models processing a long document. They visualize the attention weight matrix, α, for a specific query token from each model.

  • Model 1 produces a matrix in which nearly every cell is non-zero, indicating that the query token attends, to some degree, to essentially every other token in the document.
  • Model 2 produces a matrix in which only a small, localized subset of cells has non-zero values; the vast majority are zero.

Based on these observations, contrast the two models in terms of (1) the computational resources required to calculate the output and (2) how the contextual representation for the query token is formed.
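The contrast can be made concrete with a minimal NumPy sketch of scaled dot-product attention for a single query token. Model 1 corresponds to unmasked (dense) attention, where every key receives a score and every value contributes to the context vector; Model 2 is illustrated here with a local-window mask, one common form of sparse attention. The function name, window size, and mask construction are illustrative choices, not details given in the question.

```python
import numpy as np

def attention_output(q, K, V, mask=None):
    """Context vector for one query under scaled dot-product attention.

    mask: boolean array over key positions; False entries receive zero weight.
    """
    scores = K @ q / np.sqrt(q.shape[0])          # one score per key token: O(n*d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # excluded tokens -> weight exactly 0
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, weights                   # context vector is a weighted sum of values

rng = np.random.default_rng(0)
n, d = 12, 4                                      # toy sequence length and head dimension
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = rng.normal(size=d)

# Model 1: dense attention -- every token gets some nonzero weight,
# so the context vector blends information from the whole document.
out_dense, w_dense = attention_output(q, K, V)

# Model 2: local (windowed) sparse attention -- only a neighborhood
# of the query position contributes to the context vector.
query_pos, window = 6, 2
mask = np.zeros(n, dtype=bool)
mask[query_pos - window : query_pos + window + 1] = True
out_sparse, w_sparse = attention_output(q, K, V, mask)

print((w_dense > 0).sum())   # 12 -- all tokens contribute
print((w_sparse > 0).sum())  # 5  -- only the local window contributes
```

In this sketch the dense model scores all n keys per query (O(n²·d) across the sequence), while the sparse model only needs scores inside the window; in a real sparse-attention implementation the masked positions are skipped rather than scored and zeroed out.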

Updated 2025-10-10

Tags

  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences
  • Analysis in Bloom's Taxonomy
  • Cognitive Psychology
  • Psychology
  • Social Science
  • Empirical Science
  • Science