Short Answer

Rationale for Causal Mask Values

In a self-attention mechanism designed for sequential data processing (such as text generation), a mask matrix is added to the raw attention scores before the softmax normalization step. The mask uses 0 for positions a token is allowed to attend to, and negative infinity (-∞) for positions it is forbidden from attending to. Explain precisely why negative infinity is used for the forbidden positions, and what effect this has on the final, normalized attention weights.
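As an illustration, the NumPy sketch below demonstrates the mechanism the question describes (the 4-token sequence length, the random scores, and softmax as the normalization step are assumptions made for demonstration): adding -∞ to a position drives its exponent to exactly 0, so forbidden positions receive zero attention weight while the allowed weights in each row still sum to 1.

```python
import numpy as np

# Raw attention scores for a 4-token sequence (query x key); the values
# are random here purely for illustration.
scores = np.random.randn(4, 4)

# Causal mask: 0 on and below the diagonal (allowed), -inf strictly
# above it (forbidden future positions).
mask = np.triu(np.full((4, 4), -np.inf), k=1)
masked = scores + mask  # forbidden positions become -inf

# Softmax over each row: exp(-inf) = 0, so forbidden positions receive
# exactly zero weight, and the allowed weights renormalize to sum to 1.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights)  # lower-triangular rows, each summing to 1
```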

Updated 2025-10-03

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.5 Inference - Foundations of Large Language Models

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science