Multiple Choice

A researcher implements a modified attention mechanism in which the learnable scalar bias, indexed by relative position, is added after the query-key dot product has been scaled:

$$\alpha(i, j) = \text{Softmax}\left( \frac{q_i^T k_j}{\sqrt{d}} + u_{b(i-j)} + \text{Mask}(i, j) \right)$$

What is the most significant consequence of this specific modification compared with the standard approach of adding the bias before scaling?
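To make the contrast in the question concrete, here is a minimal NumPy sketch (toy dimensions and random values are assumptions for illustration) that computes the attention weights both ways: with the relative-position bias $u_{b(i-j)}$ added after the $\sqrt{d}$ scaling, as in the formula above, and added before it, so the bias is also divided by $\sqrt{d}$.

```python
import numpy as np

# Toy setup (assumed for illustration): sequence length n, head dim d.
rng = np.random.default_rng(0)
n, d = 4, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
u = rng.normal(size=(2 * n - 1,))  # one learnable scalar per offset i - j

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Causal mask: 0 where attention is allowed, -inf where j > i.
mask = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)

# Look up u_{b(i-j)} for every query/key pair (i, j).
offsets = np.subtract.outer(np.arange(n), np.arange(n))  # matrix of i - j
bias = u[offsets + n - 1]

# Modified variant from the question: bias added AFTER scaling,
# so it enters the logits at full magnitude.
alpha_after = softmax(Q @ K.T / np.sqrt(d) + bias + mask)

# Standard variant for comparison: bias added BEFORE scaling,
# so it is attenuated by 1/sqrt(d) along with the dot products.
alpha_before = softmax((Q @ K.T + bias) / np.sqrt(d) + mask)

# In general the two give different attention distributions.
print(np.allclose(alpha_after, alpha_before))
```

Because the bias escapes the $1/\sqrt{d}$ attenuation in the modified variant, the same learned values of $u$ exert a stronger influence on the attention distribution than they would in the standard placement.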

Updated 2025-10-08

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science