Learn Before
Dot Attention
Dot attention is the first type of multiplicative attention. It is the easiest option to implement and is based on dot-product similarity. Geometrically, the dot product depends on the angle between the vectors: if two vectors point in similar directions, their dot product is large, and the further apart their directions are, the smaller it becomes. For example, the dot product of orthogonal vectors is zero. Under this scoring, the encoder vectors that are most related to the decoder vector receive the highest scores. One disadvantage of dot attention is that it involves no learning, while the other types use some form of learned parameters:
[Figure: dot-product scores between encoder vectors and the decoder vector]
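Concretely, the score is just the dot product between the decoder vector and each encoder vector. Below is a minimal NumPy sketch; the vectors are hypothetical, chosen only to illustrate how alignment drives the score:

```python
import numpy as np

# Minimal sketch of dot attention. The vectors are hypothetical examples;
# the mechanism itself has no learned parameters.
decoder_state = np.array([1.0, 0.0])

encoder_states = np.array([
    [0.9,  0.1],   # similar direction to the decoder -> large positive score
    [0.0,  1.0],   # orthogonal                       -> score of zero
    [-1.0, 0.0],   # opposite direction               -> negative score
])

scores = encoder_states @ decoder_state          # dot-product scores: [0.9, 0.0, -1.0]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights

print(scores)
print(weights)  # the most aligned encoder vector gets the largest weight
```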
Tags
- Data Science
- Foundations of Large Language Models Course
- Computing Sciences
- Ch.1 Pre-training - Foundations of Large Language Models
- Foundations of Large Language Models
Learn After
Example of Predicting Masked Words: Kitten Playing
Example of Masked Language Modeling: Kitten Chasing Ball
Example of Context-Based Prediction: Kitten Chasing Ball
In a sequence-to-sequence model, an attention mechanism calculates a score for three input vectors (A, B, and C) relative to a single output vector (D). The scoring function is the simple dot product between the output vector and each input vector. You are given the following geometric relationships:
- Vector A points in a very similar direction to Vector D.
- Vector B is orthogonal (at a 90-degree angle) to Vector D.
- Vector C points in the opposite direction of Vector D.
Which input vector will receive the highest attention score, and what is the underlying reason for this?
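As a quick numerical check, here is a small Python sketch using hypothetical unit vectors that match the stated geometry (the exact coordinates are assumptions, chosen only to reproduce the described angles):

```python
import numpy as np

# Hypothetical unit vectors consistent with the stated geometry (assumed, for illustration).
D = np.array([1.0, 0.0])                            # output vector
A = np.array([0.98, 0.2]); A /= np.linalg.norm(A)   # nearly same direction as D
B = np.array([0.0, 1.0])                            # orthogonal to D
C = np.array([-1.0, 0.0])                           # opposite direction to D

for name, v in [("A", A), ("B", B), ("C", C)]:
    print(name, np.dot(v, D))  # A ~ +0.98, B = 0.0, C = -1.0

# The dot product grows with cosine similarity to D, so the most
# aligned input vector receives the highest attention score.
```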
Evaluating Attention Mechanisms in Machine Translation
Calculating a Dot Attention Score