Learn Before
A developer is building a model designed to generate text sequentially, where each new word is predicted based on the words that came before it. They consider modifying the model by removing the specific constraint that prevents a position in the sequence from attending to subsequent positions. What is the most likely consequence of this change on the model's training and generation capabilities?
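The constraint in question is the causal (look-ahead) mask in decoder self-attention. Below is a minimal NumPy sketch of scaled dot-product attention with and without that mask; the `attention` function, shapes, and toy data are illustrative assumptions for this card, not code from the source:

```python
import numpy as np

def attention(q, k, v, causal=True):
    # Scaled dot-product attention for a single head, sequence length T.
    t = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (T, T) attention logits
    if causal:
        # Forbid attending to the future: mask entries with column > row.
        future = np.triu(np.ones((t, t), dtype=bool), 1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 tokens, model dim 8

causal_out = attention(x, x, x, causal=True)       # position i sees tokens <= i
leaky_out = attention(x, x, x, causal=False)       # every position sees everything
```

With the mask removed, teacher-forced training lets each position attend to the very token it is supposed to predict, so the training loss becomes trivially low without the model learning next-token prediction; at generation time future tokens do not exist yet, so the train/inference mismatch breaks autoregressive generation.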
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Core Components of a Transformer Decoding Network
Masked Self-Attention in Transformer Decoders
A standard Transformer decoder block contains two distinct attention sub-layers. Which statement accurately differentiates the roles and data sources for these two sub-layers?
Within a single decoder block of a standard Transformer architecture, information is processed through three main computational sub-layers. Arrange these sub-layers in the correct operational sequence.
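The last two related questions concern the sub-layer structure of a standard Transformer decoder block. A minimal post-norm sketch of the canonical ordering (masked self-attention, then encoder-decoder cross-attention, then the position-wise feed-forward network, each wrapped in a residual connection and layer normalization); the function names and identity stand-ins are illustrative assumptions, not the cards' own code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean / unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def decoder_block(x, enc_out, self_attn, cross_attn, ffn):
    # Sub-layer 1: masked self-attention over the decoder's own inputs.
    x = layer_norm(x + self_attn(x))
    # Sub-layer 2: cross-attention -- queries come from the decoder,
    # keys/values from the encoder output (the second attention sub-layer's
    # distinct data source).
    x = layer_norm(x + cross_attn(x, enc_out))
    # Sub-layer 3: position-wise feed-forward network.
    x = layer_norm(x + ffn(x))
    return x

# Identity stand-ins so the block runs end to end (real sub-layers
# would be learned attention and MLP modules).
out = decoder_block(
    np.zeros((4, 8)), np.zeros((6, 8)),
    self_attn=lambda x: x,
    cross_attn=lambda x, enc: x,
    ffn=lambda x: x,
)
```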