Learn Before
Decoder-Only Transformer as a Language Model
The decoder-only Transformer architecture is a prevalent design for Large Language Models (LLMs). It is typically created by modifying a standard Transformer decoder, specifically by eliminating the cross-attention sub-layers, which are unnecessary because there is no encoder output to attend to. The central components of this architecture are stacked Transformer blocks, each comprising a self-attention sub-layer and a feed-forward network (FFN) sub-layer. To prevent the model from accessing right-context (future tokens), a causal masking variable is incorporated into the self-attention mechanism. Finally, the output layer uses a Softmax function to generate a probability distribution over the next token, given the sequence of previous tokens, enabling auto-regressive generation.
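The following is a minimal PyTorch sketch of this architecture. All names and hyperparameters here (DecoderOnlyBlock, DecoderOnlyLM, d_model, n_heads, etc.) are illustrative assumptions, not taken from the course; it is meant only to show the structure described above: stacked blocks of masked self-attention plus FFN, no cross-attention, and a Softmax output over the vocabulary.

import torch
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    # One Transformer block: masked self-attention + FFN.
    # Note there is no cross-attention sub-layer.
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Masked self-attention: each position attends only to itself
        # and positions to its left (no right-context).
        h, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + h)
        x = self.norm2(x + self.ffn(x))
        return x

class DecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, d_ff=1024,
                 n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            DecoderOnlyBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        # Upper-triangular -inf mask blocks attention to future tokens.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"),
                       device=token_ids.device),
            diagonal=1,
        )
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        for block in self.blocks:
            x = block(x, mask)
        # Softmax yields a probability distribution over the next token
        # at each position, as described above.
        return torch.softmax(self.out(x), dim=-1)

# Illustrative greedy auto-regressive generation with this sketch:
model = DecoderOnlyLM(vocab_size=1000)
ids = torch.tensor([[1]])  # hypothetical start-token id
for _ in range(10):
    probs = model(ids)
    next_id = probs[:, -1].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=1)

The generation loop makes the auto-regressive property concrete: at each step the model conditions only on the tokens produced so far and picks the most probable next token (greedy decoding, which appears again in the Learn After section).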
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Ch.5 Inference - Foundations of Large Language Models
Related
Decoder-Only Transformer as a Language Model
An engineering team is tasked with building a system to perform sentiment analysis on customer reviews. The goal is to classify each review as 'positive', 'negative', or 'neutral'. For an accurate classification, the model must be able to understand the full context of the entire review, including how words at the end of a sentence can influence the meaning of words at the beginning. Which of the following architectural approaches is best suited for this specific task?
You are a machine learning engineer evaluating different model architectures for three distinct natural language processing projects. Match each project description with the most suitable architectural approach based on its core requirements.
Architectural Design for a Creative Writing Assistant
Architectural Choice for Document Summarization
Learn After
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Neural Network-Based Next-Token Probability Distribution