1Cademy - Start of Sentence (SOS) Token

Learn Before

Language Models (LMs)

Definition

Start of Sentence (SOS) Token

The Start of Sentence (SOS) token is a special symbol used in language modeling to indicate the beginning of a text sequence. It is commonly denoted as <s> or <SOS>.
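
As a minimal sketch of the definition above, the snippet below prepends an SOS symbol to a tokenized sentence and lists the context each real token would be predicted from. The helper names `add_sos` and `prediction_contexts` are hypothetical and chosen only for illustration, and the choice of `<s>` rather than `<SOS>` is an assumption.

```python
SOS = "<s>"  # assumed start symbol; some systems write <SOS> instead


def add_sos(tokens):
    """Return a new list with the SOS token placed before the first token."""
    return [SOS] + list(tokens)


def prediction_contexts(tokens):
    """Pair each real token with the tokens that precede it, SOS included."""
    padded = add_sos(tokens)
    return [(tuple(padded[:i]), padded[i]) for i in range(1, len(padded))]


if __name__ == "__main__":
    # Each line shows the conditioning context and the token predicted from it.
    for context, target in prediction_contexts(["The", "cat", "sat"]):
        print(context, "->", target)
```

With the SOS token in place, even the first word has a non-empty context to condition on.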

Updated 2025-10-08

References

Reference of Foundations of Large Language Models Course

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluating language models
Shannon's Foundational Work on Language Modeling
Generalization of the Language Modeling Concept
Chain Rule for Sequence Probability
Deep Learning Approach to Language Modeling
Output Token Sequence in LLMs
Start of Sentence (SOS) Token
[CLS] Token as a Start Symbol
A system is designed to predict the probability of a sequence of words. For the sequence 'The dog ran', the system provides the following conditional probabilities:
- The probability of 'The' occurring at the start of a sequence is 0.2.
- The probability of 'dog' occurring after 'The' is 0.3.
- The probability of 'ran' occurring after 'The dog' is 0.7.
Based on the fundamental principle used by such systems to determine the likelihood of a full sequence, what is the overall probability of the
Analyzing Language Model Probability Assignments
A system's primary goal is to predict the probability of a sequence of tokens. To calculate the total probability for the sequence 'The quick brown fox', it breaks the problem down into a series of conditional probability calculations. Arrange the following calculations in the correct order that the system would use to find the total probability of the sequence.
Evaluating a Language Model's Probabilistic Output
Character-Level Language Model
Types of Language Models

Learn After

A language model is designed to calculate the probability of a sentence by multiplying the conditional probabilities of each word given the words that came before it. For the sentence 'The cat sat', this would be calculated as P('The') * P('cat' | 'The') * P('sat' | 'The cat'). What is the fundamental problem with calculating the probability of the very first word, 'The', in this specific manner?
Applying the Start of Sequence Token
A language model is tasked with calculating the probability of the sentence 'The quick brown fox'. Using the chain rule of probability and a special start-of-sentence token denoted as <s>, how would the model correctly formulate this calculation?
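
One way to write out the calculation described in the item above is sketched here. It assumes the common convention that the start token is always present, so its own probability is taken to be 1; the notation \Pr and \langle s \rangle is chosen for illustration and requires the amsmath package.

```latex
\begin{align*}
\Pr(\langle s \rangle,\ \text{The},\ \text{quick},\ \text{brown},\ \text{fox})
  &= \Pr(\langle s \rangle)
     \cdot \Pr(\text{The} \mid \langle s \rangle)
     \cdot \Pr(\text{quick} \mid \langle s \rangle\ \text{The}) \\
  &\quad \cdot \Pr(\text{brown} \mid \langle s \rangle\ \text{The quick})
     \cdot \Pr(\text{fox} \mid \langle s \rangle\ \text{The quick brown}),
  \qquad \Pr(\langle s \rangle) = 1 .
\end{align*}
```

Under this convention every factor, including the one for the first word, is a conditional probability with a defined context.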
