Treating multi-class classification directly as a vector-valued linear regression problem (minimizing the difference between raw outputs $$\mathbf{o}$$ and one-hot labels $$\mathbf{y}$$) is unsatisfactory for probability estimation. First, there is no guarantee that the outputs will sum to $$1$$, which is required for a valid probability distribution. Second, there is no guarantee that the outputs will be non-negative or bounded by $$1$$. These limitations render the solution difficult to interpret probabilistically and brittle to outliers.

Limitations of Direct Linear Regression for Classification

To estimate class probabilities across multiple categories, a classification model requires an output for each class. Using a linear model, this is achieved by defining an affine function for each output. For $$d$$ input features and $$q$$ output categories, the unnormalized outputs (logits) $$\mathbf{o}$$ are computed as $$\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$$, where the weight matrix $$\mathbf{W} \in \mathbb{R}^{q 	imes d}$$ contains $$q 	imes d$$ scalars and the bias $$\mathbf{b} \in \mathbb{R}^q$$ contains $$q$$ scalars. Because every output depends on every input feature, this computation represents a fully connected layer in a single-layer neural network.

Claude

Because most classification problems lack a natural ordering among classes, categorical labels are typically represented using a one-hot encoding. A one-hot encoding is a vector with as many components as there are categories, where the element corresponding to the true category is set to $$ 1 $$ and all other elements are set to $$ 0 $$. For example, a label for three classes might be represented as the three-dimensional vector $$ (1, 0, 0) $$.

One-Hot Encoding

Dive into Deep Learning

Linear Model for Multi-Class Classification

In language modeling, representing a token by its scalar index is ineffective because numerical proximity does not equate to semantic similarity (for instance, the 45th and 46th words are not necessarily related in meaning). Instead, each token is represented using a one-hot encoding: a vector with a length equal to the vocabulary size, denoted as $$ N $$. In this vector, the entry corresponding to the token's specific index is set to $$ 1 $$, while all other entries are set to $$ 0 $$. For example, with a vocabulary of five elements, the index $$ 2 $$ would be represented as the one-hot vector $$ [0, 0, 1, 0, 0] $$.

One-Hot Encoding for Language Model Tokens

One-hot word vectors are fundamentally limited because they cannot accurately express the semantic similarity between different words. Since the vectors for any two distinct words are orthogonal, their cosine similarity is always $$0$$. Consequently, one-hot vectors completely fail to encode any meaningful relationships or similarities among words, making them a poor choice for word representation.

Learn Before

Related

Learn After