To efficiently compute the dot products between each center word vector and its context and noise word vectors in the skip-gram model, the implementation uses batch matrix multiplication. The last two axes of the embedded context and noise word vectors are permuted, and the result is multiplied, example by example, with the embedded center word vectors. A single call therefore computes every center-context and center-noise dot product in the minibatch. The output is a tensor of shape $$(\text{batch size}, 1, \text{max\_len})$$ holding the prediction scores.

```python
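import torch

def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)                  # (batch_size, 1, embed_size)
    u = embed_u(contexts_and_negatives)  # (batch_size, max_len, embed_size)
    # Swap the last two axes of u so bmm yields (batch_size, 1, max_len)
    pred = torch.bmm(v, u.permute(0, 2, 1))
    return pred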
```

Batch Matrix Multiplication for Skip-Gram Dot Products

In the forward propagation of the skip-gram model, the input consists of center word indices of shape $$(\text{batch size}, 1)$$ and concatenated context and noise word indices of shape $$(\text{batch size}, \text{max\_len})$$. The two sets of indices are first transformed into dense vectors by embedding layers: one for center words and one for context and noise words. A batch matrix multiplication is then performed between the embedded center words and the embedded context and noise words. The result has shape $$(\text{batch size}, 1, \text{max\_len})$$, and each element is the dot product between a center word vector and one of its context or noise word vectors.
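The shape bookkeeping described above can be checked with a small sketch. The vocabulary size, embedding size, batch size, and `max_len` below are illustrative values chosen for this example, not values from the text, and the indices are random placeholders rather than real training data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch only)
vocab_size, embed_size = 20, 4
batch_size, max_len = 2, 4

embed_v = nn.Embedding(vocab_size, embed_size)  # center word vectors
embed_u = nn.Embedding(vocab_size, embed_size)  # context/noise word vectors

# Random placeholder indices with the shapes described in the text
center = torch.randint(0, vocab_size, (batch_size, 1))
contexts_and_negatives = torch.randint(0, vocab_size, (batch_size, max_len))

v = embed_v(center)                      # (batch_size, 1, embed_size)
u = embed_u(contexts_and_negatives)      # (batch_size, max_len, embed_size)
pred = torch.bmm(v, u.permute(0, 2, 1))  # (batch_size, 1, max_len)

# pred[b, 0, j] should match the plain dot product between example b's
# center vector and its j-th context/noise vector.
manual = torch.dot(v[1, 0], u[1, 2])
print(pred.shape, torch.allclose(pred[1, 0, 2], manual, atol=1e-6))
```

Permuting `u` to `(batch_size, embed_size, max_len)` is what lets `torch.bmm` treat each example as a `(1, embed_size)` by `(embed_size, max_len)` matrix product.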

Claude

The skip-gram model is one of the two primary architectures in the word2vec tool. It assumes that a word can be used to generate its surrounding context words in a text sequence. Because it relies on conditional probabilities to predict these context words from a center word in an unlabeled text corpus, it acts as a self-supervised model that produces semantically meaningful, fixed-length word representations.
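The conditional probability mentioned above is commonly written as a softmax over dot products. The symbols here are notation introduced for this sketch: $\mathbf{v}_c$ is the center word vector of word $w_c$, $\mathbf{u}_o$ is the context word vector of word $w_o$, and $\mathcal{V}$ is the vocabulary index set.

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}$$

The numerator uses the same kind of dot product that the batch matrix multiplication computes for each center and context or noise pair.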

skip-gram

Dive into Deep Learning

An embedding layer operates by mapping a token's integer index $$i$$ directly to the $$i^\textrm{th}$$ row of its learnable weight matrix. The shape of this weight matrix is defined by the dictionary size as the number of rows and the vector dimension as the number of columns. When an embedding layer processes a minibatch of token indices, it retrieves the corresponding vector for each index, effectively appending the vector dimension to the input's shape. For instance, an input of shape $$(2, 3)$$ mapped to a vector dimension of $$4$$ will result in an output tensor of shape $$(2, 3, 4)$$.
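A minimal sketch of this lookup, assuming a dictionary size of 20 (an illustrative value) and the vector dimension of 4 from the example above:

```python
import torch
from torch import nn

# Weight matrix has 20 rows (assumed dictionary size) and 4 columns
embed = nn.Embedding(num_embeddings=20, embedding_dim=4)

x = torch.tensor([[1, 2, 3], [4, 5, 6]])  # token indices, shape (2, 3)
out = embed(x)                            # vectors, shape (2, 3, 4)

print(embed.weight.shape, out.shape)
# Index 2 sits at position [0, 1] of x, so out[0, 1] is row 2 of the weights
print(torch.equal(out[0, 1], embed.weight[2]))
```

Because the layer is a pure row lookup, the output for index $$i$$ is exactly the $$i^\textrm{th}$$ row of `embed.weight`, and gradients flow only to the rows that were looked up.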

Embedding Layer Mapping Token Indices to Vectors

Skip-Gram Forward Propagation Logic

The fastText model introduces a subword embedding approach to address the limitations of traditional word representations. Building upon the skip-gram architecture found in word2vec, fastText calculates the continuous vector for a center word by summing the individual vectors of its constituent subwords.
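The subword sum can be sketched as follows. The helper `subwords`, the dictionary `index`, the layer `embed_z`, and the embedding size of 4 are hypothetical names and values chosen for illustration. The sketch uses only character 3-grams plus the whole bracketed word, and a plain dictionary lookup stands in for however a real implementation maps subwords to vectors.

```python
import torch
from torch import nn

def subwords(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' plus the whole bracketed word."""
    token = f"<{word}>"  # '<' and '>' mark the word boundaries
    grams = {token}
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return sorted(grams)

# Restrict to 3-grams to keep the example small
grams = subwords("where", n_min=3, n_max=3)
print(grams)

# Hypothetical subword dictionary and subword embedding table
index = {g: i for i, g in enumerate(grams)}
embed_z = nn.Embedding(len(index), 4)

# The center word vector is the sum of its subword vectors
ids = torch.tensor([index[g] for g in grams])
v_where = embed_z(ids).sum(dim=0)  # shape (4,)
print(v_where.shape)
```

Words that share subwords, such as a common suffix, then share part of their vector sum, which is how the subword approach carries information across related word forms.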
