The Penn Tree Bank (PTB) is a widely used corpus in natural language processing, sampled from Wall Street Journal articles. It is typically divided into training, validation, and test sets. When formatting the dataset for word embedding models, each line represents a sentence with words separated by spaces, allowing individual words to be extracted and processed as discrete tokens. Notably, the original dataset explicitly contains `<unk>` tokens to represent rare or unknown words.
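As a minimal sketch of the line-per-sentence format described above, the snippet below splits a small in-memory string into token lists. The sample text and the `read_sentences` helper are illustrative assumptions, not part of the PTB distribution or any library.

```python
# Hypothetical stand-in for a few lines of a PTB-style file:
# one sentence per line, words separated by single spaces.
raw_text = """aer banknote berlitz
pierre <unk> N years old will join the board
mr. <unk> is chairman of <unk> n.v."""


def read_sentences(text):
    """Split text into sentences (lines) and each sentence into word tokens."""
    return [line.split() for line in text.split("\n") if line.strip()]


sentences = read_sentences(raw_text)
num_tokens = sum(len(sentence) for sentence in sentences)

print(len(sentences), "sentences,", num_tokens, "tokens")
print(sentences[1])
```

Note that `<unk>` survives as an ordinary token here, since whitespace splitting treats it like any other word.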


The word2vec tool maps each word to a fixed-length vector to effectively express similarity and analogy relationships among different words. It comprises two distinct models: the skip-gram model and the continuous bag of words (CBOW) model. To learn semantically meaningful representations, word2vec relies on conditional probabilities, specifically predicting words using their surrounding context in a corpus. Because this supervision is extracted directly from the unlabeled data, word2vec acts as a self-supervised model.

word2vec

Dive into Deep Learning

The classifier used by word2vec is a binary logistic regression, which applies the sigmoid function.
Given a pair of words $$(w, c)$$, where $$w$$ is the target word and $$c$$ is a candidate context word, and $$c \cdot w$$ is the dot product of their embedding vectors:
The probability that $$c$$ is a context word of $$w$$:
$$
P(+ \mid w, c) = \sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)}
$$
The probability that $$c$$ is not a context word of $$w$$:
$$
P(- \mid w, c) = 1 - P(+ \mid w, c) = \sigma(-c \cdot w) = \frac{1}{1 + \exp(c \cdot w)}
$$
A target word usually has several context words $$c_1, \dots, c_L$$. Assuming they are independent of one another, the probabilities multiply:
$$
P(+ \mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(c_i \cdot w) = \prod_{i=1}^{L} \frac{1}{1 + \exp(-c_i \cdot w)}
$$
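The three formulas above can be checked numerically with a short sketch. The vectors below are made-up values chosen only for illustration, and the helper names (`p_positive`, `p_negative`, `p_window`) are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def p_positive(w, c):
    """P(+ | w, c): probability that c is a context word of w."""
    return sigmoid(dot(c, w))


def p_negative(w, c):
    """P(- | w, c) = 1 - P(+ | w, c) = sigmoid(-c . w)."""
    return sigmoid(-dot(c, w))


def p_window(w, contexts):
    """P(+ | w, c_1..c_L): product over context words (independence assumed)."""
    prob = 1.0
    for c in contexts:
        prob *= p_positive(w, c)
    return prob


# Made-up embedding vectors, for illustration only.
w = [0.5, -1.0, 2.0]
c1 = [1.0, 0.5, 0.25]
c2 = [0.0, 0.0, 0.0]

print(p_positive(w, c1), p_negative(w, c1))
print(p_window(w, [c1, c2]))
```

A zero dot product gives a probability of exactly 0.5, which is why `c2` halves the window probability in this example.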

Classifier for word2vec

- The learning algorithm takes a corpus of text as input to learn the skip-gram embeddings.
- A vocabulary of size N is also chosen.
- For each word, the skip-gram algorithm learns two embeddings: a target embedding and a context embedding.
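One way to picture the two embeddings per word is as two separate lookup tables with one row per vocabulary entry. The sketch below only sets up randomly initialised tables; the toy vocabulary, the dimension, and the initialisation range are assumptions, and no training is performed.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy vocabulary of size N = 4 and embedding dimension 3.
vocab = ["the", "crane", "is", "flying"]
embed_dim = 3
word_to_idx = {word: i for i, word in enumerate(vocab)}


def init_table(num_rows, dim):
    """Create a num_rows x dim table of small random values."""
    return [[random.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(num_rows)]


# Each word gets one row in each table.
target_emb = init_table(len(vocab), embed_dim)   # used when the word is the target
context_emb = init_table(len(vocab), embed_dim)  # used when the word is a context word

i = word_to_idx["crane"]
print("target vector: ", target_emb[i])
print("context vector:", context_emb[i])
```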

Learning skip-gram embeddings


Other available static embeddings include:
- fastText
- GloVe

Other kinds of static embeddings

The skip-gram model is one of the two primary architectures contained within the word2vec tool. It operates on the core assumption that a specific word can be utilized to generate its surrounding context words within a text sequence. By relying on conditional probabilities to predict these context words from a central word in an unlabeled text corpus, it functions as a self-supervised model to generate semantically meaningful, fixed-length word representations.
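To make the "center word generates its context" idea concrete, the sketch below enumerates (center, context) training pairs from a token sequence. The `skipgram_pairs` helper and the fixed symmetric window are assumptions for illustration; real implementations often sample the window size and subsample frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in tokens.

    Each center word is paired with every other word that lies at most
    `window` positions to its left or right.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "man", "loves", "his", "son"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is a positive example that the model should assign a high conditional probability to.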

skip-gram

The continuous bag of words (CBOW) is one of the two core models that make up the word2vec tool. It operates on the foundational assumption that a center word is generated based on its surrounding context words. Functioning as a self-supervised model, it learns semantically meaningful word representations by utilizing conditional probabilities to predict a target word from its surrounding context in unlabeled corpora.
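The reversed direction of prediction in CBOW can be sketched by grouping the surrounding words as the input and treating the center word as the target. The `cbow_examples` helper and the fixed window are hypothetical choices for illustration.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples for every position in tokens.

    The context is the list of words within `window` positions on either
    side of the center word; the center word is the prediction target.
    """
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        context = left + right
        if context:  # skip degenerate one-word sequences
            examples.append((context, center))
    return examples


examples = cbow_examples(["the", "man", "loves", "his", "son"], window=1)
for context, center in examples:
    print(context, "->", center)
```

Compared with skip-gram, each position yields one example with several inputs rather than several single-word pairs.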

Continuous Bag of Words (CBOW)

Penn Tree Bank (PTB) Dataset

A fundamental preprocessing step in natural language processing involves constructing a vocabulary from a given text corpus. This procedure identifies the unique tokens present in the dataset and establishes a mapping for them. To limit the vocabulary size and manage data sparsity, words that occur less frequently than a specified minimum threshold are excluded from the primary vocabulary and are instead mapped to a common placeholder.
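A minimal sketch of this procedure, assuming a frequency threshold named `min_freq` and the placeholder `<unk>` at index 0 (both are illustrative choices, not fixed by the text):

```python
from collections import Counter


def build_vocab(sentences, min_freq=2, unk="<unk>"):
    """Map each sufficiently frequent token to an integer index.

    Tokens seen fewer than `min_freq` times are left out; they are later
    mapped to the index of the `unk` placeholder, which is always index 0.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq and tok != unk)
    idx_to_token = [unk] + kept
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


def encode(tokens, token_to_idx, unk="<unk>"):
    """Convert tokens to indices, falling back to the placeholder index."""
    return [token_to_idx.get(tok, token_to_idx[unk]) for tok in tokens]


# Toy corpus, for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
idx_to_token, token_to_idx = build_vocab(corpus, min_freq=2)

print(idx_to_token)
print(encode(["the", "dog", "ran"], token_to_idx))
```

In this toy corpus "dog" and "ran" each occur once, so both fall below the threshold and share the placeholder index.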

Building a Vocabulary

A context-independent representation of a token $$x$$ is formally defined as a function $$f(x)$$ that takes only $$x$$ as its input, completely ignoring the surrounding text. Consequently, models such as word2vec and GloVe assign a single, fixed pretrained vector to a word regardless of the context in which it appears. This approach has notable limitations when dealing with polysemy and complex semantics; for example, the word "crane" has entirely different meanings in the phrases "a crane is flying" and "a crane driver came", yet a context-independent model assigns the exact same mathematical representation to both instances.
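The "crane" example can be illustrated with a plain lookup table standing in for a pretrained static embedding. The vectors below are made-up numbers, and `embed` is a hypothetical helper; the point is only that $$f(x)$$ depends on the token alone.

```python
# Made-up 3-dimensional vectors standing in for pretrained static embeddings.
embeddings = {
    "a": [0.1, 0.0, 0.2],
    "crane": [0.2, -0.1, 0.7],
    "is": [0.0, 0.3, 0.1],
    "flying": [0.5, 0.4, -0.2],
    "driver": [-0.3, 0.6, 0.1],
    "came": [0.2, 0.2, 0.0],
}


def embed(sentence):
    """Look up each token's vector; the neighbouring words are never consulted."""
    return [embeddings[token] for token in sentence.split()]


bird_sentence = embed("a crane is flying")
machine_sentence = embed("a crane driver came")

# "crane" is the second token in both sentences.
same_vector = bird_sentence[1] == machine_sentence[1]
print("same vector for both senses of 'crane':", same_vector)
```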
