Concept
Tricks Used in Word2Vec
This node discusses two important tricks used in word2vec and explains why each of them is needed.
1. Hierarchical Softmax. Word2vec uses the hidden layer, obtained by multiplying the input with the weight matrix W, as the representation vector of a word; in the CBOW model the hidden layer is the average of the context word vectors. A naive model would then compute an output score for every word in the vocabulary and normalize over all of them, which is very time-consuming for large vocabularies. Hierarchical softmax instead organizes the vocabulary as a Huffman tree. At every inner node we define a logistic regression that decides which branch to follow (1 or 0 meaning left or right), and the probability of a word is the product of the branch probabilities along the path from the root to that word's leaf. For each training sample we compute partial derivatives only for the inner nodes on that path and update their parameters accordingly. This reduces the number of operations per word from O(V) to O(log V), where V is the vocabulary size. A minimal sketch appears after this list.
2. Negative Sampling. This trick addresses the cost of sampling and deals with very frequent words that carry little meaning on their own, such as "the". Each word in the training samples has a probability of having its word pairs discarded, and this probability grows with the word's frequency (subsampling of frequent words). In addition, for every positive sample, a few negative word pairs (typically five) are drawn at random and used in the parameter update. Interestingly, the probability of a word being selected as a negative sample also depends on its frequency: in the original implementation it is proportional to the unigram frequency raised to the 3/4 power. See the second sketch after this list.
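Below is a minimal sketch of the hierarchical-softmax computation and update from item 1, not the actual word2vec C implementation. The hidden vector `h`, the example path, and all sizes and the learning rate are made-up toy values; real word2vec builds the Huffman tree and the paths from corpus word counts.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical setup: `h` is the hidden vector for the current context
# (e.g., the average of the context word vectors in CBOW). Each target
# word has a root-to-leaf path through the Huffman tree, given here as
# (inner_node_index, direction) pairs with direction 1 = left, 0 = right.
# `node_vecs` holds one logistic-regression parameter vector per inner node.
rng = np.random.default_rng(0)
dim = 8
n_inner_nodes = 15
node_vecs = rng.normal(scale=0.1, size=(n_inner_nodes, dim))
h = rng.normal(scale=0.1, size=dim)

# Toy path for one word: 3 inner-node decisions instead of a softmax
# normalization over the whole vocabulary.
path = [(0, 1), (1, 0), (4, 1)]

def path_probability(h, path, node_vecs):
    """P(word) = product over the path of sigmoid branch probabilities."""
    p = 1.0
    for node, direction in path:
        s = sigmoid(node_vecs[node] @ h)
        p *= s if direction == 1 else (1.0 - s)
    return p

def train_step(h, path, node_vecs, lr=0.025):
    """One SGD step: only the inner nodes on this word's path are touched."""
    grad_h = np.zeros_like(h)
    for node, direction in path:
        s = sigmoid(node_vecs[node] @ h)
        err = s - direction                 # gradient of the log-loss
        grad_h += err * node_vecs[node]
        node_vecs[node] -= lr * err * h     # update the inner node
    return grad_h                           # used to update input vectors

print("P(word | context) =", path_probability(h, path, node_vecs))
grad = train_step(h, path, node_vecs)
```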
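And a matching sketch of the negative-sampling tricks from item 2: subsampling of frequent words, drawing negatives from a frequency-based noise distribution (with the 3/4 power used in the original implementation), and one skip-gram-with-negative-sampling update. The toy word counts, vector dimension, subsampling threshold, and learning rate are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical corpus statistics: counts for word ids 0..4.
counts = np.array([1000, 300, 120, 50, 10], dtype=float)
freqs = counts / counts.sum()

# 1) Subsampling of frequent words: an occurrence of word w is discarded
#    with probability 1 - sqrt(t / f(w)); the paper suggests t around 1e-5,
#    a larger t is used here so the toy numbers are visible.
t = 1e-2
p_discard = np.maximum(0.0, 1.0 - np.sqrt(t / freqs))
print("discard probs:", np.round(p_discard, 3))

# 2) Noise distribution for negatives: P(w) proportional to f(w)^(3/4).
noise = freqs ** 0.75
noise /= noise.sum()

def draw_negatives(k=5, exclude=None):
    """Draw k negative word ids from the noise distribution."""
    negs = []
    while len(negs) < k:
        w = rng.choice(len(noise), p=noise)
        if w != exclude:
            negs.append(w)
    return negs

# 3) One update for a (center, context) pair with k = 5 negatives.
dim = 8
W_in = rng.normal(scale=0.1, size=(5, dim))   # input (center) vectors
W_out = rng.normal(scale=0.1, size=(5, dim))  # output (context) vectors

def sgns_step(center, context, lr=0.025, k=5):
    h = W_in[center]
    grad_h = np.zeros_like(h)
    # The positive pair gets label 1; each negative pair gets label 0.
    pairs = [(context, 1.0)] + [(n, 0.0) for n in draw_negatives(k, exclude=context)]
    for w, label in pairs:
        err = sigmoid(W_out[w] @ h) - label
        grad_h += err * W_out[w]
        W_out[w] -= lr * err * h
    W_in[center] -= lr * grad_h

sgns_step(center=0, context=2)
```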
Updated 2021-03-14
Tags
Data Science