  1. main()
  • vocab = (struct vocab_word *)calloc(vocab_max_size, sizeof(struct vocab_word));

allocate memory to hold the vocabulary of the corpus being processed. calloc zero-initializes the block, which saves a separate call to memset.

  • expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));

for (i = 0; i < EXP_TABLE_SIZE; i++) {
  expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); // Precompute the exp() table
  expTable[i] = expTable[i] / (expTable[i] + 1); // Precompute f(x) = x / (x + 1)
}

prepare a lookup table so values of the sigmoid function can be fetched quickly during training instead of calling exp() every time.
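
Given this table, a sigmoid value for any f in (-MAX_EXP, MAX_EXP) can be fetched with a single index computation, using the same scaling expression as the training loop. A minimal sketch; the helper name sigmoid_lookup is not in the original, and real, expTable, MAX_EXP and EXP_TABLE_SIZE are the globals/macros defined in word2vec.c:

// hypothetical helper illustrating how the precomputed table is indexed
real sigmoid_lookup(real f) {
  if (f >= MAX_EXP) return 1;  // saturate at the boundaries
  if (f <= -MAX_EXP) return 0;
  // map f from (-MAX_EXP, MAX_EXP) onto [0, EXP_TABLE_SIZE)
  return expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
}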

2. TrainModel()

  • if a vocabulary file has been prepared, read it with ReadVocab(); otherwise build it from the corpus with LearnVocabFromTrainFile()
  • then initialize the parameters by calling InitNet()
  • spawn the worker threads and start training:

pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t));

for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a);
for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL);
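
Each thread's index is passed through the void * argument of pthread_create and cast back inside the worker, which then seeks to its own slice of the training file. Roughly (a paraphrase, not the exact source; train_file, file_size and num_threads are globals in word2vec.c):

void *TrainModelThread(void *id_ptr) {
  long long id = (long long)id_ptr; // thread index passed via pthread_create
  FILE *fi = fopen(train_file, "rb");
  // each thread starts reading at its own offset, splitting the file evenly
  fseek(fi, file_size / (long long)num_threads * id, SEEK_SET);
  /* ... training loop ... */
  fclose(fi);
  pthread_exit(NULL);
}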

TrainModelThread()

  • if sentence_length == 0, read a sentence from the file into this thread's buffer; frequent words are subsampled and word_count is incremented while reading.
  • if word_count exceeds the share of words this thread is responsible for (roughly train_words / num_threads), break out of the training loop.
  • choose the effective window size uniformly at random from {1, ..., window} (via b = next_random % window, with window = 5 by default), so that context words closer to the current word are used more often than distant ones.
  • sentence_position indexes the current word (word = sentence[sentence_position]); last_word = sentence[c] is a word from its context window, and l1 holds the start index of last_word's input embedding in syn0 (in skip-gram the context word's vector is the one used to predict word).
  • consider negative sampling:

if (negative > 0) for (d = 0; d < negative + 1; d++) {
  if (d == 0) {
    target = word;
    label = 1;
  } else {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
    if (target == word) continue;
    label = 0;
  }
  l2 = target * layer1_size;
  f = 0;
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}

d = 0 is the positive example, so label = 1

negative targets are drawn from table, which encodes the unigram distribution raised to the power 3/4
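
The table is built once in InitUnigramTable(): every word gets a number of slots proportional to cn^0.75, so sampling a uniform slot index yields the smoothed unigram distribution. A sketch along the lines of the reference implementation (table, table_size, vocab and vocab_size are the globals in word2vec.c):

void InitUnigramTable() {
  long long a; int i = 0;
  double power = 0.75, train_words_pow = 0, d1;
  table = (int *)malloc(table_size * sizeof(int));
  for (a = 0; a < vocab_size; a++) train_words_pow += pow(vocab[a].cn, power);
  d1 = pow(vocab[i].cn, power) / train_words_pow; // cumulative share of word i
  for (a = 0; a < table_size; a++) {
    table[a] = i;                                 // slot a belongs to word i
    if (a / (double)table_size > d1) {            // word i's share is used up, move on
      i++;
      d1 += pow(vocab[i].cn, power) / train_words_pow;
    }
    if (i >= vocab_size) i = vocab_size - 1;      // guard against rounding overshoot
  }
}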

l2 is the start index of the target word's output embedding in syn1neg

g is the learning rate alpha multiplied by the prediction error (label - sigmoid(f)), i.e., the gradient of the objective with respect to f

the gradient is applied to the output embeddings (syn1neg) immediately, while the contributions for the input embedding are accumulated in neu1e across all targets and applied to syn0 only after the loop
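
After the loop over d, the accumulated gradient is applied to the context word's input embedding in one step, essentially:

// apply the accumulated gradient to the input embedding of last_word (deferred update)
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];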

  • the learning rate decays as training progresses:

alpha = starting_alpha * (1 - word_count_actual / (real)(train_words + 1));
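
If I recall the reference implementation correctly, alpha is also clamped so it never decays all the way to zero, roughly:

// keep the learning rate from shrinking to (effectively) zero late in training
if (alpha < starting_alpha * 0.0001) alpha = starting_alpha * 0.0001;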

3. ReadVocab()

  • read each word and add it to the vocabulary with AddWordToVocab()
  • then the vocabulary is sorted with SortVocab()
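
The vocabulary file is plain text with one "word count" pair per line, so the reading loop is conceptually something like the simplified sketch below (the real code uses its own ReadWord() helper and also rebuilds the hash table; MAX_STRING and read_vocab_file come from word2vec.c):

char word[MAX_STRING];
long long cn;
FILE *fin = fopen(read_vocab_file, "rb");
while (fscanf(fin, "%99s %lld", word, &cn) == 2) { // 99 = MAX_STRING - 1
  int a = AddWordToVocab(word); // returns the word's index in vocab
  vocab[a].cn = cn;
}
fclose(fin);
SortVocab();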

4. AddWordToVocab()

  • if the vocabulary is about to outgrow the currently allocated array, grow it with realloc:

if (vocab_size + 2 >= vocab_max_size) {
  vocab_max_size += 1000;
  vocab = (struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word));
}

realloc preserves the original contents, but the newly added region is uninitialized.
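
One caveat worth noting: if realloc fails it returns NULL but leaves the original block allocated, so assigning the result straight back to vocab (as the original does) would leak it on failure. A safer idiom, purely for illustration:

// safer realloc pattern (illustrative; not what word2vec.c does)
struct vocab_word *tmp = (struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word));
if (tmp == NULL) { printf("Memory allocation failed\n"); exit(1); }
vocab = tmp;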

5. SortVocab()

  • sort the words by their number of occurrences, in descending order (a sketch of the comparator is shown below)
  • the hash table is then rebuilt, and words occurring fewer than min_count times are discarded
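
The sort is a plain qsort over the vocab array with a comparator on cn. The original comparator simply returns the difference of the two counts; the sketch below is an overflow-safe variant of the same idea:

// compare by count, larger counts first (descending order)
int VocabCompare(const void *a, const void *b) {
  long long diff = ((struct vocab_word *)b)->cn - ((struct vocab_word *)a)->cn;
  return (diff > 0) - (diff < 0); // avoid truncating a long long difference to int
}
// used roughly as: qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare);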

6. InitNet()

  • a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real));

allocate memory for the word (input) embeddings syn0; the default dimensionality is 100 (layer1_size = 100).

int posix_memalign(void **memptr, size_t alignment, size_t size);

it is declared in stdlib.h; it allocates size bytes aligned on a boundary given by alignment and returns a pointer to the allocated memory through memptr.
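
Note that posix_memalign reports failure through its return value (non-zero on error), and POSIX leaves the contents of memptr unspecified in that case, so a portable error check inspects the return code rather than comparing the pointer to NULL:

// portable failure check for posix_memalign: look at the return value
if (posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real)) != 0) {
  printf("Memory allocation failed\n");
  exit(1);
}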

  • if (negative > 0) {
      a = posix_memalign((void **)&syn1neg, 128, (long long)vocab_size * layer1_size * sizeof(real));
      if (syn1neg == NULL) {printf("Memory allocation failed\n"); exit(1);}
      for (b = 0; b < layer1_size; b++) for (a = 0; a < vocab_size; a++)
        syn1neg[a * layer1_size + b] = 0;
    }

in the negative-sampling case every word has a second, output-side embedding that is distinct from its input embedding: dot products are taken against syn1neg rather than syn0, and syn1neg is initialized to zero.
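
The input embeddings in syn0, by contrast, are initialized to small random values on the order of 1/layer1_size, along these lines (the exact random number generator differs between versions of the source):

// randomly initialize the input embeddings; the output embeddings (syn1neg) stay at zero
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
  next_random = next_random * (unsigned long long)25214903917 + 11;
  syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
}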

Data Structures:

struct vocab_word {
  long long cn;               // number of occurrences of the word
  int *point;                 // path of node indices in the Huffman tree (hierarchical softmax)
  char *word, *code, codelen; // the word string, its Huffman code, and the code length
};

cn records the number of occurrences of the word in the training corpus