handout document

note 1: all vectors are in column format. for example, each one-hot vector is in $\mathbf{R}^{|V|\times 1}$ and each word vector is in $\mathbf{R}^{d\times 1}$, where $|V|$ is the size of the vocabulary and $d$ is the dimension of the word vectors.

note 2: $v_c$ and $u_o$ are treated as individual vectors in the derivations below. the results may need adjustment for mini-batching, which stacks word vectors into a matrix.
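the derivations below appear to assume the standard naive-softmax setup for a center word $c$ and an observed outside word $o$ (this is my reading of the handout, stated here for reference):

$$\hat{y}_w = P(O = w \mid C = c) = \frac{\exp(u_w^\top v_c)}{\sum_{w' \in V} \exp(u_{w'}^\top v_c)}, \qquad J_{\text{naive-softmax}}(v_c, o, U) = -\log \hat{y}_o,$$

where $v_c \in \mathbf{R}^{d\times 1}$ is the center word vector, the $u_w \in \mathbf{R}^{d\times 1}$ are the outside word vectors, and $U \in \mathbf{R}^{d\times |V|}$ stacks the $u_w$ as columns (the column convention is an assumption consistent with note 1).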

(a)
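if this part is the usual exercise of showing that the naive-softmax loss equals the cross-entropy between the one-hot vector $y$ (with $y_o = 1$) and the prediction $\hat{y}$, the identity would be:

$$-\sum_{w \in V} y_w \log \hat{y}_w = -\log \hat{y}_o = J_{\text{naive-softmax}}(v_c, o, U),$$

since every term with $w \neq o$ vanishes.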

(b)

notice that
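presumably this part is the gradient of the naive-softmax loss with respect to $v_c$; under the setup above, the standard result is:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = \sum_{w \in V} \hat{y}_w\, u_w - u_o = U(\hat{y} - y).$$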

(c)

when $w=o$

when $w\not=o$

in summary,
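the case split on $w = o$ versus $w \neq o$ suggests this part is the gradient with respect to the outside vectors $u_w$; assuming the naive-softmax loss above, the standard results are:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial u_w} =
\begin{cases}
(\hat{y}_o - 1)\, v_c & w = o,\\
\hat{y}_w\, v_c & w \neq o,
\end{cases}
\qquad\text{or, stacked over all $w$,}\qquad
\frac{\partial J_{\text{naive-softmax}}}{\partial U} = v_c\, (\hat{y} - y)^\top.$$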

(d)

let

then

so

which is
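if this part is the usual derivative-of-the-sigmoid exercise (which the negative-sampling gradients in part (e) rely on), the computation being sketched is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\big(1 - \sigma(x)\big).$$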

(e)

first, we compute $\frac{\partial J}{\partial v_c}$.

similarly,

therefore,
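assuming the standard negative-sampling loss with $K$ negative samples $w_1, \dots, w_K$ drawn for the pair $(c, o)$, the loss and the gradient computed here would be:

$$J_{\text{neg-sample}}(v_c, o, U) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c),$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = \big(\sigma(u_o^\top v_c) - 1\big)\, u_o + \sum_{k=1}^{K} \big(1 - \sigma(-u_k^\top v_c)\big)\, u_k.$$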

the time complexity of computing $\frac{\partial J}{\partial v_c}$ under negative sampling is $O(dK)$, while under the softmax loss it is $O(d|V|)$, which is much larger since $K \ll |V|$.

in the same way, we can get
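again assuming the standard negative-sampling loss, these would be:

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_o} = \big(\sigma(u_o^\top v_c) - 1\big)\, v_c, \qquad \frac{\partial J_{\text{neg-sample}}}{\partial u_k} = \big(1 - \sigma(-u_k^\top v_c)\big)\, v_c, \quad k = 1, \dots, K.$$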

the time complexity of computing the gradient for a single $u$ is $O(d)$, and only the $K + 1$ outside vectors involved need gradients at all, whereas the softmax loss produces gradients for all $|V|$ outside vectors at a cost of $O(d|V|)$, which is much larger.

in summary, the negative-sampling version of word2vec is much more computationally efficient than the softmax version.
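as a concrete illustration of the formulas and the $O(dK)$ cost discussed above, here is a minimal NumPy sketch (the function name, the use of flat 1-D arrays instead of $d \times 1$ columns, and the array shapes are my own choices, not from the handout):

```python
import numpy as np

def sigmoid(x):
    # logistic function sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss_and_grads(v_c, u_o, U_neg):
    """Negative-sampling loss and gradients for one (center, outside) pair.

    v_c   : (d,)   center word vector
    u_o   : (d,)   true outside word vector
    U_neg : (K, d) vectors of the K sampled negative words
    """
    s_o = sigmoid(u_o @ v_c)       # sigma(u_o^T v_c), scalar
    s_k = sigmoid(-U_neg @ v_c)    # sigma(-u_k^T v_c), shape (K,)

    loss = -np.log(s_o) - np.sum(np.log(s_k))

    # gradients as derived above; every operation costs O(dK) or less
    grad_v_c   = (s_o - 1.0) * u_o + (1.0 - s_k) @ U_neg   # (d,)
    grad_u_o   = (s_o - 1.0) * v_c                          # (d,)
    grad_U_neg = np.outer(1.0 - s_k, v_c)                   # (K, d)
    return loss, grad_v_c, grad_u_o, grad_U_neg

# example usage with random vectors (d = 5, K = 3)
rng = np.random.default_rng(0)
loss, gv, go, gU = neg_sampling_loss_and_grads(
    rng.normal(size=5), rng.normal(size=5), rng.normal(size=(3, 5)))
```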

(f)

(i)

since

we can conclude that
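presumably the point is that the skip-gram loss for a window is a sum of per-outside-word losses, so its gradient with respect to $U$ is the sum of the individual gradients:

$$J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m \\ j \neq 0}} J(v_c, w_{t+j}, U)
\quad\Longrightarrow\quad
\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}.$$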

(ii)

so
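by the same linearity argument, the gradient with respect to the center vector (again assuming the standard ordering of the sub-questions) would be:

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}.$$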

(iii)

for $w \neq c$, the loss depends only on $v_c$ and $U$, and not on $v_w$, so
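the conclusion here is presumably:

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_w} = 0 \qquad \text{for all } w \neq c.$$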