note 1: all vectors are in column format. for example, each one-hot vector is in $\mathbf{R}^{|V|\times 1}$ and each word vector is in $\mathbf{R}^{d\times 1}$, where $|V|$ is the vocabulary size and $d$ is the word vector dimension.
note 2: $v_c$ and $u_o$ are treated as single vectors in the derivations below. some details may need adjusting for mini-batch computation, which stacks word vectors into a matrix.
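as a minimal numpy sketch of these conventions (the sizes and variable names below are hypothetical, not taken from the assignment):

```python
import numpy as np

V_size, d = 5, 3               # hypothetical vocabulary size and embedding dimension

y = np.zeros((V_size, 1))      # one-hot vector, shape (|V|, 1), i.e. a column vector
y[2, 0] = 1.0

v_c = np.random.randn(d, 1)    # a single word vector, shape (d, 1)

# mini-batching stacks several word vectors side by side into a (d, batch) matrix,
# so formulas written for a single column may need extra transposes or sums when batched
batch = np.hstack([np.random.randn(d, 1) for _ in range(4)])
print(y.shape, v_c.shape, batch.shape)   # (5, 1) (3, 1) (3, 4)
```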
(a)
(b)
notice that
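as a sketch of the standard result for this part, assuming the naive-softmax loss $J = -\log \hat{y}_o$ with $\hat{y}_w = \frac{\exp(u_w^\top v_c)}{\sum_{w'\in V}\exp(u_{w'}^\top v_c)}$ (this notation is an assumption, not quoted from the prompt), the gradient with respect to the center vector is

$$
\frac{\partial J}{\partial v_c} = -u_o + \sum_{w\in V}\hat{y}_w\, u_w = U(\hat{y}-y),
$$

where $U$ is assumed to stack the outside vectors $u_w$ as columns and $y$ is the one-hot vector for the outside word $o$.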
(c)
when $w=o$
when $w\not=o$
in summary,
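as a hedged sketch of the standard case analysis, under the same assumed naive-softmax notation as above:

$$
\frac{\partial J}{\partial u_o} = (\hat{y}_o - 1)\, v_c, \qquad
\frac{\partial J}{\partial u_w} = \hat{y}_w\, v_c \quad (w \neq o),
$$

which can be written compactly as $\frac{\partial J}{\partial u_w} = (\hat{y}_w - y_w)\, v_c$ for every $w$.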
(d)
let
then
so
which is
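assuming this part derives the derivative of the sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$ (an assumption based on its use in part (e)), the standard computation, expressed in terms of $\sigma(x)$, is

$$
\frac{d\sigma}{dx} = \frac{e^{-x}}{(1+e^{-x})^2}
= \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}
= \sigma(x)\,\bigl(1-\sigma(x)\bigr).
$$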
(e)
first, we compute $\frac{\partial J}{\partial v_c}$.
similarly,
therefore,
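as a sketch, assuming the negative-sampling loss $J_{\text{neg}} = -\log\sigma(u_o^\top v_c) - \sum_{k=1}^{K}\log\sigma(-u_k^\top v_c)$ with $K$ negative samples $u_1,\dots,u_K$ (this notation is assumed rather than quoted), the standard result is

$$
\frac{\partial J}{\partial v_c}
= \bigl(\sigma(u_o^\top v_c)-1\bigr)\,u_o
+ \sum_{k=1}^{K}\bigl(1-\sigma(-u_k^\top v_c)\bigr)\,u_k .
$$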
the time complexity of computing $\frac{\partial J}{\partial v_c}$ with negative sampling is $O(Kd)$, where $K$ is the number of negative samples, while computing $\frac{\partial J}{\partial v_c}$ for the softmax version is $O(|V|d)$, which is much larger since $K \ll |V|$.
in the same way, we can get
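under the same assumed negative-sampling loss, the standard outside-vector gradients are

$$
\frac{\partial J}{\partial u_o} = \bigl(\sigma(u_o^\top v_c)-1\bigr)\,v_c, \qquad
\frac{\partial J}{\partial u_k} = \bigl(1-\sigma(-u_k^\top v_c)\bigr)\,v_c \quad (k=1,\dots,K).
$$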
the time complexity of computing the gradient for each $u$ is $O(d)$, and only $K+1$ outside vectors need updating, which is much cheaper than the softmax version, where the gradient with respect to the outside vectors costs $O(|V|d)$.
in summary, the negative sampling version of word2vec is much more computationally efficient than the softmax version.
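as a minimal numpy sketch of why this is cheaper (the function and variable names here are hypothetical, not from the assignment code, and 1-D arrays are used instead of column vectors for brevity): only the center vector and $K+1$ outside vectors are touched, so the cost is $O((K+1)d)$ rather than $O(|V|d)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_grads(v_c, u_o, U_neg):
    """Gradients of J = -log sigma(u_o.v_c) - sum_k log sigma(-u_k.v_c).

    v_c: (d,) center vector; u_o: (d,) true outside vector;
    U_neg: (K, d) negative-sample vectors.  Only K+1 outside vectors are used.
    """
    s_o = sigmoid(u_o @ v_c)                 # scalar
    s_k = sigmoid(-U_neg @ v_c)              # (K,)
    grad_v_c = (s_o - 1.0) * u_o + (1.0 - s_k) @ U_neg
    grad_u_o = (s_o - 1.0) * v_c
    grad_U_neg = np.outer(1.0 - s_k, v_c)    # (K, d), one row per negative sample
    return grad_v_c, grad_u_o, grad_U_neg

d, K = 8, 5
g_v, g_o, g_neg = neg_sampling_grads(np.random.randn(d),
                                     np.random.randn(d),
                                     np.random.randn(K, d))
print(g_v.shape, g_o.shape, g_neg.shape)     # (8,) (8,) (5, 8)
```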
(f)
(i)
since
we can conclude that
(ii)
so
(iii)
for $w \neq c$, $v_w$ does not appear in the skip-gram loss (which depends only on $v_c$ and $U$), so its gradient is zero.
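for reference, a sketch of the decomposition that (i)–(iii) rely on, assuming the skip-gram loss for a window is the sum of per-outside-word losses, $J_{\text{skip-gram}}(v_c, w_{t-m},\dots,w_{t+m}, U) = \sum_{-m\le j\le m,\, j\ne 0} J(v_c, w_{t+j}, U)$ (this form is assumed, not quoted):

$$
\frac{\partial J_{\text{skip-gram}}}{\partial U}
= \sum_{\substack{-m\le j\le m\\ j\ne 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}, \qquad
\frac{\partial J_{\text{skip-gram}}}{\partial v_c}
= \sum_{\substack{-m\le j\le m\\ j\ne 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}, \qquad
\frac{\partial J_{\text{skip-gram}}}{\partial v_w} = \mathbf{0} \quad (w \neq c).
$$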