note 1: all vectors are in column format. for example, each one-hot vector is in $\mathbf{R}^{|V|\times 1}$ and each word vector is in $\mathbf{R}^{d\times 1}$, where $|V|$ is the vocabulary size and $d$ is the word vector dimension.
note 2: $v_c$ and $u_o$ are treated as single vectors in the derivations below. some details may need adjusting for mini-batch computation, which stacks word vectors into a matrix.
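as a minimal numpy sketch of these conventions (the sizes and variable names below are hypothetical, not taken from the assignment):

```python
import numpy as np

V_size, d = 5, 3               # hypothetical vocabulary size and embedding dimension

y = np.zeros((V_size, 1))      # one-hot vector, shape (|V|, 1), i.e. a column vector
y[2, 0] = 1.0

v_c = np.random.randn(d, 1)    # a single word vector, shape (d, 1)

# mini-batching stacks several word vectors side by side into a (d, batch) matrix,
# so formulas written for a single column may need extra transposes or sums when batched
batch = np.hstack([np.random.randn(d, 1) for _ in range(4)])
print(y.shape, v_c.shape, batch.shape)   # (5, 1) (3, 1) (3, 4)
```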
(a)
(b)
notice that
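as a sketch of the standard result for this part, assuming the naive-softmax loss $J = -\log \hat{y}_o$ with $\hat{y}_w = \frac{\exp(u_w^\top v_c)}{\sum_{w'\in V}\exp(u_{w'}^\top v_c)}$ (this notation is an assumption, not quoted from the prompt), the gradient with respect to the center vector is

$$
\frac{\partial J}{\partial v_c} = -u_o + \sum_{w\in V}\hat{y}_w\, u_w = U(\hat{y}-y),
$$

where $U$ is assumed to stack the outside vectors $u_w$ as columns and $y$ is the one-hot vector for the outside word $o$.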
(c)
when $w=o$
when $w\not=o$
in summary,
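as a hedged sketch of the standard case analysis, under the same assumed naive-softmax notation as above:

$$
\frac{\partial J}{\partial u_o} = (\hat{y}_o - 1)\, v_c, \qquad
\frac{\partial J}{\partial u_w} = \hat{y}_w\, v_c \quad (w \neq o),
$$

which can be written compactly as $\frac{\partial J}{\partial u_w} = (\hat{y}_w - y_w)\, v_c$ for every $w$.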
(d)
let
then
so
which is
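assuming this part derives the derivative of the sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$ (an assumption based on its use in part (e)), the standard computation, expressed in terms of $\sigma(x)$, is

$$
\frac{d\sigma}{dx} = \frac{e^{-x}}{(1+e^{-x})^2}
= \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}
= \sigma(x)\,\bigl(1-\sigma(x)\bigr).
$$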
(e)
first, we compute $\frac{\partial J}{\partial v_c}$.
similarly,
therefore,
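as a sketch, assuming the negative-sampling loss $J_{\text{neg}} = -\log\sigma(u_o^\top v_c) - \sum_{k=1}^{K}\log\sigma(-u_k^\top v_c)$ with $K$ negative samples $u_1,\dots,u_K$ (this notation is assumed rather than quoted), the standard result is

$$
\frac{\partial J}{\partial v_c}
= \bigl(\sigma(u_o^\top v_c)-1\bigr)\,u_o
+ \sum_{k=1}^{K}\bigl(1-\sigma(-u_k^\top v_c)\bigr)\,u_k .
$$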
the time complexity of computing $\frac{\partial J}{\partial v_c}$ with negative sampling is $O(Kd)$, where $K$ is the number of negative samples, while computing $\frac{\partial J}{\partial v_c}$ for the softmax version is $O(|V|d)$, which is much larger since $K \ll |V|$.
in the same way, we can get
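under the same assumed negative-sampling loss, the standard outside-vector gradients are

$$
\frac{\partial J}{\partial u_o} = \bigl(\sigma(u_o^\top v_c)-1\bigr)\,v_c, \qquad
\frac{\partial J}{\partial u_k} = \bigl(1-\sigma(-u_k^\top v_c)\bigr)\,v_c \quad (k=1,\dots,K).
$$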
the time complexity of computing the gradient for each $u$ is $O(d)$, and only $K+1$ outside vectors need updating, which is much cheaper than the softmax version, where the gradient with respect to the outside vectors costs $O(|V|d)$.
in summary, the negative sampling version of word2vec is much more computationally efficient than the softmax version.
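as a minimal numpy sketch of why this is cheaper (the function and variable names here are hypothetical, not from the assignment code, and 1-D arrays are used instead of column vectors for brevity): only the center vector and $K+1$ outside vectors are touched, so the cost is $O((K+1)d)$ rather than $O(|V|d)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_grads(v_c, u_o, U_neg):
    """Gradients of J = -log sigma(u_o.v_c) - sum_k log sigma(-u_k.v_c).

    v_c: (d,) center vector; u_o: (d,) true outside vector;
    U_neg: (K, d) negative-sample vectors.  Only K+1 outside vectors are used.
    """
    s_o = sigmoid(u_o @ v_c)                 # scalar
    s_k = sigmoid(-U_neg @ v_c)              # (K,)
    grad_v_c = (s_o - 1.0) * u_o + (1.0 - s_k) @ U_neg
    grad_u_o = (s_o - 1.0) * v_c
    grad_U_neg = np.outer(1.0 - s_k, v_c)    # (K, d), one row per negative sample
    return grad_v_c, grad_u_o, grad_U_neg

d, K = 8, 5
g_v, g_o, g_neg = neg_sampling_grads(np.random.randn(d),
                                     np.random.randn(d),
                                     np.random.randn(K, d))
print(g_v.shape, g_o.shape, g_neg.shape)     # (8,) (8,) (5, 8)
```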
(f)
(i)
since
we can conclude that
(ii)
so
(iii)
for $w \neq c$, $v_w$ does not appear in the skip-gram loss (which depends only on $v_c$ and $U$), so its gradient is zero.
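for reference, a sketch of the decomposition that (i)–(iii) rely on, assuming the skip-gram loss for a window is the sum of per-outside-word losses, $J_{\text{skip-gram}}(v_c, w_{t-m},\dots,w_{t+m}, U) = \sum_{-m\le j\le m,\, j\ne 0} J(v_c, w_{t+j}, U)$ (this form is assumed, not quoted):

$$
\frac{\partial J_{\text{skip-gram}}}{\partial U}
= \sum_{\substack{-m\le j\le m\\ j\ne 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}, \qquad
\frac{\partial J_{\text{skip-gram}}}{\partial v_c}
= \sum_{\substack{-m\le j\le m\\ j\ne 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}, \qquad
\frac{\partial J_{\text{skip-gram}}}{\partial v_w} = \mathbf{0} \quad (w \neq c).
$$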