handout document

1.

(a)

i.

Gradient descent, especially stochastic gradient descent (SGD) and mini-batch SGD, usually oscillates, resulting in a “winding” path to the minimum. In traditional gradient descent, the update direction $m$ at the $n$-th iteration is simply the current gradient:

$$m_n = \nabla_n.$$

With momentum, $m$ is instead an exponentially decaying average of the gradients seen so far:

$$m_n = \beta_1\, m_{n-1} + (1 - \beta_1)\,\nabla_n.$$

Unrolling the recurrence (with $m_0 = 0$), $m_n = (1-\beta_1)\sum_{k=1}^{n}\beta_1^{\,n-k}\nabla_k$, i.e. a weighted average of $\nabla_1,\nabla_2,\cdots,\nabla_n$ that weights recent gradients most heavily. With momentum, the update direction at the current iteration stays close to the direction at the previous iteration, which “smooths” the optimization path. Viewed as physical motion, $m_n$ is the “velocity” at the $n$-th iteration, $m_{n-1}$ is the previous velocity, and $\nabla_n$ plays the role of the “acceleration”. This analogy gives gradient descent “inertia”: the update direction is more stable than simply using $\nabla_n$, which helps avoid problems like oscillation and overshooting.
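The difference is small in code. Below is a minimal NumPy sketch (the name `grad_fn` is a hypothetical gradient oracle, not something from the assignment):

```python
import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, beta1=0.9, n_iters=100):
    """Gradient descent with momentum. `grad_fn(theta)` is a hypothetical
    gradient oracle; plain SGD would update with `grad` directly."""
    m = np.zeros_like(theta)
    for _ in range(n_iters):
        grad = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * grad  # rolling weighted average of past gradients
        theta = theta - lr * m              # plain SGD would be: theta = theta - lr * grad
    return theta
```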

ii.

$v_n$ is a rolling average of the squared gradients, so it tracks how large a parameter's gradients have been: the larger the gradients, the larger $v_n$, and the smaller that parameter's update at an iteration. By dividing the update by $\sqrt{v_n}$, Adam automatically adjusts the effective learning rate of each parameter during gradient descent. In some situations different parameters receive gradients of very different scales, especially when the inputs are not normalized before training. Parameters with large gradients then diverge if the learning rate is too big, while a learning rate small enough to keep them stable makes the remaining parameters converge too slowly. The adaptive learning rate resolves this by scaling the step per parameter, which speeds up convergence.
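As an illustrative NumPy sketch of this per-parameter scaling (bias-correction terms are omitted for brevity, and the variable names are only for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update. Parameters whose gradients have been large
    accumulate a large v and therefore take smaller effective steps."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: momentum
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: squared-gradient scale
    theta = theta - lr * m / (np.sqrt(v) + eps)  # per-parameter effective learning rate
    return theta, m, v
```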

(b)

i.

Since $[h_{\text{drop}}]_i = \gamma\, d_i\, h_i$ with $d_i \sim \mathrm{Bernoulli}(1 - p_{\text{drop}})$, each entry is kept (and scaled by $\gamma$) with probability $1 - p_{\text{drop}}$ and zeroed with probability $p_{\text{drop}}$, so

$$\mathbb{E}_{p_{\text{drop}}}\big[h_{\text{drop}}\big]_i = \gamma\,(1 - p_{\text{drop}})\,h_i + \gamma\, p_{\text{drop}} \cdot 0 = \gamma\,(1 - p_{\text{drop}})\,h_i,$$

while we require

$$\mathbb{E}_{p_{\text{drop}}}\big[h_{\text{drop}}\big]_i = h_i,$$

so

$$\gamma = \frac{1}{1 - p_{\text{drop}}}.$$
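A quick numerical sanity check of this value of $\gamma$ (a NumPy sketch; the sizes, the drop probability, and the random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.3
gamma = 1.0 / (1.0 - p_drop)                      # the scaling constant derived above
h = rng.normal(size=5)

# Average many independently sampled dropout masks: E[h_drop] should recover h.
masks = rng.random((200_000, h.size)) >= p_drop   # keep each entry with prob 1 - p_drop
h_drop_mean = (gamma * masks * h).mean(axis=0)

print(np.abs(h_drop_mean - h).max())              # small (~1e-2 or less): matches h up to sampling noise
```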

ii.

Dropout is a regularization technique that adds noise to layers during training: randomly zeroing units effectively trains an “ensemble” of many smaller sub-networks, which makes the model more robust and less prone to overfitting. At test time, however, we want the full network's deterministic prediction; injecting noise during evaluation only adds variance to the output and reduces accuracy.
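For example, with PyTorch's `nn.Dropout`, noise is injected only in training mode and the module becomes the identity in evaluation mode (a minimal sketch; the input tensor is arbitrary):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # inverted dropout: kept units are scaled by 1 / (1 - p) during training
x = torch.ones(8)

drop.train()               # training mode: units are randomly zeroed, survivors scaled by 2.0
print(drop(x))             # e.g. tensor([2., 0., 2., 2., 0., 2., 0., 2.]) — random each call

drop.eval()                # evaluation mode: dropout is a no-op, no noise is added
print(drop(x))             # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```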

2.

(a)

| Stack | Buffer | New dependency | Transition |
|---|---|---|---|
| [ROOT] | [I, parsed, this, sentence, correctly] | | Initial Configuration |
| [ROOT, I] | [parsed, this, sentence, correctly] | | SHIFT |
| [ROOT, I, parsed] | [this, sentence, correctly] | | SHIFT |
| [ROOT, parsed] | [this, sentence, correctly] | parsed → I | LEFT-ARC |
| [ROOT, parsed, this] | [sentence, correctly] | | SHIFT |
| [ROOT, parsed, this, sentence] | [correctly] | | SHIFT |
| [ROOT, parsed, sentence] | [correctly] | sentence → this | LEFT-ARC |
| [ROOT, parsed] | [correctly] | parsed → sentence | RIGHT-ARC |
| [ROOT, parsed, correctly] | [] | | SHIFT |
| [ROOT, parsed] | [] | parsed → correctly | RIGHT-ARC |
| [ROOT] | [] | ROOT → parsed | RIGHT-ARC |
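The sequence above can be replayed mechanically. Below is a minimal sketch of the three transitions (the function and transition names are illustrative, not taken from the assignment's starter code):

```python
def apply_transitions(words, transitions):
    """Replay a sequence of transition-parser moves and collect the dependencies.
    Transitions: "S" = SHIFT, "LA" = LEFT-ARC, "RA" = RIGHT-ARC."""
    stack, buffer, deps = ["ROOT"], list(words), []
    for t in transitions:
        if t == "S":                       # SHIFT: move the next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif t == "LA":                    # LEFT-ARC: second-from-top becomes dependent of top
            dependent = stack.pop(-2)
            deps.append((stack[-1], dependent))
        elif t == "RA":                    # RIGHT-ARC: top becomes dependent of second-from-top
            dependent = stack.pop()
            deps.append((stack[-1], dependent))
    return deps

words = ["I", "parsed", "this", "sentence", "correctly"]
transitions = ["S", "S", "LA", "S", "S", "LA", "RA", "S", "RA", "RA"]
print(apply_transitions(words, transitions))
# [('parsed', 'I'), ('sentence', 'this'), ('parsed', 'sentence'),
#  ('parsed', 'correctly'), ('ROOT', 'parsed')]
print(len(transitions))  # 10 transitions = 2 × 5 words, cf. part (b)
```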

(b)

  1. Each word must be shifted onto the stack exactly once: n SHIFT transitions.
  2. Each word must also be removed from the stack exactly once, as the dependent of a LEFT-ARC or RIGHT-ARC: n arc transitions.

So a sentence of n words takes 2n transitions in total, which is linear in the sentence length.

(f)

i.

  • Error type: Verb Phrase Attachment Error
  • Incorrect dependency: wedding → fearing
  • Correct dependency: heading → fearing

ii.

  • Error type: Coordination Phrase Attachment Error
  • Incorrect dependency: rescue → and
  • Correct dependency: rescue → rush

iii.

  • Error type: Prepositional Phrase Attachment Error
  • Incorrect dependency: Midland → named
  • Correct dependency: Midland → guy

iv.

  • Error type: Modifier Phrase Attachment Error
  • Incorrect dependency: elements → most
  • Correct dependency: crucial → most