1.
(a)
i.
The process of gradient descent, especially stochastic gradient descent (SGD) and mini-batch SGD, usually oscillates, resulting in a “winding” path to the minimum. In traditional gradient descent, the update direction $m$ at the $n$-th iteration is simply the current gradient:

$$m_n = \nabla_n$$

While with momentum, $m$ is:

$$m_n = \beta_1 m_{n-1} + (1 - \beta_1)\nabla_n$$

Unrolling this recurrence, $m_n$ is a weighted average of $\nabla_1, \nabla_2, \cdots, \nabla_n$. With momentum, the gradient direction at the current iteration stays close to the direction at the previous iteration, which “smooths” the gradient path. Viewed as physical motion, $\nabla_n$ acts as the “acceleration” applied to the current “velocity” $m_{n-1}$. This gives gradient descent “inertia”, so the update direction is more stable than simply setting it to $\nabla_n$, which helps avoid problems such as oscillation and overshooting.
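As a rough illustration (a minimal NumPy sketch, not the assignment's code; `grad_fn`, `alpha`, and `beta1` are assumed placeholders), the momentum update can be written as:

```python
import numpy as np

def sgd_with_momentum(theta, grad_fn, alpha=0.01, beta1=0.9, n_steps=100):
    """Gradient descent with momentum: m is a running weighted average of
    past gradients, so the update direction changes smoothly between steps."""
    m = np.zeros_like(theta)
    for _ in range(n_steps):
        grad = grad_fn(theta)                 # current gradient, nabla_n
        m = beta1 * m + (1 - beta1) * grad    # weighted average of past gradients
        theta = theta - alpha * m             # step along the smoothed direction
    return theta
```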
ii.
$v_n$ is positively related to the magnitude of $\nabla_n$: the larger a parameter's gradients, the larger its $v_n$, and the smaller its update at each iteration. By dividing by $\sqrt{v_n}$, Adam automatically adjusts the learning rate of each parameter during gradient descent. In some circumstances, parameters receive gradients of very different scales, especially when the inputs are not normalized before training. Parameters with larger gradients then diverge if the learning rate is too big, but choosing a small enough learning rate for them makes the whole process iterate too slowly. The adaptive learning rate adjusts the step size for each parameter individually, which speeds up convergence.
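A minimal sketch of one such step (bias correction is omitted for brevity, and the small `eps` term is an added assumption for numerical stability; names are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update: v tracks a running average of squared
    gradients, so parameters with larger gradients take smaller steps."""
    m = beta1 * m + (1 - beta1) * grad              # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad * grad       # adaptive term (second moment)
    theta = theta - alpha * m / (np.sqrt(v) + eps)  # per-parameter learning rate
    return theta, m, v
```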
(b)
i.
since

$$\mathbb{E}_{p_{drop}}[h_{drop}]_i = \gamma\,\mathbb{E}_{p_{drop}}[d_i]\,h_i = \gamma\,(1 - p_{drop})\,h_i$$

while we require

$$\mathbb{E}_{p_{drop}}[h_{drop}]_i = h_i$$

so

$$\gamma = \frac{1}{1 - p_{drop}}$$
ii.
Dropout is a regularization technique that adds noise to layers during training, which makes the network act like an “ensemble” of several smaller sub-networks and therefore more robust. However, adding noise to the network at test time is harmful: it makes predictions noisy and reduces accuracy, so dropout is applied only during training and disabled during evaluation.
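As a minimal sketch of inverted dropout (assuming the scaling constant $\gamma = 1/(1 - p_{drop})$ from part i; the function and argument names are illustrative):

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale by gamma = 1 / (1 - p_drop), so the expected activation equals
    the full activation used at test time."""
    if not training:
        return h                               # no noise at evaluation time
    mask = (np.random.rand(*h.shape) >= p_drop).astype(h.dtype)
    gamma = 1.0 / (1.0 - p_drop)
    return gamma * mask * h
```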
2.
(a)
Stack | Buffer | New dependency | Transition |
---|---|---|---|
[ROOT] | [I, parsed, this, sentence, correctly] | | Initial Configuration |
[ROOT, I] | [parsed, this, sentence, correctly] | | SHIFT |
[ROOT, I, parsed] | [this, sentence, correctly] | | SHIFT |
[ROOT, parsed] | [this, sentence, correctly] | parsed → I | LEFT-ARC |
[ROOT, parsed, this] | [sentence, correctly] | | SHIFT |
[ROOT, parsed, this, sentence] | [correctly] | | SHIFT |
[ROOT, parsed, sentence] | [correctly] | sentence → this | LEFT-ARC |
[ROOT, parsed] | [correctly] | parsed → sentence | RIGHT-ARC |
[ROOT, parsed, correctly] | [] | | SHIFT |
[ROOT, parsed] | [] | parsed → correctly | RIGHT-ARC |
[ROOT] | [] | ROOT → parsed | RIGHT-ARC |
(b)
- Each word must be shifted onto the stack: $n$ steps.
- Each word must also be removed from the stack by a LEFT-ARC or RIGHT-ARC: $n$ steps.

So there are $2n$ steps in total, as the sketch below illustrates for the example sentence.
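A minimal sketch (not the assignment's parser implementation; names are illustrative) that replays a transition sequence and reproduces the table in (a), using exactly $2n = 10$ transitions for the 5-word example sentence:

```python
def parse_with_oracle(sentence, transitions):
    """Replay SHIFT ("S"), LEFT-ARC ("LA"), and RIGHT-ARC ("RA") transitions
    and return the dependencies (head, dependent) they produce."""
    stack, buffer, deps = ["ROOT"], list(sentence), []
    for t in transitions:
        if t == "S":      # SHIFT: move the next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif t == "LA":   # LEFT-ARC: second-from-top depends on the top item
            dependent = stack.pop(-2)
            deps.append((stack[-1], dependent))
        else:             # RIGHT-ARC: top item depends on second-from-top
            dependent = stack.pop()
            deps.append((stack[-1], dependent))
    return deps

deps = parse_with_oracle(
    ["I", "parsed", "this", "sentence", "correctly"],
    ["S", "S", "LA", "S", "S", "LA", "RA", "S", "RA", "RA"],
)
# deps == [("parsed", "I"), ("sentence", "this"), ("parsed", "sentence"),
#          ("parsed", "correctly"), ("ROOT", "parsed")]
```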
(f)
i.
- Error type: Verb Phrase Attachment Error
- Incorrect dependency: wedding → fearing
- Correct dependency: heading → fearing
ii.
- Error type: Coordination Phrase Attachment Error
- Incorrect dependency: rescue → and
- Correct dependency: rescue → rush
iii.
- Error type: Prepositional Phrase Attachment Error
- Incorrect dependency: Midland → named
- Correct dependency: Midland → guy
iv.
- Error type: Modifier Phrase Attachment Error
- Incorrect dependency: elements → most
- Correct dependency: crucial → most