Reading it back, my comment wasn't the best explanation of the issue.

The reason is that what we're really doing here is predicting a structure (a parse tree), but we've encoded the problem as a series of local steps. Think of it like navigating to a goal: we get there by predicting a series of local actions.

Try stepping through the decision process.[1] This should give you a feel for the local decisions, and how they build the larger structure.
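If you'd rather see the idea in code, here's a minimal sketch of local actions building up a tree. It assumes a toy arc-standard-style transition system with three actions (SHIFT, LEFT, RIGHT); the function name, the action names and the example sentence are made up for illustration, and this is not spaCy's actual transition system.

```python
def apply_actions(words, actions):
    """Apply a sequence of local actions; return the head index of each word.

    Toy arc-standard-style system (an assumption for illustration only).
    The root word keeps a head of None.
    """
    stack = []
    buffer = list(range(len(words)))
    heads = [None] * len(words)
    for action in actions:
        if action == "SHIFT":
            # Move the next word from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            # Second-from-top becomes a child of the top of the stack.
            child = stack.pop(-2)
            heads[child] = stack[-1]
        elif action == "RIGHT":
            # Top of the stack becomes a child of the word below it.
            child = stack.pop()
            heads[child] = stack[-1]
        else:
            raise ValueError(f"unknown action: {action}")
    return heads


words = ["She", "ate", "pizza"]
actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]
heads = apply_actions(words, actions)
print(heads)  # [1, None, 1]: "She" and "pizza" both attach to "ate"
```

No single action "knows" about the tree. The structure only exists once the whole sequence has been applied, which is why one bad local decision can cost you several arcs later on.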

If we use an online learner, we can do imitation learning using an analytic method, introduced in 2012, for calculating the global loss of a local action (the "dynamic oracle").

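Specifically, during training we generate examples with the parser, and label them with this "dynamic oracle". A large batch size means we're generating the examples with a model that's "out of date": the weights only change at the end of each batch, so the parser keeps visiting states that reflect mistakes it hasn't had a chance to learn from yet.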
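To make the "out of date" point concrete, here's a rough sketch using the navigation analogy rather than a real parser. Everything in it is an assumption for illustration: the 1-D world, the `oracle_costs` function standing in for a dynamic oracle (it prices every action from whatever state the model has wandered into), the single feature, and the perceptron-style update. The learner follows its own predictions during training, and updates are only applied once `batch_size` mistakes have piled up.

```python
from collections import defaultdict

ACTIONS = (-1, +1)  # step left or step right along a line


def oracle_costs(position, goal):
    """Toy stand-in for a dynamic oracle: the cost of each action from ANY
    state, measured as the distance left to the goal after taking it."""
    return {a: abs(goal - (position + a)) for a in ACTIONS}


def features(position, goal):
    return ("goal_right" if goal > position else "goal_left",)


def predict(weights, feats):
    scores = {a: sum(weights[(f, a)] for f in feats) for a in ACTIONS}
    return max(ACTIONS, key=lambda a: scores[a])


def train(episodes, batch_size):
    """Return how many wrong actions the model took while training."""
    weights = defaultdict(float)
    pending = []  # mistakes waiting to be applied as one batch
    mistakes = 0
    for start, goal in episodes:
        position = start
        for _ in range(2 * abs(goal - start) + 4):  # step limit
            if position == goal:
                break
            feats = features(position, goal)
            guess = predict(weights, feats)
            costs = oracle_costs(position, goal)
            best = min(ACTIONS, key=lambda a: costs[a])
            if guess != best:
                mistakes += 1
                pending.append((feats, guess, best))
            if len(pending) >= batch_size:
                # Perceptron-style update: reward the oracle's action,
                # penalise the action the model actually chose.
                for fs, wrong, right in pending:
                    for f in fs:
                        weights[(f, right)] += 1.0
                        weights[(f, wrong)] -= 1.0
                pending.clear()
            position += guess  # follow the model's own action, not the oracle's
    return mistakes


episodes = [(0, 3), (0, -3)]
online_mistakes = train(episodes, batch_size=1)
batched_mistakes = train(episodes, batch_size=4)
print(online_mistakes, batched_mistakes)  # 1 4
```

With `batch_size=1` the model steps the wrong way once, gets corrected, and the rest of its states come from the improved policy. With `batch_size=4` it repeats the same wrong step four times before any update lands, so most of the states it was labelled on came from the stale model.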

[1] http://spacy.io/displacy/?manual=Shift%20words%20onto%20the%....