Hacker News
by morgante 3560 days ago
> Crucially, this means they are not widely observed in printed standard English, which in turn means they can't be relevant to training a language model to understand printed standard English.

I agree that they're not widely observed in written English, but they are consistently observed in the WSJ, which was the origin of this entire debate.

As lqdc13 pointed out, NLP still isn't consistently good at understanding even standard English. One could reasonably posit that this is due to the inherent ambiguity and inconsistency of most writing, and that by focusing on a narrower, standardized corpus (the WSJ) you could get better initial results. What, exactly, is controversial about that? Do you really think that the language of the WSJ is no more consistent and formalized than the language of Twitter users?