by texaslonghorn5 1445 days ago
In the worst case you can end up with the Scots Wikipedia situation, where a power editor created a bunch of pages in an entirely fabricated, overly stereotypical version of the language, and that influenced what people thought Scots actually was.

This is one of the examples we keep in mind, and it's also why we can't 100% trust public dataset labels. It motivated us to train a Language IDentification system for all the languages we wanted to handle, so that we could build the monolingual dataset. More details are in the paper ;) Or ask here if you have questions.