Hacker News
by yareally 4832 days ago
If you plan on doing both, which is pretty easy to do with their API (you can grab all the potential languages for an article along with the URL), I would test against foreign languages first, and then against English once you have a basic parser and search going. Non-English articles had more weirdness, but because it showed up more often, it was easier to track down and eliminate. Fixing it there also took care of similar cases in English articles, where they happen less frequently.

I ended up doing a lot of unit testing against various edge cases to make sure things were working. Even with that, I would still log any anomalies (by running checks on what "good" data should look like) and set them aside for manual inspection later, just to be safe.
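
To make the "check what good data should look like" idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the ParsedArticle fields, the specific checks, and the 50-character threshold are hypothetical, not what I actually ran. The idea is that each record goes through a few cheap sanity checks, and anything that fails is logged and set aside instead of being dropped or silently accepted.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("parser.anomalies")


@dataclass
class ParsedArticle:
    # Hypothetical record shape, chosen only for this example.
    url: str
    lang: str
    title: str
    body: str


def find_problems(article):
    """Return a list of reasons the record does not look like "good" data."""
    problems = []
    if not article.title.strip():
        problems.append("empty title")
    if len(article.body) < 50:  # arbitrary threshold, tune for your data
        problems.append("body suspiciously short")
    if "\ufffd" in article.title or "\ufffd" in article.body:
        problems.append("replacement character (likely bad decoding)")
    if "<" in article.body and ">" in article.body:
        problems.append("possible leftover markup")
    return problems


def partition(articles):
    """Split records into clean ones and ones set aside for manual review."""
    good, quarantined = [], []
    for article in articles:
        problems = find_problems(article)
        if problems:
            logger.warning(
                "anomaly in %s (%s): %s",
                article.url, article.lang, "; ".join(problems),
            )
            quarantined.append((article, problems))
        else:
            good.append(article)
    return good, quarantined
```

The quarantined list keeps the reasons next to each record, so the later manual pass can group failures by cause and turn the recurring ones into new unit tests.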