by yareally
4832 days ago
If you plan on doing both, which is pretty easy with their API (you can grab all the available languages for an article along with the URL), I would test against foreign languages first and then English once you have a basic parser and search going. Non-English articles had more weirdness, but because it happened more often, it was easier to spot and eliminate the similar cases in English articles, where they come up less frequently. I ended up doing a lot of unit testing against various edge cases to make sure things were working. Even with that, I would still log any anomalies and set them aside for manual inspection later (by running checks on what "good" data should look like), just to be safe.
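
To give a rough idea of the "check what good data should look like" step, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the article dict layout (`url`, `title`, `text`), the 50-character threshold, and the specific checks (including the `{{`/`}}` test, which assumes wiki-style template markup) are hypothetical and not what I actually ran.

```python
import logging

logger = logging.getLogger("parser.anomalies")


def find_anomalies(article):
    """Return a list of reasons a parsed article looks suspicious.

    `article` is a hypothetical dict with "title" and "text" keys.
    """
    problems = []
    title = article.get("title", "")
    text = article.get("text", "")
    if not title.strip():
        problems.append("empty title")
    if len(text) < 50:  # arbitrary threshold, tune for your data
        problems.append("text too short")
    if "{{" in text or "}}" in text:  # assumes wiki-style templates
        problems.append("leftover template markup")
    if "\ufffd" in text:  # Unicode replacement char, usually a decoding bug
        problems.append("replacement character in text")
    return problems


def triage(articles):
    """Split articles into (good, suspect) and log every suspect one."""
    good, suspect = [], []
    for article in articles:
        problems = find_anomalies(article)
        if problems:
            logger.warning(
                "anomaly in %r: %s", article.get("url"), "; ".join(problems)
            )
            suspect.append((article, problems))
        else:
            good.append(article)
    return good, suspect
```

The suspect list is what gets set aside for manual inspection. Each case found that way can then be turned into a unit test, so the same edge case does not slip through again.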