by yareally 4841 days ago
> because the wiki syntax is extremely convoluted and there is no formal spec

I ran into that when parsing pages with Python for an app I am working on. Parsing by conditions leads to a lot of conditions for edge cases, which, as one might expect, come up more often the more obscure the topic gets, since those pages have not been updated or brought in line with the formatting of heavily trafficked articles. If you are looking for something in particular, ranking the elements on a page helps to a point, provided the elements you want are the ones that occur most often or close to it.
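To make the ranking idea concrete, here is a rough sketch. It assumes the "elements" are template names (whatever follows `{{`) and that a simple regex is good enough for counting. `rank_templates` and the sample markup are made up for illustration and are not from any library.

```python
import re
from collections import Counter

# Captures the name right after "{{", up to the first "|", "}" or newline.
# Assumption for illustration only: real wiki markup has cases this misses.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}\n]+?)\s*(?=[|}\n])")

def rank_templates(wikitext, top=5):
    """Return (name, count) pairs for the most common template names."""
    names = (m.group(1).lower() for m in TEMPLATE_NAME.finditer(wikitext))
    return Counter(names).most_common(top)

# Hypothetical sample markup, not taken from a real article.
sample = """{{Infobox person|name=Ada}}
Text {{cite web|url=http://example.org}} more {{cite web|url=http://example.com}}
{{Cite web |url=http://example.net}}"""

ranked = rank_templates(sample)
print(ranked)  # [('cite web', 3), ('infobox person', 1)]
```

Lowercasing the names folds `Cite web` and `cite web` together, which is one of the small inconsistencies that otherwise splits the counts.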

Aside from the more obscure, less trafficked articles, I noticed many of the non-English wiki articles are also formatted in awkward ways and appear far less up to date than their English counterparts. I thought I had most edge cases covered until I started parsing wiki markup for other languages.


Ah, thanks for the warning. I haven't even touched the non-English articles yet :/

If you plan on doing both, which is pretty easy with their API (you can grab all the available languages for an article along with the URLs), I would test against the foreign-language articles first, and then English once you have a basic parser and search going. Non-English had more weirdness, but because it happened more often, it became easier to eliminate similar cases in English articles, where they show up less frequently.
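For grabbing the languages and URLs, something along these lines might work. It is a sketch that assumes the MediaWiki Action API's `prop=langlinks` query with `llprop=url`, and it assumes the JSON response has the shape shown in the hand-written sample below. `langlinks_url` and `parse_langlinks` are hypothetical helper names, and no request is actually sent here.

```python
from urllib.parse import urlencode

def langlinks_url(title, lang="en", limit=500):
    """Build an API URL asking for an article's language links."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "llprop": "url",    # ask for the full URL of each language version
        "lllimit": limit,
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, urlencode(params))

def parse_langlinks(payload):
    """Map language code -> article URL from a decoded API response."""
    links = {}
    for page in payload.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link.get("url")
    return links

# Hand-written sample in the assumed response shape (not real API output).
sample = {"query": {"pages": {"736": {
    "title": "Albert Einstein",
    "langlinks": [
        {"lang": "de", "url": "https://de.wikipedia.org/wiki/Albert_Einstein"},
        {"lang": "fr", "url": "https://fr.wikipedia.org/wiki/Albert_Einstein"},
    ],
}}}}

url = langlinks_url("Albert Einstein")
links = parse_langlinks(sample)
```

Articles with many translations may need more than one request to get the full list, which this sketch does not handle.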

I ended up doing a lot of unit testing against various edge cases to make sure things were working. Even with that, I would still log any anomalies and set them aside for manual inspection later (by running checks on what "good" data should look like), just to be safe.
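The anomaly logging could look roughly like the following. It is only a sketch: the `title` and `text` fields, the specific checks, and the 20-character threshold are all assumptions chosen for illustration, and what "good" data means depends on what you are extracting.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("wiki_parser")

def problems(record):
    """Return reasons a parsed record does not look like good data."""
    found = []
    title = record.get("title") or ""
    text = record.get("text") or ""
    if not title.strip():
        found.append("empty title")
    if "{{" in text or "}}" in text:
        found.append("leftover template markup")
    if "[[" in text or "]]" in text:
        found.append("leftover link markup")
    if len(text) < 20:  # arbitrary threshold, tune for your data
        found.append("suspiciously short text")
    return found

def split_records(records):
    """Separate clean records from ones set aside for manual inspection."""
    good, anomalies = [], []
    for record in records:
        issues = problems(record)
        if issues:
            log.warning("anomaly in %r: %s", record.get("title"), "; ".join(issues))
            anomalies.append((record, issues))
        else:
            good.append(record)
    return good, anomalies

# Hypothetical parsed records.
records = [
    {"title": "Ada Lovelace", "text": "Ada Lovelace was an English mathematician and writer."},
    {"title": "Stub", "text": "See [[Other]]."},
]
good, anomalies = split_records(records)
```

Keeping the reasons next to each flagged record makes the later manual pass faster, and recurring reasons are good candidates for new unit tests.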