by sp332 3034 days ago
In the Tagalog file, } is near the top but { is over 8,000 lines down. Is there a reason they have such different frequencies? ( and ) are right next to each other.

And yes I realize this is a really odd question :)


This is probably due to our preprocessing of Wikipedia that did not get rid of all the '}' from the markup.
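One quick way to spot this kind of leftover markup is to compare how often each opening and closing bracket occurs in the cleaned text. The snippet below is only a rough sketch of that check; `bracket_counts` is a made-up helper name, not something from the actual preprocessing pipeline.

```python
from collections import Counter


def bracket_counts(text, pairs=("()", "{}", "[]")):
    """Return {pair: (open_count, close_count)} for each bracket pair.

    Hypothetical helper for illustration: a large gap between the two
    counts hints that markup was only partially stripped.
    """
    counts = Counter(text)
    return {pair: (counts[pair[0]], counts[pair[1]]) for pair in pairs}


if __name__ == "__main__":
    sample = "some text }} more text (a note) }"
    for pair, (n_open, n_close) in bracket_counts(sample).items():
        print(pair, n_open, n_close)
```

A pair like `{}` with far more closers than openers would match the symptom described in the question.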
Oh true. I tried to clean up Wiki markup for ML years ago and it was a huge pain. Next time I think I'll parse the HTML version and pull out the text from the tags explicitly.
This is a much better way to do it. It's easier, cleaner, and it also captures the text generated by templates, of which there is a surprising amount (you get weird artifacts otherwise).
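For anyone curious, here is a minimal sketch of the "parse the HTML and pull out the text" approach, using only Python's standard-library `html.parser`. The names `TextExtractor` and `html_to_text` are my own for illustration, and the choice to skip `script` and `style` is an assumption; a real Wikipedia pipeline would probably also want to drop navigation boxes, reference lists, and so on.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script/style tags."""

    SKIP = {"script", "style"}  # assumption: these never hold article text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Note that joining chunks with a space is a simplification: it can split a word that is only partly wrapped in an inline tag, so treat this as a starting point and not a finished cleaner.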
Your comment has twice as many ) as it does (

My first guess would be emojis ;)