by exgrv 3034 days ago
We decided to keep the casing, as it is useful for some applications such as named entity recognition.

Regarding the punctuation, as pointed out in another comment, these tokens might also be useful for some applications (and they are easy to filter out if you don't need them).
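
For anyone who wants to drop those tokens, here is a minimal sketch of one way to filter them. It assumes a token counts as punctuation when every character falls in a Unicode punctuation category (P*). The helper names `is_punctuation` and `filter_tokens` are made up for illustration and are not part of any released tooling.

```python
import unicodedata


def is_punctuation(token: str) -> bool:
    """Return True if the token is non-empty and made only of Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def filter_tokens(tokens):
    """Keep every token that is not pure punctuation; casing is left untouched."""
    return [t for t in tokens if not is_punctuation(t)]


if __name__ == "__main__":
    sample = ["Paris", ",", "is", "(", "nice", ")", "...", "U.S."]
    print(filter_tokens(sample))
```

Tokens that mix letters and punctuation, such as "U.S.", are kept. Symbols such as "+" or "$" are in the Unicode symbol categories (S*) rather than P*, so they would also be kept unless the check is widened.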


In the Tagalog file, } is near the top but { is over 8,000 lines down. Is there a reason they have such different frequencies? ( and ) are right next to each other.

And yes I realize this is a really odd question :)

This is probably because our preprocessing of Wikipedia did not remove every '}' from the markup.
Oh true. I tried to clean up Wiki markup for ML years ago and it was a huge pain. Next time I think I'll parse the HTML version and pull out the text from the tags explicitly.
This is a much better way to do it. It's easier, cleaner, and it captures the text generated by templates, of which there is a surprising amount (you get weird artifacts from that otherwise).
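
A rough sketch of the "parse the HTML and pull out the text" idea, using only Python's standard `html.parser`. The `TextExtractor` class and `html_to_text` function are hypothetical names for illustration. This is not the preprocessing used for the released files, and a real Wikipedia page would need more care (navigation boxes, tables, references).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = "<p>Manila is the <b>capital</b> of the Philippines.</p><style>p{color:red}</style>"
    print(html_to_text(page))
```

Because the rendered HTML already has templates expanded, no '{' or '}' from wiki markup reaches the output. One known rough edge in this sketch: joining chunks with a space can leave a stray space before punctuation that directly follows an inline tag.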
Your comment has twice as many ) as it does (

My first guess would be emojis ;)