by tadkar 1917 days ago
There is a similar, great project here [1] that uses the Hungarian Wikipedia corpus. It's a great workout for non-English and maybe non-ASCII operations.

The performance of Java there is super impressive. It should port relatively quickly to this file too...

[1] https://github.com/juditacs/wordcount

1 comment

Great, it would be nice if Java were included by the OP.