by tadkar
1917 days ago
There is a similar, great project here [1] that uses the Hungarian Wikipedia corpus. It is a great workout for non-English and possibly non-ASCII text handling. Java's performance there is very impressive. It should port relatively quickly to this file too...

[1] https://github.com/juditacs/wordcount