by jekub 4837 days ago

There are some interesting things in this post, but the main problem is not whether to use Judy or something else, nor is it memory or complexity. The main problem I see here is using the wrong tool.

Replace the token matcher with a simple classifier (a maxent model will work very well here) fed with character n-gram features through a hash kernel, and you get a very accurate, fast, and low-memory system.

I built one two years ago that accurately classified a bit over 100 different languages with only 64k features in the end (so requiring only 8*64 KB of memory). And this was without using file extensions, as they weren't available in our case.
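To make the idea concrete, here is a minimal sketch of a maxent (softmax) classifier over hashed character n-grams. It is not the system described above, just an illustration under a few assumptions: each (label, n-gram) pair is hashed jointly into one shared table of 64k weights, which is one way to end up at 8 bytes * 64k of memory; `zlib.crc32` stands in for the hash function; the n-gram range 1..3, the learning rate, and the names `char_ngrams`, `bucket`, and `HashedMaxent` are all made up for the example.

```python
import math
import zlib
from array import array

NUM_BUCKETS = 1 << 16  # 64k weight slots (assumed to match the "64k features" figure)


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max (range is an assumption)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def bucket(label, gram):
    """Hash a (label, n-gram) pair into one of NUM_BUCKETS slots (the hash kernel)."""
    key = f"{label}\x00{gram}".encode("utf-8")
    return zlib.crc32(key) % NUM_BUCKETS


class HashedMaxent:
    """Softmax classifier whose only state is one fixed-size weight table."""

    def __init__(self, labels):
        self.labels = list(labels)
        # One double (8 bytes) per slot, no dictionary of n-grams is stored.
        self.weights = array("d", [0.0]) * NUM_BUCKETS

    def _slots(self, text):
        # For each label, the weight slots touched by this text's n-grams.
        grams = list(char_ngrams(text))
        return {lab: [bucket(lab, g) for g in grams] for lab in self.labels}

    def _probs(self, slots):
        scores = {lab: sum(self.weights[i] for i in idx) for lab, idx in slots.items()}
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {lab: math.exp(s - top) for lab, s in scores.items()}
        total = sum(exps.values())
        return {lab: e / total for lab, e in exps.items()}

    def train(self, samples, epochs=20, lr=0.1):
        # Plain SGD on the log-likelihood: push the gold label up, the others down.
        for _ in range(epochs):
            for text, gold in samples:
                slots = self._slots(text)
                probs = self._probs(slots)
                for lab, idx in slots.items():
                    grad = (1.0 if lab == gold else 0.0) - probs[lab]
                    for i in idx:
                        self.weights[i] += lr * grad

    def predict(self, text):
        probs = self._probs(self._slots(text))
        return max(probs, key=probs.get)


if __name__ == "__main__":
    # Toy data, far too small to mean anything; labels are hypothetical.
    model = HashedMaxent(["python", "c"])
    model.train([("def main():\n    return 0", "python"),
                 ("int main(void) { return 0; }", "c")])
    print(model.predict("def f(x): return x"))
    print(len(model.weights) * model.weights.itemsize, "bytes of weights")
```

Hash collisions just make a few n-grams share a weight, which the classifier tolerates well in practice, and the memory stays fixed no matter how many distinct n-grams or labels show up. A real system would want regularization, a tuned learning rate, and a lot more training data than this.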

Before any hard optimization, first check the method used, then the algorithm; everything else should come after.