| HN Mirror



	by hywel 4276 days ago
	Based on a 2-sec look at the code, it's using a built-in database of trigrams as a predictor of the language. https://github.com/wooorm/franc/blob/master/lib/data.json

2 comments

allan_s 4276 days ago

My bad, I looked in the data folder first and didn't find anything. I should have tried harder.


riffraff 4276 days ago

the question would be where he got the language data

If the original language data is available I'd suggest classifying the trigrams as "high" and "low" frequency, which should improve performance without needing to keep full frequency data.


wooorm 4276 days ago

No full-frequency data is kept; only the top 300 trigrams are identified. A quick look through the source also reveals wooorm/trigrams and wooorm/udhr as sources!
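
For readers who want to see what a top-300 trigram profile looks like in practice, here is a minimal Python sketch. It is not franc's actual code: the function names (`trigrams`, `top_trigrams`, `distance`, `detect`) are made up for illustration, and the rank-distance scoring is the classic "out-of-place" measure, which is an assumption about how such profiles are typically compared.

```python
from collections import Counter


def trigrams(text):
    """Return overlapping 3-character windows of lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def top_trigrams(text, limit=300):
    """Return up to `limit` trigrams, most frequent first (ties broken alphabetically)."""
    counts = Counter(trigrams(text))
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:limit]


def distance(sample_profile, language_profile, missing_penalty=300):
    """Sum of rank differences; a trigram absent from the language costs `missing_penalty`."""
    rank = {t: i for i, t in enumerate(language_profile)}
    return sum(
        abs(i - rank[t]) if t in rank else missing_penalty
        for i, t in enumerate(sample_profile)
    )


def detect(text, profiles, limit=300):
    """Return the key of the profile with the smallest distance to the sample."""
    sample = top_trigrams(text, limit)
    return min(profiles, key=lambda lang: distance(sample, profiles[lang]))


# Toy profiles built from one sentence each; real profiles would come from a corpus.
profiles = {
    "en": top_trigrams("the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    "es": top_trigrams("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
guess = detect("the cat and the dog", profiles)
```

Only the ranked list is stored per language, so the counts themselves can be thrown away once the profile is built.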


riffraff 4276 days ago

Yes, I meant: keeping full frequency data can be avoided to save space/memory, but having two classes (high/low) could be a good tradeoff.
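
One possible reading of the high/low idea, sketched in Python. Everything here is hypothetical: the names `two_tier_profile` and `score`, the 25% cut-off, and the 2/1 weights are illustrative assumptions, not something taken from franc or tested against real data.

```python
def two_tier_profile(ranked, high_share=0.25):
    """Split a most-frequent-first trigram list into 'high' and 'low' sets.

    No counts or ranks are kept, only membership in one of two classes.
    """
    cut = max(1, int(len(ranked) * high_share))
    return {"high": frozenset(ranked[:cut]), "low": frozenset(ranked[cut:])}


def score(sample_trigrams, profile, high_weight=2, low_weight=1):
    """Add `high_weight` per high-class hit, `low_weight` per low-class hit, 0 otherwise."""
    total = 0
    for t in sample_trigrams:
        if t in profile["high"]:
            total += high_weight
        elif t in profile["low"]:
            total += low_weight
    return total


# Hand-written ranked list standing in for a real language profile.
ranked = ["the", "he ", "and", "ing", "ion", "ent", "ati", "for"]
profile = two_tier_profile(ranked, high_share=0.25)
points = score(["the", "and", "zzz"], profile)
```

The language with the highest score would win. Whether two classes carry enough signal to separate closely related languages is exactly the open question.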


wooorm 4275 days ago

It’s an interesting thought. I might fiddle with it, but I’m not sure it would work in practice (d’oh). Thanks!
