Hacker News
by riffraff 4273 days ago
The question would be where he got the language data.

If the original language data is available I'd suggest classifying the trigrams as "high" and "low" frequency, which should improve performance without needing to keep full frequency data.

No full-frequency data is kept; only the top 300 trigrams are identified. A quick look through the source also reveals wooorm/trigrams and wooorm/udhr as sources!
Yes, what I meant is that full frequency data can be left out to save space/memory, but keeping two classes (high/low) could be a good tradeoff.
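A minimal sketch of the high/low idea, in Python. This is an illustration under assumptions, not how the library under discussion works: the function names (`trigrams`, `build_profile`, `score`, `guess`), the 100/300 split, and the 2-to-1 weights are all hypothetical. Each language keeps two sets of trigrams instead of per-trigram counts, and a text is scored by how many of its trigrams land in each set.

```python
from collections import Counter


def trigrams(text):
    """Return overlapping character trigrams of lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(sample, top_n=300, high_n=100):
    """Keep only class membership: the top `high_n` trigrams are "high",
    the rest of the top `top_n` are "low". No counts are stored.
    Both cut-offs are assumed values, not taken from any real library."""
    ranked = [t for t, _ in Counter(trigrams(sample)).most_common(top_n)]
    return {"high": frozenset(ranked[:high_n]), "low": frozenset(ranked[high_n:])}


def score(text, profile, high_weight=2, low_weight=1):
    """Sum a weight per trigram of `text` found in the profile (weights are assumptions)."""
    total = 0
    for t in trigrams(text):
        if t in profile["high"]:
            total += high_weight
        elif t in profile["low"]:
            total += low_weight
    return total


def guess(text, profiles):
    """Return the language key whose profile gives `text` the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))


# Toy usage with tiny made-up samples; real profiles would come from a corpus.
profiles = {
    "en": build_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": build_profile("der schnelle braune fuchs springt ueber den faulen hund"),
}
print(guess("the dog and the fox", profiles))
```

With real corpora the cut-off and weights would need tuning, and whether this beats a plain top-300 list is untested here.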
It’s an interesting thought. I might fiddle on it, but I’m not sure it would work in practice (d’oh). Thanks!