the question would be where he got the language data
If the original language data is available I'd suggest classifying the trigrams as "high" and "low" frequency, which should improve performance without needing to keep full frequency data.
No full-frequency data is kept, only 300 top-trigrams are identified. A quick through the source also reveals wooorm/trigrams, and wooorm/udhr, as sources!