by wilsonmitchell 1471 days ago
That's really interesting. Let me know if you'd want to talk to me or Michael sometime about it. With the models we run currently, you'd really have to have a GPU to run locally and get a lot of utility. I'm curious if you have some thoughts on how to run these large language models on edge devices.

I'm wilson@ our website (trying to avoid too much spam from bots).

1 comment

I sent y'all an email, but figured I'd re-post here for any curious hackers. I spent two years obsessed with autocomplete for mobile/edge use cases.

The first step is to get any functional offline model (1), then prune/project a large language model's representation until you can perform on-device inference (2). You can calculate variance and hit/miss statistics for a body of text and model proposals (3), which you can either feed into a ranking model (4) for an extra layer of personalization, or use to re-balance the Euclidean projection of your model's layers (4) to optimize for sparseness.

1) Locally store a Trie data structure, where keys are n-grams of user input

Surprisingly effective, considering most business communication uses a limited vocabulary. If your users are submitting fewer than 10,000 unique English words (stop words removed) per day, try this out.

One thing I really liked about the Trie approach is that corporate jargon appears in real-time, since the "model" is just a data retrieval algorithm. You don't need to modify a vocabulary and re-train/fine-tune a neural network to achieve personalization.

The downside is that you're limited to bi/tri-grams before performance degrades, although YMMV. Auto-completing bi/tri-grams does feel tedious after a while.
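A minimal sketch of the Trie idea from step 1, assuming word-level n-grams capped at trigrams. The `NGramTrie` class, its `max_n` parameter, and the back-off in `suggest` are illustrative choices, not the implementation described above.

```python
class NGramTrie:
    """Word-level trie: each path from the root is an n-gram seen in user input."""

    def __init__(self, max_n=3):
        # max_n is the longest n-gram stored (assumed >= 2 so there is context to predict from).
        self.max_n = max_n
        self.root = {"count": 0, "children": {}}

    def add(self, text):
        """Insert every n-gram (up to max_n words) that starts at each word of `text`."""
        words = text.lower().split()
        for start in range(len(words)):
            node = self.root
            for word in words[start:start + self.max_n]:
                node = node["children"].setdefault(word, {"count": 0, "children": {}})
                node["count"] += 1

    def suggest(self, context, k=3):
        """Return up to k next-word proposals, backing off to shorter contexts."""
        words = context.lower().split()[-(self.max_n - 1):]
        while words:
            node = self.root
            for word in words:
                node = node["children"].get(word)
                if node is None:
                    break
            if node is not None and node["children"]:
                ranked = sorted(node["children"].items(),
                                key=lambda kv: (-kv[1]["count"], kv[0]))
                return [word for word, _ in ranked[:k]]
            words = words[1:]  # drop the oldest word and retry
        return []


trie = NGramTrie()
trie.add("please see the attached report")
trie.add("please see the attached invoice")
trie.add("please see the attached report")
print(trie.suggest("see the attached"))
```

Because `add` can be called on every message the user sends, new jargon becomes a proposal as soon as it has been typed once, with no retraining step.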

2) Fine-tune and prune a large language model, then make it sparse

I noticed y'all offer some degree of personalization. Have you tried pruning or compressing your model after fine-tuning? The exact technique will depend on your base model's architecture but in general, try using a sparser representation.

Use accelerators designed to operate on sparse representations, for example the sparse operations in TensorFlow Lite's XNNPACK backend. XNNPACK is a backend engine with optimized kernels for native CPUs and for WebAssembly (including WASM SIMD), so you can accelerate inference on the client's CPU without requiring a GPU.
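To make "prune, then use a sparse representation" concrete, here is a toy sketch in plain Python. It uses magnitude pruning and a CSR-style layout, which are assumptions picked for illustration; the helper names (`magnitude_prune`, `to_csr`, `csr_matvec`) are hypothetical and this is not XNNPACK's API.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the smallest-magnitude fraction `sparsity` of a 2-D weight list.

    Ties at the threshold are all dropped, so the result can be slightly sparser.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_drop = int(len(flat) * sparsity)
    if n_drop == 0:
        return [list(row) for row in weights]
    threshold = flat[n_drop - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def to_csr(dense):
    """Store only the non-zero entries: values, their column indices, and row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply the sparse matrix by vector x, touching only stored entries."""
    return [
        sum(values[i] * x[col_idx[i]] for i in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]


weights = [[0.1, -2.0, 0.05], [1.5, 0.0, -0.3]]
pruned = magnitude_prune(weights, sparsity=0.5)
values, col_idx, row_ptr = to_csr(pruned)
print(pruned)
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))
```

The work in `csr_matvec` scales with the number of surviving weights rather than the full matrix size, which is the property sparse-aware kernels exploit.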

3) Collect permutation variance and hit/miss statistics

The exact technique/algorithm will depend on your model architecture, but matched averaging, for example, is a way to express the average number of neighborhood permutations with respect to the input dataset. In other words, the client sends statistics about predictions in Euclidean space, not your literal keystrokes.
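The hit/miss half of step 3 can be sketched with simple counters. This is not matched averaging; it only shows how a client could report aggregate statistics about proposals instead of the typed text. `ProposalStats` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ProposalStats:
    """Aggregate counters about model proposals; no typed text is stored."""
    hits: int = 0
    misses: int = 0
    rank_counts: dict = field(default_factory=dict)  # rank of accepted proposal -> count

    def record(self, proposals, typed):
        """Compare what the user actually typed against the ranked proposals."""
        if typed in proposals:
            rank = proposals.index(typed)
            self.hits += 1
            self.rank_counts[rank] = self.rank_counts.get(rank, 0) + 1
        else:
            self.misses += 1

    def summary(self):
        """The payload a client could upload: counts and rates only."""
        total = self.hits + self.misses
        return {
            "total": total,
            "hit_rate": self.hits / total if total else 0.0,
            "rank_counts": dict(self.rank_counts),
        }


stats = ProposalStats()
stats.record(["report", "invoice"], "invoice")  # hit at rank 1
stats.record(["memo", "note"], "deck")          # miss
stats.record(["thanks"], "thanks")              # hit at rank 0
print(stats.summary())
```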

4) Use matched averaging to adjust model cardinality or train an additional ranking model

The statistics collected by step 3 can be used to train a personalized ranking model, with the goal of re-ranking the proposals from step 2.
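As a rough illustration of re-ranking, the sketch below boosts each proposal's base score by how often this user has accepted it before. A trained ranking model would replace this heuristic; `rerank`, `accept_counts`, and `alpha` are assumptions made for the example.

```python
import math


def rerank(proposals, accept_counts, alpha=1.0):
    """Re-order (word, base_score) proposals using per-user acceptance counts.

    alpha controls how strongly personal history outweighs the base model score.
    """
    def score(item):
        word, base_score = item
        return base_score + alpha * math.log1p(accept_counts.get(word, 0))

    return [word for word, _ in sorted(proposals, key=score, reverse=True)]


proposals = [("report", 0.5), ("invoice", 0.4), ("memo", 0.1)]
accept_counts = {"invoice": 3}  # this user has accepted "invoice" three times
print(rerank(proposals, accept_counts))
```

Setting `alpha=0` falls back to the base model's ordering, which is a handy sanity check when tuning.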

You can also use these statistics to introspect the "embedding space" of a language model, with the goal of identifying compression/pruning opportunities to improve the model's real-time performance. Reducing cardinality in the embedding projection has an outsize impact on inference speed, and you can usually drop most of the language model after observing the range of language used by the client.
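One simple reading of "reducing cardinality" is to keep only the embedding rows for words the client actually uses. The sketch below assumes a `vocab` dict mapping word to row index and a list-of-lists embedding table; `shrink_vocab` is a hypothetical helper, and a real model would need the same treatment applied to its output projection plus a fallback token for unseen words.

```python
def shrink_vocab(embeddings, vocab, observed_words):
    """Keep only embedding rows for observed words and re-index them densely."""
    keep = [word for word in vocab if word in observed_words]
    new_vocab = {word: i for i, word in enumerate(keep)}
    new_embeddings = [embeddings[vocab[word]] for word in keep]
    return new_embeddings, new_vocab


vocab = {"the": 0, "synergy": 1, "zebra": 2}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
small_embeddings, small_vocab = shrink_vocab(embeddings, vocab, {"the", "zebra"})
print(small_vocab, small_embeddings)
```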

You can also use matched averaging to compare hidden <-> hidden weights between many models using distance measurements (such as Euclidean or cosine distance).
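For a feel of what comparing hidden weights by distance looks like, here is a toy sketch that pairs up hidden units from two weight matrices by cosine distance. The greedy pairing is a deliberate simplification and not the matched averaging algorithm itself; `cosine_distance` and `greedy_match` are illustrative names.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def greedy_match(rows_a, rows_b):
    """Pair each row of rows_a with its nearest not-yet-used row of rows_b."""
    unused = set(range(len(rows_b)))
    pairs = []
    for i, u in enumerate(rows_a):
        if not unused:
            break
        j = min(unused, key=lambda j: (cosine_distance(u, rows_b[j]), j))
        unused.remove(j)
        pairs.append((i, j))
    return pairs


# Two tiny "hidden layers" whose units are the same directions in a different order.
layer_a = [[1.0, 0.0], [0.0, 1.0]]
layer_b = [[0.0, 2.0], [3.0, 0.0]]
print(greedy_match(layer_a, layer_b))
```

Units that match closely across models are candidates for averaging, while units with no close match hint at model-specific capacity.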

This is WAY more than I originally intended to write - but I hope this helps!

Thanks for the thoughts here. Will follow up over email!