Most of the parameters are in the language model (LLaMA-7B), so the techniques that let LLaMA run on a single GPU should mostly carry over, especially lower-precision tricks. If you only want to run inference (forward pass, no training), it should be pretty doable.
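To put rough numbers on the lower-precision point, here is a back-of-the-envelope sketch of weight memory at a few precisions. It assumes a round 7e9 parameters and counts weights only: activations, the KV cache, and whatever sits outside the language model are ignored, so real usage will be higher. The function name and the precision table are just illustrative.

```python
# Rough weight-only memory estimate for a model at different precisions.
# Assumption: storage is simply (parameter count) x (bytes per parameter);
# activations, KV cache, and non-language-model components are NOT included.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    n_params = 7e9  # assumed round figure for a 7B-parameter model
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: ~{weight_memory_gb(n_params, precision):.1f} GB")
```

By this estimate fp16 weights need about 14 GB and 4-bit weights about 3.5 GB, which is why quantization is usually what decides whether it fits on one card.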

You can almost certainly run it on a consumer GPU if you also swap out the language model for something smaller (although performance on the language side would definitely not be as good).