by upon_drumhead 919 days ago
I’m a tad confused

> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.

But then they download the models from Hugging Face. I don’t understand how these are smaller. Or do they modify them locally?


https://github.com/mit-han-lab/TinyChatEngine

Turns out the original source is actually somewhat informative, including telling you how much hardware you need. This blog post looks like the typical note you leave for yourself to annotate a bit of your shell history.

I wish all these repos were clearer about their hardware requirements. Seeing that it runs on an 8 GB Raspberry Pi, probably with abysmal performance, I'd say that it will run on the CPU of my 32 GB Intel laptop. Will it run on its Nvidia card? I remember the rule of thumb being one GB of GPU RAM per billion parameters, so I'd say that it won't. However, this has 4-bit quantization, so it could have lower requirements.

Of course, the main problem is that I don't know enough about the subject to reason about it on my own.

Roughly speaking, I believe it's the number of parameters times the size of each parameter. So in the 4-bit case it's half a gigabyte per billion parameters.
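
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the 7-billion-parameter model size are just illustrative assumptions, not anything taken from the TinyChatEngine repo. It counts only the weights and ignores activations, the KV cache, and the small overhead of quantization scales, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at a few common precisions.
    for bits in (16, 8, 4):
        print(f"7B parameters at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

By this estimate a 4-bit model needs about a quarter of the memory of the same model in 16-bit floats, which is why quantized models can fit on hardware that the unquantized versions can't.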

From a performance point of view (quantized) integer parameters are going to run better on CPUs than floating point parameters.

Your assessment is exactly correct -- the blog post is my note-to-self about getting the repo to work. My "added value" in the post is a Dockerfile for ease of installation.
They have postprocessed the models specifically for size and latency. They published several papers on this.

Their optimized models are not downloaded from HF, but from Dropbox. I have no idea why.