Hacker News
by rightbyte 1280 days ago
How much disk space do 175B parameters use? Is it a float or half-precision float per parameter, or does it need pointers to connections too?

Given how responses are generated in seconds and for free, I am fairly sure it could run on a desktop computer.


One float per param, so naively 175*4 = ~700GB on disk. Most recent models are trained in FP16 or BF16, so 350GB. There's also some work on quantizing them to INT8, which knocks that down to a mere 175GB. You can definitely run it on a desktop computer using RAM and NVMe offload to make up for the fact that you probably don't have 175GB of GPU memory available, but it won't be fast: https://huggingface.co/blog/bloom-inference-pytorch-scripts
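For anyone who wants to redo that arithmetic for other model sizes, here is a quick sketch in Python. It only counts the raw weights (no activations, KV cache, or quantization scale factors), and the function and constant names are just mine for illustration.

```python
# Back-of-the-envelope storage for the raw weights of a 175B-parameter model.
PARAMS = 175e9

# Bytes per parameter for the formats mentioned above.
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}

def weights_gb(n_params, bytes_per_param):
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>9}: {weights_gb(PARAMS, nbytes):.0f} GB")
```

The same function gives a rough lower bound for any parameter count; real checkpoints and runtime memory will be somewhat larger.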

OpenAI generates responses so fast by doing the generation in parallel across something like 8x80GB A100s (I don't know the exact details of their hardware setup, but NVIDIA's open FasterTransformer library achieves low latency for large models this way).
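To make "generation in parallel across GPUs" a bit more concrete: one common way to split a single layer is to give each device a slice of the weight matrix's columns, let each compute its slice of the output, and concatenate. The toy below does that in pure Python with lists standing in for devices. It is only an illustration of the idea, not how OpenAI or FasterTransformer actually implement it, and all the names are hypothetical.

```python
def matvec(x, W):
    """y = x @ W for a vector x (length d_in) and W given as d_in rows."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def split_columns(W, n_shards):
    """Split W column-wise into n_shards equal slices, one per pretend device."""
    d_out = len(W[0])
    assert d_out % n_shards == 0, "columns must divide evenly across shards"
    step = d_out // n_shards
    return [[row[k * step:(k + 1) * step] for row in W] for k in range(n_shards)]

def parallel_matvec(x, W, n_shards):
    """Each shard computes its slice of the output; slices are concatenated."""
    out = []
    for shard in split_columns(W, n_shards):  # on real hardware these run concurrently
        out.extend(matvec(x, shard))
    return out

x = [1, 2, 3]
W = [[i * 8 + j for j in range(8)] for i in range(3)]  # 3x8 toy weight matrix
print(parallel_matvec(x, W, n_shards=8) == matvec(x, W))
```

Each shard only needs to hold its own slice of the weights, which is why eight 80GB cards (640GB total) can hold a 350GB FP16 model that no single card could.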

It'd be pretty surprising if you could quantize a text model and have it still work. It has to be using those lower bits to store text; it's not like you can round a letter up or down.

It's not storing any text? The weights are floating point numbers - the "text" is in some extremely high dimensional embedding space.

Of course it's storing text. GPT was trained for less than one epoch; they just continually throw new text in there and it mostly just remembers it (= learns it = compresses it). It's not simply "a high dimensional embedding" because words aren't differentiable; you'll get different words if you round off your "coordinates".

If you go to https://beta.openai.com/playground/ and prompt it "Read me the book Alice in Wonderland" it will quote you word for word the original book.

GPT's compression of text is a model of probabilities for the next token in a sequence, where a token is a bit of text from a vocabulary of ~50,000. You can definitely reduce the precision of the parameters that determine that model without hurting the model's overall accuracy much (consider truncating a probability like 98.0000001221151240690% to 98.0%).
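Here is a toy version of that argument in Python, in case it helps. It builds a tiny random "output layer" (8 hidden units, 16 tokens, both made-up sizes), rounds the weights to INT8 with simple absmax scaling, and compares the next-token probabilities before and after. This is not GPT and not any particular library's quantization scheme, just a sketch of why small rounding errors in weights lead to small changes in probabilities.

```python
import math
import random

def softmax(values):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def quantize_int8(weights):
    """Symmetric absmax quantization: w ~= scale * q, q an integer in [-127, 127]."""
    scale = max(abs(w) for row in weights for w in row) / 127
    q = [[round(w / scale) for w in row] for row in weights]
    return q, scale

def logits(x, W):
    """Logits for each token: x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

random.seed(0)
HIDDEN, VOCAB = 8, 16  # toy sizes, chosen only for illustration
x = [random.uniform(-1, 1) for _ in range(HIDDEN)]
W = [[random.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(HIDDEN)]

q, scale = quantize_int8(W)
W_deq = [[scale * v for v in row] for row in q]  # dequantized weights

p_full = softmax(logits(x, W))
p_int8 = softmax(logits(x, W_deq))
max_diff = max(abs(a - b) for a, b in zip(p_full, p_int8))
print(f"largest change in any token probability: {max_diff:.6f}")
```

The rounding error per weight is at most half a quantization step, so the logits, and therefore the probabilities, only move a little. Real models are much larger and need more care (outlier weights, per-channel scales), but the basic picture is the same.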

Empirically, people have quantized the weights of language models down to INT4 with very little loss in accuracy; see GLM-130B: https://arxiv.org/abs/2210.02414
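For a feel of what INT4 storage looks like mechanically, here is a minimal sketch: round each weight to one of 15 integer levels (-7..7) with a shared scale, then pack two 4-bit values into each byte. This is a generic absmax scheme I'm assuming for illustration; it is not claimed to be the exact method used in the GLM-130B paper, and the function names are made up.

```python
def quantize_int4(values):
    """Symmetric absmax quantization to integers in [-7, 7] with one shared scale."""
    scale = max(abs(v) for v in values) / 7
    return [round(v / scale) for v in values], scale

def pack_nibbles(q):
    """Pack pairs of 4-bit two's-complement values into single bytes."""
    assert len(q) % 2 == 0, "need an even number of values to pack"
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_nibbles(data):
    """Inverse of pack_nibbles: recover signed 4-bit integers."""
    q = []
    for byte in data:
        for nib in (byte & 0xF, byte >> 4):
            q.append(nib - 16 if nib >= 8 else nib)
    return q

weights = [0.62, -0.18, 0.05, -0.70, 0.33, 0.41]
q, scale = quantize_int4(weights)
packed = pack_nibbles(q)
restored = [scale * v for v in unpack_nibbles(packed)]
print(len(packed), "bytes for", len(weights), "weights")
```

At half a byte per weight, 175B parameters would come to roughly 87.5GB before counting the scale factors.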