| HN Mirror



	by moyix 1280 days ago
	One FP32 float (4 bytes) per parameter, so naively 175B * 4 bytes = ~700GB on disk. Most recent models are trained in FP16 or BF16, so 350GB. And there's some work on quantizing them to INT8, which knocks that down to a mere 175GB. You can definitely run it on a desktop computer using RAM and NVMe offload to make up for the fact that you probably don't have 175GB of GPU memory available, but it won't be fast: https://huggingface.co/blog/bloom-inference-pytorch-scripts OpenAI generates responses so fast by doing the generation in parallel across something like 8x80GB A100s (I don't know the exact details of their hardware setup, but NVIDIA's open FasterTransformer library achieves low latency for large models this way).
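
For anyone who wants to check the arithmetic above, here is a rough sketch. It counts weights only (no activations, KV cache, or framework overhead), uses 1 GB = 1e9 bytes, and takes the "8x80GB" figure from the comment as a guess at the hardware, not a known setup. The function and variable names are made up for illustration.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}

N_PARAMS = 175e9          # GPT-3-sized model, as in the comment
GPU_POOL_GB = 8 * 80      # assumed: 8 A100s with 80 GB each


def model_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes); ignores all runtime overhead."""
    return n_params * bytes_per_param / 1e9


sizes = {name: model_size_gb(N_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}

for name, gb in sizes.items():
    verdict = "fits" if gb <= GPU_POOL_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} in {GPU_POOL_GB} GB of GPU memory")
```

So under these assumptions the FP16 weights fit in an 8x80GB node with room to spare, while the FP32 copy would not.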

1 comments

astrange 1280 days ago

It'd be pretty surprising if you could quantize a text model and have it still work. It has to be using those lower bits to store text; it's not like you can round a letter up or down.

typon 1280 days ago

It's not storing any text? The weights are floating point numbers - the "text" is in some extremely high dimensional embedding space.

astrange 1280 days ago

Of course it's storing text. GPT was trained for less than one epoch; they just continually throw new text in there and it mostly just remembers it (= learns it = compresses it). It's not simply "a high dimensional embedding" because words aren't differentiable; you'll get different words if you round off your "coordinates".

If you go to https://beta.openai.com/playground/ and prompt it "Read me the book Alice in Wonderland" it will quote you word for word the original book.

moyix 1279 days ago

GPT's compression of text is a model of probabilities for the next token in a sequence, where a token is a bit of text from a vocabulary of ~50,000 (50,257 for GPT-2 and GPT-3). You can definitely reduce the precision of the parameters that determine that model without hurting the model's overall accuracy much (consider truncating a probability like 98.0000001221151240690% to 98.0%).
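
A toy illustration of that point, not anything resembling a real GPT: a single random output layer maps a hidden vector to next-token probabilities, once with full-precision weights and once with the weights rounded to INT8 using simple symmetric absmax quantization. The sizes (10 tokens, 8 dimensions), the random weights, and all the function names are assumptions made up for the sketch.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize_int8(weights):
    """Symmetric absmax quantization: integer codes in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [[round(w / scale) for w in row] for row in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [[c * scale for c in row] for row in codes]


def next_token_probs(weights, hidden):
    """One logit per vocabulary entry (dot product with hidden), then softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(logits)


rng = random.Random(0)
VOCAB, DIM = 10, 8  # toy sizes, nothing like a real model
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
hidden = [rng.uniform(-1, 1) for _ in range(DIM)]

codes, scale = quantize_int8(weights)
p_full = next_token_probs(weights, hidden)
p_int8 = next_token_probs(dequantize(codes, scale), hidden)

max_diff = max(abs(a - b) for a, b in zip(p_full, p_int8))
print(f"largest change in any token probability: {max_diff:.6f}")
```

The weights never store letters, only numbers that feed the probability calculation, so rounding them nudges the distribution slightly instead of turning one character into another. Whether that carries over to a 175B-parameter model is an empirical question this toy does not answer.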

Empirically, people have quantized the weights of language models down to INT4 with very little loss in accuracy; see GLM-130B: https://arxiv.org/abs/2210.02414
