| Deepseek v1 is ~670Bn params, which is ~1.4TB physical. All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all English electronic text publicly available would be O(100TB). So we're at about 1% of that in model size, and we're in a diminishing-returns area of training, i.e., going beyond 1% has not yielded improvements (cf. GPT-4.5 vs. 4o). This is why compute spend is moving to inference time with "reasoning" models. It's likely we're close to diminishing returns on inference-time compute now too, hence agents, whereby (mostly) deterministic tools supplement information/capability into the system. I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning. I'd guess 1TB of inference-time VRAM would be a reasonable medium-term target for high-quality open source models; that's within the reach of most SMEs today. That's about 250bn params. |
After that, make the robots explore and interact with the world by themselves, to fetch even more data.
In all seriousness, adding image and interaction data will probably be enormously useful, even for generating text.
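For anyone who wants to check the back-of-envelope numbers in the quoted comment, here is a rough sketch. The bytes-per-parameter values are my assumption, not something the comment states: ~670Bn params at ~1.4TB matches 2 bytes per parameter (16-bit weights), while "1TB is about 250bn params" matches 4 bytes per parameter, or 16-bit weights with roughly half the VRAM left as headroom. The function names are made up for illustration.

```python
def weights_size_tb(params_billion, bytes_per_param=2):
    """Approximate weight storage in decimal terabytes (1 TB = 1e12 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e12


def params_for_budget_bn(budget_tb, bytes_per_param=2):
    """Billions of parameters that fit in a memory budget, weights only."""
    return budget_tb * 1e12 / bytes_per_param / 1e9


# Figures taken from the quoted comment.
MODEL_PARAMS_BN = 670   # "~670Bn"
TEXT_CORPUS_TB = 100    # "O(100TB)" of English text
VRAM_BUDGET_TB = 1.0    # "1TB inference-time VRAM"

# Assumption: 2 bytes per parameter (16-bit weights).
model_tb = weights_size_tb(MODEL_PARAMS_BN, bytes_per_param=2)
share_of_corpus = model_tb / TEXT_CORPUS_TB

# Assumption: the two precisions below are illustrative, not from the comment.
fit_at_4_bytes = params_for_budget_bn(VRAM_BUDGET_TB, bytes_per_param=4)
fit_at_2_bytes = params_for_budget_bn(VRAM_BUDGET_TB, bytes_per_param=2)

print(f"model weights: {model_tb:.2f} TB")
print(f"share of text corpus: {share_of_corpus:.1%}")
print(f"params in 1 TB at 4 bytes each: {fit_at_4_bytes:.0f}bn")
print(f"params in 1 TB at 2 bytes each: {fit_at_2_bytes:.0f}bn")
```

This counts weights only; KV cache and activations need memory on top of that, which may be part of why the quoted estimate lands nearer 250bn than 500bn.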