
That's afaik already the size when using half precision. And it's not like GPT-3 doesn't get half of its queries wrong as-is in its optimal form; cutting it down any further will just make it unusable. Check out GPT-J if you want to see a functionally useless model that can be run locally.
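
For a rough sense of scale, here's a back-of-the-envelope sketch of how weight storage depends on precision. It assumes the commonly cited parameter counts (about 175B for GPT-3, about 6B for GPT-J) and counts only the raw weights, not activations or runtime overhead; the function name and the dtype table are just illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_size_gb(n_params, dtype):
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumed parameter counts (publicly cited, rounded).
MODELS = {"GPT-3": 175_000_000_000, "GPT-J": 6_000_000_000}

for name, n_params in MODELS.items():
    for dtype in BYTES_PER_PARAM:
        print(f"{name} {dtype}: {weight_size_gb(n_params, dtype):.0f} GB")
```

Going from fp32 to fp16 halves the footprint, but under these assumptions GPT-3 is still in the hundreds of GB at half precision, while GPT-J fits on a single large consumer GPU.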

It's not like this is something that can be surmounted without extra storage. Neural nets are just a fancy system for compression and data access: you can either go for less data or more compression, but either way your results won't ever be as good.