by varunkmohan 1277 days ago
A 1T model would be capable of much more than the current version of Copilot in terms of autocompletion and even code correction. However, at that point, even with a lot of model parallelism to speed up inference, it's likely to be at least 10x slower on the generation side. From my experience working on Codeium, a Copilot alternative, this would be too frustrating for users. It could be useful as a tool that runs asynchronously and modifies all your code at scale.
3 comments

Given how fast Copilot is (a few seconds), I wouldn't mind waiting 10x. I also wouldn't mind letting it run overnight for some tasks (e.g. write documentation, write tests, suggest bug fixes, etc.). I'd check on my buddy the next morning.
I think the UX of large suggestions will require a lot of thinking and experimentation. That's because the longer the output of such a model, the higher the risk of it making some mistake. For short completions, it's often easy to tell mistakes from useful suggestions (though sometimes subtle bugs slip in). But for longer completions, it'll get tedious and we might start accepting wrong suggestions.
That sounds like modern-day outsourcing.
It could be interesting if it was an alternative that a user could query. I could imagine someone starting to write a new function might be willing to wait 10x more time to get something better.
Very true. I think the issue, though, is that unless that output is very likely to be 100% correct, a user would always prefer something that is incomplete but quicker to iterate on. It would be interesting to see if we can get to a paradigm like that.
Though isn't it highly likely that core devs working at the big tech giants have access to 10x-100x faster compute, e.g. some secret TPU successor at Google?
The magic number for performance is actually memory bandwidth, which is lower for TPUs than for A100s. They have more aggregate compute, but it's not trivial to use that to get very low latency on a per-request basis.
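To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes that decoding one token requires reading every weight from memory once, so per-token time is bounded below by weight bytes divided by aggregate memory bandwidth. The function name, the 12B "small model" size, the 8-device split, and the roughly 2 TB/s per-device bandwidth are all illustrative assumptions, not measurements of any real deployment, and the bound ignores interconnect and batching effects.

```python
def min_seconds_per_token(n_params, bytes_per_param, n_devices, bandwidth_bytes_per_s):
    """Lower bound on per-token decode time, assuming every weight
    is read from memory once per generated token (hypothetical model)."""
    weight_bytes = n_params * bytes_per_param
    aggregate_bandwidth = n_devices * bandwidth_bytes_per_s
    return weight_bytes / aggregate_bandwidth


# Illustrative numbers only (assumptions, not benchmarks):
# 2 bytes per parameter (16-bit weights), ~2e12 bytes/s per device.
small = min_seconds_per_token(12e9, 2, n_devices=1, bandwidth_bytes_per_s=2e12)
large = min_seconds_per_token(1e12, 2, n_devices=8, bandwidth_bytes_per_s=2e12)

print(f"hypothetical 12B model, 1 device: {small * 1e3:.1f} ms/token")
print(f"hypothetical 1T model, 8 devices: {large * 1e3:.1f} ms/token")
print(f"slowdown: {large / small:.1f}x")
```

Under these assumptions, even an 8-way split of the 1T model ends up roughly an order of magnitude slower per token than the small model on one device, and extra compute alone does not change that bound.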
But they very likely have internal prototypes with higher bandwidth and lower latency. Also, with distilled latent diffusion one can probably generate text(-images) much faster anyhow, as it could produce long chunks of text at once rather than needing to recurrently feed each new token back into the inputs.
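A toy way to see the argument about chunked generation: count the forward passes that have to run one after another. This is only a sketch under assumed numbers. The function name, the chunk size of 64, and the 4 denoising steps are hypothetical, and it says nothing about output quality or the cost of each pass.

```python
import math


def sequential_passes(n_tokens, tokens_per_pass=1, steps_per_chunk=1):
    """Number of forward passes that must run back to back.

    tokens_per_pass=1 models token-by-token autoregressive decoding;
    larger values model a (hypothetical) generator that emits a chunk
    at once, refined over steps_per_chunk denoising steps.
    """
    n_chunks = math.ceil(n_tokens / tokens_per_pass)
    return n_chunks * steps_per_chunk


autoregressive = sequential_passes(256)
chunked = sequential_passes(256, tokens_per_pass=64, steps_per_chunk=4)

print(f"autoregressive: {autoregressive} sequential passes")
print(f"chunked, 4 steps per chunk: {chunked} sequential passes")
```

Whether this translates into lower wall-clock latency depends on how expensive each chunked pass is, which this count does not capture.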