Hacker News
by switchbak 6 days ago
Now the next bottleneck is the compiler - which we can model in an LLM! It's only wrong 15% of the time :)

But truly, using Cerebras at ~2k tokens/s, with very low latency, is like a glimpse of the future. You start to rework your workflow around things that can happen without onerous manual review: stating the conditions for success, etc. It's rare that I have a problem that maps well to that, but I expect this is where things are headed.

Of course, the fast models tend not to be the SOTA ones, but if they were (high quality and near-instant thinking), that would be a game changer that I don't think we're really prepared for. The things that get unlocked with higher-than-reasonable speed become very interesting.


Have you tried https://chatjimmy.ai/? It’s only a demo, but it blew my mind. I had the sudden feeling that this is the future.
What do you mean "demo"? Seems to work... Who is behind this?
It's a 3-bit quant of Llama3-8B. I'm sure there are use cases for that, but it's useless when it comes to tool calls or coding, and I wouldn't trust its factual accuracy either.