by pjmlp 131 days ago
Anyone doing benchmarks with managed runtimes, or serverless, knows it isn't quite true.

Which is exactly what the AOT-only, no-GC crowds use as an example of why their approach is better.


Reproducible builds exist. AOT/JIT and GC are just not very relevant to this issue; I'm not sure why you brought them up.
Because they are dynamic compilers!
But there is functional equivalence. While I don't want to downplay the importance of performance, we're talking about something categorically different when comparing LLMs to compilers.
Not when those LLMs are tied to agents, replacing what would be classical programming.

For example, low-code platforms with AI-based automations, which most iPaaS offerings now provide.

If the agent is able to retrieve the required data from a JSON file, fill in an email with the proper subject and body, and send it to another SaaS application, that is one less piece of integration middleware that has to be written.

From any practical business point of view, it is an application.
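
For comparison, here is a minimal sketch of the kind of hand-written integration code such an agent would replace. The JSON field names (`recipient`, `name`, `order_id`) and the message wording are hypothetical, chosen only for illustration, and the delivery step to the other SaaS application is left out.

```python
import json
from email.message import EmailMessage


def build_email(json_text: str) -> EmailMessage:
    """Turn one JSON record into an email message.

    The field names used here are hypothetical, not taken from any real system.
    """
    record = json.loads(json_text)

    msg = EmailMessage()
    msg["To"] = record["recipient"]
    msg["Subject"] = f"Order {record['order_id']} shipped"
    msg.set_content(
        f"Hello {record['name']},\n\n"
        f"Your order {record['order_id']} has shipped.\n"
    )
    return msg


# Example input standing in for the JSON file mentioned above.
sample = '{"recipient": "ops@example.com", "name": "Ada", "order_id": 42}'
message = build_email(sample)
```

Given the same JSON, this code always produces the same subject and body, which is the property the rest of the thread is arguing about.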

Even those are way more predictable than LLMs, given the same input. But more importantly, LLMs aren’t stateless across executions, which is a huge no-no.
> But more importantly, LLMs aren’t stateless across executions, which is a huge no-no.

They are, actually. A "fresh chat" with an LLM is non-deterministic but also stateless. Of course agentic workflows add memory, possibly RAG etc. but that memory is stored somewhere in plain English; you can just go and look at it. It may not be stateless but the state is fully known.

Using the managed runtime analogy, what you are saying is that, if I wanted to benchmark LLMs like I would do with runtimes, I would need to take the delta between versions, plus that between whatever memory they may have. I don’t see how that helps with reproducibility.

Perhaps more importantly, how would I quantify such “memory”? In other words, how could I verify that two memory inputs are the same, and how could I formalize the entirety of such inputs with the same outputs?
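
The narrowest reading of "the same" can at least be checked mechanically. A rough sketch, assuming the memory can be exported as a JSON-serializable dictionary (the `memory_fingerprint` helper and the sample contents are hypothetical):

```python
import hashlib
import json


def memory_fingerprint(memory: dict) -> str:
    """Hash a canonical JSON encoding of the memory.

    Sorting keys and fixing separators makes the encoding independent of
    dictionary insertion order, so equal contents give equal digests.
    """
    canonical = json.dumps(
        memory, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical memory snapshots.
a = {"notes": ["user prefers metric units"], "turns": 3}
b = {"turns": 3, "notes": ["user prefers metric units"]}   # same content, different key order
c = {"turns": 3, "notes": ["User prefers metric units."]}  # same meaning, different text

same_ab = memory_fingerprint(a) == memory_fingerprint(b)
same_ac = memory_fingerprint(a) == memory_fingerprint(c)
```

Here `same_ab` is true and `same_ac` is false. That only establishes textual equality; it says nothing about whether two differently worded memories lead to the same outputs, which is the part I don't see how to formalize.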

Are you certain you can predict the JIT-generated machine code given the JVM bytecode?

Without taking into account anything else that the JIT uses in its decision tree?

For a single execution, to a certain extent, yes.

But that’s not the point I’m trying to make here. JIT compilers are vastly more predictable than LLMs. I can take any two JVMs from any two vendors, and over several versions and years, I’m confident that they will produce the same outputs given the same inputs, to a certain degree, where the input is not only code but GC, libraries, etc.

I cannot do the same with two versions of the same LLM offering from a single vendor, released one year apart.

Good luck mapping OpenJDK to Azul's cloud JIT in terms of generated machine code.
The output is the actual program output, not the bytecode. No one is arguing that in the scope of LLMs.
Enough so that I've never had a runtime issue because the compiler did something odd once and something correct the next time. At least in C#. If Java is doing that, then stop using it...

If the compiler had an issue like LLMs do, then half my builds would be broken, running the same source.
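
The check I have in mind is simple to sketch: run the same source through a producer several times and count the distinct outputs. Both producers below are hypothetical stand-ins, not a real compiler or a real LLM API; the second one just appends a random draw to imitate sampling.

```python
import hashlib
import random


def distinct_outputs(produce, source: str, runs: int = 5) -> int:
    """Run `produce` on the same source `runs` times and count distinct results."""
    digests = {
        hashlib.sha256(produce(source).encode("utf-8")).hexdigest()
        for _ in range(runs)
    }
    return len(digests)


def deterministic_build(source: str) -> str:
    # Stand-in for a compiler: output depends only on the input.
    return source.upper()


_rng = random.Random(0)  # seeded so this sketch itself is repeatable


def sampling_generator(source: str) -> str:
    # Stand-in for a sampling-based generator: output varies between calls.
    return f"{source}#{_rng.random()}"


stable = distinct_outputs(deterministic_build, "print('hi')")
unstable = distinct_outputs(sampling_generator, "print('hi')")
```

With these stand-ins, `stable` is 1 and `unstable` is greater than 1. A build pipeline that behaved like the second case would be the broken-builds scenario described above.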