| HN Mirror

by iLoveOncall 49 days ago

This is just a hallucination benchmark on a subset of outputs; I'm not sure it adds any value over general hallucination benchmarks.

> Our goal is to be the best general model for deterministic tasks

I'm sorry but this simply doesn't make sense. If you want a deterministic output don't use an LLM.

2 comments

nemo1618 49 days ago

LLMs are not inherently non-deterministic. This is a common misconception. You used to be able to set temp=0 and a fixed seed and get the same output every time. This broke when labs started implementing batching, and no one bothered fixing it because the benefits of batching vastly outweighed the demand for deterministic output.

I am hopeful deterministic output will return, though: DeepSeek v4 claims to have implemented "bitwise batch-invariant and deterministic kernels," but I haven't tested it myself.
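
A minimal sketch of how batching can break reproducibility even at temp=0, assuming the cause is that a different batch size changes the order in which floating-point partial sums get reduced. The `greedy_pick` helper and the toy logits are hypothetical, not taken from any real inference stack. Greedy decoding is a plain argmax, so it is fully determined by the logits, but floating-point addition is not associative, so a different reduction order can nudge a logit and flip a near-tie:

```python
# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3   # one reduction order
b = 0.1 + (0.2 + 0.3)   # same numbers, different order
print(a == b)  # False


def greedy_pick(logits):
    """temp=0 decoding: take the index of the highest logit (first one on ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical near-tie between two tokens. Token 1's logit is the same
# mathematical sum in both cases, just accumulated in a different order.
pick_order_1 = greedy_pick([0.6, a])
pick_order_2 = greedy_pick([0.6, b])
print(pick_order_1, pick_order_2)  # 1 0
```

Once one token flips, everything generated after it diverges too, which is why tiny numeric differences show up as visibly different completions.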

iLoveOncall 49 days ago

> LLMs are not inherently non-deterministic.

Reproducible does not mean deterministic. You cannot determine in advance what a prompt will give as output, even with a temperature of 0 and a fixed seed; therefore they are not deterministic.

nemo1618 48 days ago

Huh? I'm not aware of anyone else who defines "deterministic" that way. "Deterministic" comes from "determinism," as in "the effects are fully determined by the causes" -- not "determine" as in "deduce."
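
To make the distinction concrete, here is a small sketch that uses a hash function as a stand-in. This is an analogy picked for illustration, not a claim about how LLMs work internally, and the `digest` helper is a hypothetical name. SHA-256 is deterministic in the "fully determined by the causes" sense, since the same input always yields the same digest, yet in practice you only learn the digest by running the computation:

```python
import hashlib


def digest(prompt: str) -> str:
    """Deterministic: the output depends only on the input bytes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


first = digest("hello")
second = digest("hello")
other = digest("hello!")

print(first == second)  # True: same cause, same effect
print(first == other)   # False: a one-character change gives a different digest
```

Nobody would call SHA-256 non-deterministic just because its output can't be guessed ahead of time.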

sroussey 49 days ago

Thinking Machines Lab uses batch-invariant kernels, btw.

khurdula 49 days ago

General hallucination benchmarks tend to be knowledge-specific, like GPQA or MMLU, but none specifically measures structured output end-to-end, which is one of the biggest use cases for LLMs.

Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.

> "don't use an LLM"

Partially agree. That's what we're building towards at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting high-quality structured output like you would get from a CNN model like EasyOCR.

The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark allows us to measure it now.
