| HN Mirror



	by yigitkonur35 639 days ago
	I get your worries about LLMs and their consistency problems. But I think we can fix a lot of that using LLMs themselves for checks. If you're after top-notch accuracy, you could throw in another prompt, add some visual and text input, and double-check that nothing's lost in translation. The cheaper models are actually great for this kind of quality control. LLMs have come a long way since they first showed up, and I reckon they've stepped up their game enough to shake off that bad rap for giving mixed signals.
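A minimal sketch of the "double-check that nothing's lost" idea, assuming you already have two independent transcriptions of the same page as plain strings (say, one from the main extraction and one from a cheaper checker model). The function name `flag_disagreements` and the sample strings are made up for illustration; no model is called here. It only shows how disagreements between two passes could be surfaced for review:

```python
from difflib import SequenceMatcher


def flag_disagreements(primary: str, checker: str) -> list[tuple[str, str, str]]:
    """Return (tag, primary_span, checker_span) for each span where the
    two transcriptions differ, compared word by word."""
    a, b = primary.split(), checker.split()
    # autojunk=False so frequent words are never silently ignored
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            issues.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return issues


if __name__ == "__main__":
    # Hypothetical outputs from two passes over the same invoice line
    first_pass = "Total due 1,250.00 USD"
    second_pass = "Total due 1,280.00 USD by 30 June"
    for issue in flag_disagreements(first_pass, second_pass):
        print(issue)
```

An empty result only means the two passes agree with each other. It does not prove either one matches the source document, so disagreements are best treated as pointers for manual review.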

2 comments

Oras 639 days ago

How would you know something is missing?

I tried multiple OCR tools before, and it’s hard to tell whether the output is accurate except by comparing it manually.

I created a tool to visualise the output of OCR [0] to see what’s missing, and there are many cases that would be quite concerning, especially when working with financial data.

This tool wouldn’t work with LLMs, as they don’t return character-level recognition data (to my knowledge), which makes them harder to evaluate at scale.

If I wanted to use LLMs for the task, I would use them to help train an ML model to do OCR better, for example by generating thousands of synthetic samples to train on.

[0] https://github.com/orasik/parsevision


yigitkonur35 639 days ago

Wow, you knocked it out of the park! I'll be sure to use this when I tackle that evaluation.


whiplash451 639 days ago

If you can use an LLM for sanity checking, why can’t you use it for extraction in the first place?


ithkuil 638 days ago

Because models currently output a stream of tokens directly, and tokens are the unit of both performance and billing. Better models can do a better job at producing reasonable output, but there is a limit to what can be done "on the fly".

Some models, like OpenAI o1, have started employing internal "thinking" tokens, which may or may not be equivalent to performing multiple passes with the same or different models, but it has a similar effect.

One way to look at it is that if you want better results, you have to put more computational resources into thinking. Also, just like with humans, a team effort produces more well-rounded results, because you combine the strengths and offset the weaknesses of different team members.

You can technically wrap all this into a single black box and have it converse with you as if it were one single entity that internally uses multiple models to think, cross-check, etc. The output is likely not going to be in real time, though, and real-time conversation has until now been a very important feature.

In the future we may, on the one hand, relax the real-time constraint and accept that for some tasks accuracy is more important than real-time results.

Or we may eventually have faster machines or cleverer algorithms that can "think" more in less time.
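A rough sketch of that "single black box" idea, assuming each model is just a Python callable that takes a prompt and returns a string. The names `ensemble_answer` and `min_agreement` are hypothetical, and the stub lambdas stand in for real model calls. It cross-checks by simple majority vote, which is only one of many ways the internal checking could work:

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Assumption: a "model" is any function mapping a prompt to an answer string
Model = Callable[[str], str]


def ensemble_answer(
    models: Sequence[Model], prompt: str, min_agreement: float = 0.5
) -> tuple[Optional[str], float]:
    """Ask every model, then return (majority answer, agreement ratio).
    The answer is None when agreement does not exceed min_agreement."""
    if not models:
        raise ValueError("at least one model is required")
    answers = [model(prompt).strip() for model in models]
    best, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return (best if agreement > min_agreement else None), agreement


if __name__ == "__main__":
    # Stub models standing in for real (slow, billed) model calls
    stubs = [lambda p: "42", lambda p: "42 ", lambda p: "41"]
    print(ensemble_answer(stubs, "What is the invoice total?"))
```

Every extra model adds a full round trip before the caller sees anything, which is the latency cost described above.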

(Or a combination of the two)
