by retrac98 826 days ago
There are so many applications for LLMs where having a perfect score is much more important than speed, because getting it wrong is so expensive, damaging, or time consuming to resolve for an organisation.
6 comments

If you need a perfect score, don't use LLMs. This seems obvious to me, even given the state-of-the-art LLMs. I am a heavy user of GPT-4 and I wouldn't bet $1000 on it being 100% reliable for any non-trivial task.
They'll get better. Humans are far from perfect, and I have no doubt that LLMs will eventually outperform them for non-trivial tasks consistently.
Maybe so, but at this stage I wouldn't be betting a business model on it.
Businesses do bet on imperfect and even criminal models all the time (way before LLMs existed)... they call it cost of doing business when they get it wrong or get caught.
> Humans are far from perfect

Humans running multishot with a mixture of experts are close to perfect. You can't compare a multishot mixture-of-experts AI to a single human; humans don't work in isolation.

Machine learning models will get better for sure. We don't know if LLMs are the end game, though, and it's not certain that this particular technique is what we'll need to reach the next level.
Or they might not get better. It could be that we are at a local optimum for that sort of thing, and major improvements will have to wait (perhaps for a very long time) for radical new technologies.
Maybe, but it certainly hasn’t been the arc of the past few years. I don’t know how anyone could look at this and assume that it’s likely to slow down.
They already have superhuman image classification performance.
I remember talking to a radiologist about ten years ago who said he was sure something like this was coming: instead of a radiologist looking at scans manually, a machine would go through a lot of images and flag some for manual review.

We haven't even gotten there yet, have we?

Yes, we absolutely are there: https://youtu.be/D3oRN5JNMWs?feature=shared

My professor (Sir Michael Brady) at university 14 years ago set up a company to do this very thing, and he already had reliable models back before 2010. I believe their company was called Oxford Imaging or something similar.

Yep, everyone seems to forget that ML was available before 2021. Had a conversation recently with my former colleague who learned about some plastic packaging company which used "AI" to predict client orders and inform them about scheduling implications. When I told him that you don't need Transformers and 30GB models for that, he was quasi-confused, cause he kinda knew it but the hype just overtook his knowledge.
> We haven't even gotten there yet, have we?

Yes and no. Countless teams have solved exactly this problem at universities and research groups across the world. Technically it's pretty much a solved problem. The hard part is getting the systems out of the labs, certified as an actual product, and convincing hospitals and doctors to actually use them.

Maybe it's a liability issue, not a competency issue.
Until a single pixel makes a cat a dog or something like that.
Changing a single pixel is usually not enough to confuse convolutional neural networks. Even so, human supervision will probably always be quite important.
I've tried to apply it to parsing HTML, as in this article, as part of a pretty long pipeline. I'm using DeepInfra with Mistral 8x7B and I'm still unsure if I'm going to use it in production.

The problem I'm finding is that the time I wanted to save maintaining selectors and the like is time that I'm now spending writing wrapper code and dealing with the mistakes it makes. Some are OK and I can deal with them; others are pretty annoying because it's difficult to deal with them in a deterministic manner.

I've also tried with GPT-4 but it's way more expensive, and despite what this guy got, it also makes mistakes.

I don't really care about inference speed, but I do care about price and correctness.

Might be a silly question, but if you want determinism in this, why don't you get the LLM to write the deterministic code, and use that instead? Interesting experiment, though!

In fact, what about a hybrid of what you're doing now? Initially, you use an LLM to generate examples. And then from those examples, you use that same LLM to write deterministic code?

Have you tried swapping Mistral 8x7B with either command-r 34B, Qwen 1.5 70B, or miqu 70B? Those are all superior in my experience, though suited for slightly different tasks, so experimentation is needed.
Parsing HTML and tagsoup is IMHO not the right application for LLMs, since these are ultimately structured formats. LLMs are for NLP tasks, like extracting meaning out of unstructured and ambiguous text. The computational cost of an LLM chewing through even a moderately sized document can be more efficiently spent on sophisticated parser technologies that have been around for decades, which can also to a degree deal with ambiguous and irregular grammars. LLMs should be able to help you write those.
Yeah I agree - just an hour ago I was dealing with an LLM that was missing a "not" thus inverting the meaning of a rather important simulation parameter!
It makes much more sense to me to have the LLM infer the correct query for extracting data on the page. That is much faster and more reliable, and it wouldn't really be a problem to have a human in the loop every now and then.
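
A minimal sketch of that idea, under loud assumptions: the LLM proposes an extraction pattern once, the pattern is checked against a few hand-labelled examples, and only an accepted pattern is used for deterministic extraction afterwards. The regex is just a stand-in for whatever query language you actually use (CSS selector, XPath), and `proposed` is a hypothetical model output, not something from the thread.

```python
import re
from typing import Optional

def accept_pattern(pattern: str, examples: list[tuple[str, str]]) -> bool:
    """Return True only if the pattern extracts the expected value from every example."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return False  # the proposed pattern is not even valid
    if compiled.groups < 1:
        return False  # we need one capture group holding the extracted value
    for html, expected in examples:
        match = compiled.search(html)
        if match is None or match.group(1) != expected:
            return False
    return True

def extract(pattern: str, html: str) -> Optional[str]:
    """Deterministic extraction with an already accepted pattern."""
    match = re.search(pattern, html)
    return match.group(1) if match else None

# Hand-labelled examples (hypothetical pages and expected values).
EXAMPLES = [
    ('<span class="price">12.50</span>', "12.50"),
    ('<div><span class="price">3.00</span></div>', "3.00"),
]

# Hypothetical pattern, as if returned by the LLM.
proposed = r'<span class="price">([^<]+)</span>'

if __name__ == "__main__":
    if accept_pattern(proposed, EXAMPLES):
        print(extract(proposed, '<span class="price">7.25</span>'))
    else:
        print("pattern rejected, ask the LLM (or a human) again")
```

The LLM call happens once per page layout rather than once per page, and a rejected pattern is the natural point to put the human in the loop.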
All the places I see AI being applicable to my work don't require a perfect score, and a threshold is actually much more useful, especially where multiple factors come together to make evaluation to a single value hard.
If you have speed you can generate multiple answers and have another model pick the best one.
If I ask an LLM a very complex and specific question 500 times, if it just doesn't know the facts you'll still get the wrong answer 500 times.

That's understandable. The real problem is when the AI lies/hallucinates another answer with confidence instead of saying "I don't know".

The problem is asking for facts. LLMs are not a database: they know stuff, but it is compressed, so expect wrong facts, wrong names, wrong dates, wrong anything.

We will need an LLM as a front end that generates a query to fetch the facts from the internet or a database, then maybe formats the facts for your consumption.

This is called Retrieval Augmented Generation (RAG). The LLM driver recognizes a query, it gets sent to a vector database or to an external system (could be another LLM...), and the answer is placed in the context. It's a common strategy to work around their limited context length, but it tends to be brittle. Look for survey papers.
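
A toy sketch of the retrieve-then-prompt step described here, with the assumptions stated up front: plain word overlap stands in for the vector database, the documents are made up, and no model is called. It only shows how retrieved text ends up in the context.

```python
import re

# Hypothetical knowledge base; a real system would use a vector database.
DOCUMENTS = [
    "The office printer is on the third floor.",
    "The cafeteria opens at 8 am.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Place the retrieved text in the context ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(build_prompt("When does the cafeteria open?", DOCUMENTS))
```

The brittleness mentioned above shows up in `retrieve`: if the wrong document ranks first, the model is handed the wrong facts with full confidence.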
That's exactly it. It's OK for LLMs to not know everything, because they _should_ have a means to look up information. What are some projects where this obvious approach is implemented/tried?
But then you need an LLM that can separate grammar from facts. Current LLMs don't know the difference, and that is the main source of these issues: these models treat facts like grammar. That worked well enough to excite people, but it probably won't get us to a good state.
The weird problem with LLM hallucinations is that it usually will acknowledge its mistake and correct itself if you call it out. My question is why can't LLMs include a sub-routine to check itself before answering. Simply asking itself something like "this answer may not be correct, are you sure you're right?"
> The weird problem with LLM hallucinations is that it usually will acknowledge its mistake and correct itself if you call it out.

From what I've tested, all of the current models will see a prompt like "are you sure that's correct" and respond "no, I was incorrect [here's some other answer]", irrespective of the accuracy of the original statement.

In my experience the corrections can be additional hallucinations, one after another, even after pointing out inaccuracies multiple times in a row.
> My question is why can't LLMs include a sub-routine to check itself before answering.

Because LLMs don't work in a way for that to be possible if you operate them on their own.

Here is the debug output of my local instance of Mistral-Instruct 8x7B. The prompt from me was 'What is poop spelled backwards?'. It answered 'puoP'. Let's see how it got there starting with it processing my prompt into tokens:

   'What (3195)', ' is (349)', ' po (1627)', 'op (410)', ' sp (668)', 'elled (6099)', ' backwards (24324)', '? (28804)', '\n (13)', '### (27332)', ' Response (12107)', ': (28747)', '\n (13)',
It tokenized 'poop' as two tokens: 'po', number 1627, and 'op', number 410.

Next it comes up with its response:

   Generating (1 / 512 tokens) [(pu 4.43%) (The 66.62%) (po 11.96%) (p 4.99%)]
   Generating (2 / 512 tokens) [(o 89.90%) (op 10.10%)]
   Generating (3 / 512 tokens) [(P 100.00%)]
   Generating (4 / 512 tokens) [( 100.00%)]
It picked 'pu' even though that token had only a ~4% probability, then instead of picking 'op' it picked 'o'. The last token had a 100% probability of being 'P'.

   Output: puoP
At no time did it write 'puoP' as a complete word nor does it know what 'puoP' is. It has no way of evaluating whether that is the right answer or not. You would need a different process to do that.
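
To make the sampling step concrete, here is a small sketch that draws the first token from the candidate probabilities in the debug output above. It is not the actual sampler of any inference engine, just weighted random choice; the function names are mine, and the listed candidates do not sum to 100% because the log only shows some of them.

```python
import random

# First-token candidates and probabilities (in %) copied from the debug output.
# They do not sum to 100 because only the listed candidates are shown there.
CANDIDATES = {"pu": 4.43, "The": 66.62, "po": 11.96, "p": 4.99}

def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

def sample_counts(probs: dict[str, float], n: int, seed: int = 0) -> dict[str, int]:
    """Sample n times and count how often each token is picked."""
    rng = random.Random(seed)
    counts = {token: 0 for token in probs}
    for _ in range(n):
        counts[sample_token(probs, rng)] += 1
    return counts

if __name__ == "__main__":
    # 'The' wins most of the time, but 'pu' still gets picked now and then.
    print(sample_counts(CANDIDATES, 10_000))
```

Each token is chosen one at a time like this, so nothing in the loop ever looks at the finished string and asks whether it is right.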
The problem is that if you call it out, it will frequently change its answer, even if it was correct. LLMs currently lack chutzpah.
They definitely stand their ground if they were aligned to do so.
But then they stand their ground when wrong too.
That is a common bullshitting strategy: talk a lot of bullshit, then backtrack and acknowledge you were wrong when people push back. That way they will think you know way more than you do. Many people will see through that, but most will just think you are a humble expert who can acknowledge when you are wrong, instead of noticing that you always acknowledge being wrong, even when you aren't.

People have a really hard time catching such bullshitting from humans, which is why free-form interviews don't work.

It's because there's no entity that is actually acknowledging anything. It's generating an answer to your prompt. You can gaslight it into anything being wrong or correct.
They simply don't work that way. You are asking it for an answer, it will give you one since all it can do is extrapolate from its training data.

Good prompting and certain adjustments to the text generation parameters might help prevent hallucinations, but it's not an exact science since it depends on how the model was trained. Also, an LLM's training data, frankly, contains a lot of bulls*t.

> If I ask an LLM a very complex and specific question 500 times, if it just doesn't know the facts you'll still get the wrong answer 500 times.

I think the commenter meant using another model/LLM, which could give a different answer, and then letting them vote on the result, like "old-fashioned AI" did with ensemble learning.
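
A rough sketch of that voting idea, with the assumptions stated plainly: the "models" are stub functions standing in for calls to different LLMs, answers are compared after simple normalisation, and `min_share` is a made-up threshold for how much agreement counts as a result.

```python
from collections import Counter
from typing import Callable, Optional

Model = Callable[[str], str]

def majority_vote(question: str, models: list[Model], min_share: float = 0.5) -> Optional[str]:
    """Ask every model, then return the most common answer if enough of them agree."""
    answers = [model(question).strip().lower() for model in models]
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    # Require strictly more than min_share of the models to agree.
    return answer if count / len(answers) > min_share else None

# Stub models (hypothetical); real ones would call different LLMs.
def model_a(question: str) -> str:
    return "4"

def model_b(question: str) -> str:
    return " 4 "

def model_c(question: str) -> str:
    return "5"

if __name__ == "__main__":
    print(majority_vote("What is 2 + 2?", [model_a, model_b, model_c]))
```

Voting only helps when the models make different mistakes. If they all learned the same wrong fact, they will agree on it, which is the point made in the quoted comment.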