by d0mine 746 days ago
On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Are we sure these exams are not present in the training data? (ability to recall information is not impressive for a computer)

Still, I'm terrible at many tasks (e.g., drawing from a description), and the models significantly widen the types of problems that I can even attempt (where results can be verified easily and no precision is required).

7 comments

> On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

That's probably true, which is why most human knowledge workers aren't going away any time soon.

That said, I have better luck with a different approach: I use LLMs to learn things that I don't already understand well. This forces me to actively understand and validate the output, rather than consume it passively. With an LLM, I can easily ask questions, drill down, and try different ideas, like I'm working with a tutor. I find this to be much more effective than traditional learning techniques alone (e.g. textbooks, videos, blog posts, etc.).

Might be better to think of the LLM as the student, and you're an imposter tutor. You're trying to assess the kid's knowledge without knowing the material yourself, but the kid is likely to lie when he doesn't know something to try to impress you, hoping that you don't know your stuff either. So you just have to keep probing him with more questions to suss out if he's at least consistent.
I would classify all of those as "non-traditional" learning techniques, unless you actually mean using a textbook while taking a class with a human teacher.

Well-written textbooks are consumable on their own for some people, but most are not written for that.

That's a good observation about textbooks and helps explain why I had difficulties trying to teach myself topics from a textbook alone!
A lot just aren't very good but they also tend to make assumptions about prior knowledge in line with what would be typical prerequisites for a class and some degree of guidance.
I've had teachers who didn't understand the subject they were teaching. It's not a good experience and replicating that seems like a terrible idea.
A key advantage is that LLMs don't have emotional states that need to be managed.
It depends on the topic (and the LLM - ChatGPT-4 equivalent at least, any model equivalent to 3.5 or earlier is just a toy in comparison) - but I've had plenty of success using it as a productivity enhancing tool for programming and AWS infrastructure, both to generate very useful code and as an alternative to Google for finding answers or at least a direction to answers. But I only use it where I'm confident I can vet the answers it provides.
> On any topic that I understand well, LLM output is garbage

I've heard that claim many times, but never is there any specific follow-up on which topics they mean. Of course, there are areas like math and programming where LLMs might not perform as well as a senior programmer or mathematician, sometimes producing programs that do not compile or incorrect calculations/ideas. However, this isn't exactly "garbage" as some suggest. At worst, it's more like a freshman-level answer, and at best, it can be a perfectly valid and correct response.

> At worst, it's more like a freshman-level answer

That is garbage.

I hope you don't hold a teaching position at a university then.
I did. The growth students have from first to second year is enormous. Everyone knows freshmen produce garbage answers; that is why they are freshmen and not out doing work. They are there to learn, not to produce answers. If freshman answers were good enough, people wouldn't bother hiring college grads; they would just hire dropouts and high school grads.

> I hope you don't hold a teaching position at a university then.

You think teachers shouldn't have a growth mindset for students? I think students can grow from producing garbage answers to good answers; that is what they are there for. An LLM, however, doesn't grow, so while such students are worth teaching even though they produce garbage answers, the LLM isn't.

> You think teachers shouldn't have growth mindset for students? I think students can grow from producing garbage answers to good answers, that is what they are there for.

I think many students, including freshmen, have interesting and sometimes thought-provoking ideas. And they come up with creative solutions, which are based on their previous experience in life. I would never describe that as garbage.

On what topics you understand well does GPT-4o or Claude Opus produce garbage?
I do run into the issue where the longer the conversation goes the more inaccurate the information.

But a common situation is that with code generation it will fail to understand the context of where the code belongs and so it's a function that will compile but makes no sense.

Yeah. I often springboard into a new context by having the LLM compose the next prompt based on the discussion and restart the context. Remarkably effective if you ask it to incorporate “prompt engineering” terms from research.
Anything deeper than surface level in medicine.

Try getting it to properly select crystalloids with proper additives for a patient with a given history and lab results and watch in horror as it confidently gives instructions that would kill the patient.

What is even more irritating is that I had gpt4 debate me on things that it was completely wrong about and it was only when I responded with a stern rebuke that it hit me with the usual "Apologies for the misunderstanding..."

LLMs are not good at answering expert level questions at the forefront of human knowledge.
Unfortunately it would be considered basic medicine in this case.
Is it basic but not documented? Basic to me means the first google search result is generally correct.
That's not how medicine operates.

Medical problems are highly contextual, so you are not going to get much valuable information at the level of what a doctor is thinking from the first page of Google. That doesn't mean it isn't simple within our area of expertise.

To be fair, I have not found MDs to be particularly reliable for answering basic questions about medicine either.
OK. I can't speak for what you've experienced. I can only offer what I see from LLMs given what I know.
High school math problems.
I suspect by garbage you mean not perfect.

To be more precise can you please give a topic you know well and your % guess how often the answers are wrong on the topic?

I would take their meaning as 'contains enough errors to not be useful', which doesn't need a very high percentage of wrong answers.
Even better, looks right, might even compile, but will be doing the subtly (or obviously) wrong thing.
Functional linear analysis - it has a tendency to produce a proof for unprovable statements; the proofs will be logically argued and well structured, and step 8 will have a statement that is obvious nonsense even to a beginning student like me. The professor, on the other hand, will ask why I'm trying to prove the false statement and expertly help me find my logic error.
Specifics like this make it much easier to agree on LLM capabilities, thank you.

Automatic proof generation is a massive open problem in all of computer science and not close to being solved. It’s true LLMs aren’t great at it and more is required, as with, for example, the geometry system DeepMind is making progress on.

On the other hand they can be very useful to explain concepts and allow interactive questioning to drill down and help build understanding of complex mathematical concepts, all during a morning commute via the voice interface.

How do you debug its hallucinated misinformation via the voice interface while you commute?
I just use my memory and verify later. Unlike an LLM, I have persistent, durable long-term storage of knowledge. Typically I can pretty easily pick out a hallucination, though, because there’s often a very clear inconsistency or logical leap that is nonsense.
I’m not the parent, but depending on the context, GPT-4 will often make up functions that then end up requiring research and correction; in other cases like once when I asked it to show me an example of a class of X86 assembly instructions, it just added a label and skipped the actual instruction and implementation!

Yesterday I was looking for some help on an issue with the unshare command; it repeatedly made bad assumptions about the nature of the error even though I provided it with the full error message, and one could already guess the initial cause by looking at that.

I guess such errors can be frighteningly common once you get outside of typical web development.

The models that you have tried... are garbage. Hmm, maybe you are not among the many, many, many inside professionals and uniformed services that have different access than you do? Money talks?
It is remarkable that folks who tried a garbage LLM like Copilot, 3.5, or Gemini, or who made Meta's LLMs say naughty words, seem to think these are still SOTA. Sometimes I stumble on them and am shocked at the degradation in quality, then realize my settings are wrong. People are vastly underestimating the rate of change here.
People have tried GPT-4. It makes the same kinds of errors as GPT-3; it just has a bigger set of known things where it does OK, so it is immensely more useful.

It is like a calculator that only worked on one digit and now works on two: the improvement is immense, but it's still nowhere close to replacing mathematicians, since it isn't even working on the same kind of problems.

Edit: In several years we might have a perfect calculator that is better than any human at such tasks, but it still won't beat humans at stuff unrelated to calculations. Or, in the case of LLMs pattern matching texts: humans don't pattern match texts to plan or mentally simulate scenarios, so that part isn't covered by LLMs. Human-level planning combined with today's LLM-level pattern matching on text would be really useful, and we see a lot of humans work that way by using the LLM as a pattern matcher. But there is no progress on automating human-level planning so far; LLMs aren't it.

> People are vastly underestimating the rate of change here

GPT-3.5 was released in March 2022. We are now in June 2024. Over 2 years later.

And on average GPT-4 is about 40% more accurate.

For me, LLMs are very much like self-driving cars. On the journey towards perfect accuracy it gets progressively harder to make advancements.

And for it to replace the status quo it really does need to be perfect. And there is no evidence or research that this is possible.

It's enough to decrease the number of ppl you need in IT by 20-30%.

Ppl don't want to hear that, but you see fewer and fewer offers, and not only for junior positions.

The hard truth is that, like with any tool/automation, the more performance improves, the fewer ppl are needed for this kind of work.

Just look at how some parts of manual labor were made redundant.

Why ppl think it won't be the same with mental work is beyond me.

Not yet, because the reliability isn't there. You still need to validate everything it does.

E.g. I had it autocompleting a set of 20 variables today. Something like output.blah=tostring(input[blah]). The kind of work you give to a regex.

In the middle of the list, it decides to go output.blah=some long weird piece of code, completely unexpected and syntactically invalid.

I am still in my AI evaluation phase, and sometimes I am impressed with what it does. But just as possible is an unexpected total failure. As long as it does that, I can't trust it.
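
To make the "work you give to a regex" point concrete, here is a rough Python sketch of doing that job deterministically and then checking a pasted list for the odd line out. The field names and the helper names (`generate_assignments`, `find_bad_lines`) are made up for illustration; only the `output.blah=tostring(input[blah])` shape comes from the comment above.

```python
import re

# Expected shape of every line: output.<name>=tostring(input[<name>])
ASSIGNMENT = re.compile(r"^output\.(\w+)=tostring\(input\[(\w+)\]\)$")


def generate_assignments(names):
    """Build one assignment line per field name, always in the same shape."""
    return [f"output.{name}=tostring(input[{name}])" for name in names]


def find_bad_lines(lines):
    """Return (line_number, text) for every line that breaks the pattern.

    A line is also flagged when the name on the left does not match
    the name inside input[...].
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        match = ASSIGNMENT.match(line.strip())
        if match is None or match.group(1) != match.group(2):
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # Hypothetical field names, not taken from the original comment.
    lines = generate_assignments(["id", "name", "price"])
    lines.insert(1, "output.name=some long weird piece of code((")
    for number, text in find_bad_lines(lines):
        print(f"line {number} does not match: {text}")
```

The generator never drifts, and the checker is the cheap kind of validation you end up needing anyway when the list comes from autocomplete.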

>On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Is it generally because the LLM was not trained on that data and therefore has no knowledge of it, or because it can't reason well enough?

LLMs don't and are not built to reason, they are next token predictors.
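
For anyone unsure what "next token predictor" means mechanically, here is a toy Python sketch. It assumes a simple bigram count table with greedy decoding, which is a deliberate stand-in and not how a real LLM is built (those use neural networks over much longer contexts); all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count, for each token, which tokens followed it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_tokens=5):
    """Greedy decoding: repeatedly append the single most likely next token."""
    out = [start]
    while len(out) < max_tokens:
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the cat".split()
    model = train_bigrams(corpus)
    print(generate(model, "sat", max_tokens=4))
```

The loop only ever asks "what usually comes next?", which is the sense in which the comment above says there is no explicit reasoning step.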