by appleorchard46 502 days ago
These posts about X task LLMs fail at when you give them Y prompt are getting more and more silly.

If you ask an AI to analyze some data, should the default behavior be to use that data to make various types of graphs, export said graphs, feed them back into itself, then analyze the shapes of those graphs to see if they resemble an animal?

Personally I would be very annoyed if I actually wanted a statistical analysis, and it spent a bajillion tokens following the process above in order to tell me my data looks like a chicken when you tip it sideways.

> However, this same trait makes them potentially problematic for exploratory data analysis. The core value of EDA lies in its ability to generate novel hypotheses through pattern recognition. The fact that both Sonnet and 4o required explicit prompting to notice even dramatic visual patterns suggests they may miss crucial insights during open-ended exploration.

It requires prompting for x if you want it to do x... That's a feature, not a bug. Note that no mention of open-ended exploration or approaching the data from alternate perspectives was made in the original prompt.

4 comments

I think it depends on whether one is using “AI” as a tool or as a replacement for an intelligent expert. If the former, sure, it’s maybe not expected, because the prompter is already an intelligent expert. If the latter, then yes, I think, because if you gave the task to an expert and they did not notice this, I would consider them not good at their job. See also Anscombe's quartet[1] and the Datasaurus dozen[2] (mentioned in another comment as well).

[1]: https://en.wikipedia.org/wiki/Anscombe's_quartet [2]: https://en.wikipedia.org/wiki/Datasaurus_dozen
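
To illustrate why Anscombe's quartet is relevant here, a minimal sketch using only the Python standard library. The values are the commonly published quartet data; the helper name `pearson_r` is just for this example. All four sets come out with nearly the same mean and correlation, even though their plots look nothing alike.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet (sets I-III share the same x values).
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Summary statistics per set: (mean of y, correlation of x and y).
summary = {name: (mean(ys), pearson_r(xs, ys)) for name, (xs, ys) in QUARTET.items()}

for name, (mean_y, r) in summary.items():
    print(f"set {name}: mean(y) = {mean_y:.2f}, r = {r:.3f}")
```

Summary statistics alone cannot tell these four sets apart, which is the usual argument for looking at the data's shape as well.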

This is true, but I would replace 'intelligent expert' with 'intelligent human expert'.

Graphing data to analyze it - and then seeing shapes and creatures in said graph - is a distinctly human practice, and not an inherently necessary part of most data analysis (the obvious exception being when said data draws a picture).

I think it's because the interface uses human language that people expect AI to make the same assumptions and follow the same processes as humans. In some ways it does, in other ways it doesn't. Expecting it to be the same as a human leads to frustration and a flawed understanding of its capabilities and limits.

> Graphing data to analyze it - and then seeing shapes and creatures in said graph - is a distinctly human practice, and not an inherently necessary part of most data analysis...

I disagree. Even apart from the obviously silly dinosaur and star in the Datasaurus Dozen, the other plots depict data sets which are clustered in specific ways which point clearly to something unusual going on in the data. For instance, no competent analysis of the "dots" data set would fail to call out that the points were all clustered tightly around nine evenly spaced centers. Whether you come to that conclusion through numerical analysis or by looking at a graph is immaterial, but, at least for us meatbags, drawing a graph is highly effective.
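
As a rough sketch of the "numerical analysis" route: the snippet below does not use the real "dots" data. It generates a synthetic stand-in (an assumed 3x3 grid of centers with small Gaussian noise) and counts groups of values separated by large gaps along each axis. The function name `count_groups` and the gap threshold are illustrative choices, not a standard method.

```python
import random


def count_groups(values, min_gap):
    """Count runs of sorted values separated by gaps larger than min_gap."""
    ordered = sorted(values)
    groups = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > min_gap:
            groups += 1
    return groups


# Synthetic stand-in for a "dots"-like data set (assumption: 3x3 grid of centers).
rng = random.Random(0)
centers = [20.0, 50.0, 80.0]
points = [
    (rng.gauss(cx, 1.5), rng.gauss(cy, 1.5))
    for cx in centers
    for cy in centers
    for _ in range(50)
]

xs = [x for x, _ in points]
ys = [y for _, y in points]

# Large empty gaps along an axis are a purely numerical hint of clustering.
x_groups = count_groups(xs, min_gap=10.0)
y_groups = count_groups(ys, min_gap=10.0)
print(f"groups along x: {x_groups}, groups along y: {y_groups}")
```

A check like this needs no plot at all, but someone (or something) still has to think to run it.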

> Whether you come to that conclusion through numerical analysis or by looking at a graph is immaterial, but, at least for us meatbags, drawing a graph is highly effective.

This is what I was trying to say - some things that are extremely helpful for humans (i.e. making graphs) might not be as necessary for AI, so asking a question and expecting a response contingent upon the particular way humans approach a problem is unlikely to get the results desired.

  > not an inherently necessary part of most data analysis
You do realize that the LLMs did not find the data suspicious, right? I think your answer would be appropriate if they had answered (without follow-up prompting, which leaks information to the LLM!) that the data was suspicious. But in fact, all the models say that the data is normally distributed. Sure, the author said this, but they confirmed it. If you run normaltest on either BMI or steps, you'll find that they are very NOT normal. In fact, you can also see this from the histograms.
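
For anyone who wants to try this kind of check, a minimal sketch, assuming the `normaltest` meant here is `scipy.stats.normaltest` (the comment does not name the library). The data below is synthetic, not the article's BMI or steps columns: a uniform sample stands in for an obviously non-normal column, and `looks_normal` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

# Synthetic data only; these are NOT the article's BMI/steps values.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=25.0, scale=4.0, size=1000)
uniform_sample = rng.uniform(low=15.0, high=35.0, size=1000)


def looks_normal(values, alpha=0.05):
    """Return (passes, p_value) for the D'Agostino-Pearson normality test.

    A small p-value means normality is rejected at level alpha.
    """
    _statistic, p_value = stats.normaltest(values)
    return bool(p_value >= alpha), float(p_value)


normal_ok, normal_p = looks_normal(normal_sample)
uniform_ok, uniform_p = looks_normal(uniform_sample)

print(f"normal sample:  passes={normal_ok}, p={normal_p:.3g}")
print(f"uniform sample: passes={uniform_ok}, p={uniform_p:.3g}")
```

On real data you would pass each column in place of the synthetic arrays, and a histogram remains a useful sanity check alongside the p-value.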

So honestly, this isn't even about the Gorilla. You're hyper-focused there because you're looking for a way to make the LLM right while not looking for why the LLM got it wrong (it did, there's no denying it, so we should understand why it is wrong, right?). The problem isn't so much about expecting it to be human; the question is whether it can do data analysis. The problem here is that the LLM will not correct you, it will not "trust but verify" you. It is a "yes man" and is trained to generate outputs that optimize human preference. That last part alone should make you extremely suspicious, as it means when it is wrong, it is more likely to be wrong in exactly the way you won't notice.

You don't think "Examine the data" and "Which other conclusions can you draw from the data?" are open-ended?

And even when explicitly prompted to look at the plot, they only brush up against the data anomalies rather than properly analyzing the plot.

I tried it on gpt4o with an upload of the image and "What do you see?" as prompt and it said "monkey". So YMMV; these tools can't be evaluated with just a bunch of gotcha prompts and ignorance of how to use them effectively.

It's not a gotcha to give it the data points and ask it to analyze. Uploading this data in image form is effectively a leading question tuned to the specific data, and an analysis tool that needs that kind of leading question is not good at its job.

I don't know why you would expect it to see a gorilla without an image to look at. Humans can't.

Without an image? No, not at all. It's supposed to make its own image. And it did make its own image. But it didn't properly analyze the image it made.

That's a feature that would need to be implemented. There's no reason to think it could look at the image of the plot it generated automatically, but feeding the image it generated back to it is no different from it viewing the image automatically.

What’s the point though? That LLMs tend to be constrained by constraints in their prompting? That seems unsurprising.

Humans are visual animals. We can spot a chicken in a graph, but we’re unlikely to be able to tell that a different graph is using XY coordinates to encode a message against a one-time pad. But so what?

I benchmark many of these things as "what would I want a human assistant to do" if they had instant speed, and noticing the pattern would definitely be warranted to determine whether the data could have been falsely generated.

It's not silly at all.

I have to agree with this.

Try sending this graph to an actual human analyst. His response, after you pay him, will probably be to cut off any further business relationship with you.