by kibbi 700 days ago
Am I blind or is there no mention at all of the GPT model he used?

The author states his conclusions but doesn't give the reader the information needed to examine the problem. Missing are:

- Whether the article to be summarized fits into the tested GPT model's context size

- The prompt

- The number of attempts

- Which information in the summary, specifically, is missing or wrong (he doesn't always say)

For example: "I first tried to let ChatGPT one of my key posts (...). ChatGPT made a total mess of it. What it said had little to do with the original post, and where it did, it said the opposite of what the post said." He doesn't say which statements of the original article were reproduced falsely by ChatGPT.

My experience is that ChatGPT 4 is good when summarizing articles, and extremely helpful when I need to shorten my own writing. Recently I had to write a grant application with a strict size limit of 10 pages, and ChatGPT 4 helped me a lot by skillfully condensing my chapters into shorter texts. The model's understanding of the (rather niche) topic was very good. I never fed it more than about two pages of text at once. It also adopted my style of writing to a sufficient degree. A hypothetical human who'd have to help on short notice probably would have needed a whole stressful day to do comparable work.

You write as if you’ve found a hole in the article’s argument. The lack of evidence is a hole in the reporting, for sure. The tone of your comment suggests you feel that by not publishing all their evidence, the author’s point is wrong (rather than under-justified). However, the example you use to back up your point also backs up the article’s point. The article’s point is that ChatGPT doesn’t summarise, it only shortens. Your example indicates shortening, but not summarising.
There are just so many articles of people whining about how ChatGPT can’t do things, when they clearly haven’t prompted it very thoughtfully.

So I think that’s why you see so many reactions like this.

I’ve found ChatGPT incredibly good at all sorts of things people say it is bad at, but you need patience: you have to really figure out the boundaries of the task and keep adding guidance to the prompt to keep it on track.

The article makes it clear that there is a semantic difference between shortening and summarizing and, importantly, that summarizing requires understanding, which ChatGPT most certainly does not have.

One example in the article is that if you have 35 sentences leading up to a 36th sentence conclusion, ChatGPT is very likely to shorten it to things in the earlier sentences and never actually summarize the important point.

Which ChatGPT?
It doesn't matter which. The concept of understanding is entirely orthogonal to what an LLM is and how it works. It has no such thing, and can't.
You seem to be on the "statistical next token predictor" side. I'm more on the side of those who invented it (they should know) and who think these machines can understand things.
In other news, someone hits a piano five times with a hammer and proclaims pianos are no good at making music.
At what point does it become easier to just do the task yourself? I’ve pondered this question often and came to the conclusion that, at the current level of output, it’s not worth it for me to tinker with it until I get sensible responses.
It depends on the task. Sometimes I have just given up when it really can’t get something.

But other times I’ve persevered and once it’s ‘got’ it, it can then repeat it as many times as I need. That’s the knack really. Get it to the point of understanding and then reuse that infinitely and save yourself a lot of time.

In the example I mentioned, ChatGPT 4 did keep all essential statements of my texts when reproducing shorter versions of them. For example, it often wrote one high-level sentence which skillfully summarized a paragraph of the original text. As far as I understand, this is what the author meant by 'summarizing' vs. 'shortening (while missing essential statements)'.

I was impressed at those high-level summaries. If I had assigned this task to several humans, I'm not sure how many would have been able to achieve similar results.

I agree.

For example, looking at the ChatGPT link the author has, the model loaded 5 pages besides the one the author wanted. That clearly is going to cause some issues but the author didn't modify the prompt to prevent it. It was also a misspelled five (?) word prompt.

I don't see how you can draw conclusions from a model not reading your mind when you give it basically no instructions.

You need to treat models like a new hire you're delegating to and not an omniscient being that reads your intent on its own.

Why, if the author asks it to summarise a single webpage and gives the link, should ChatGPT go out and load 5 more? (One is the same page again, the others are short overview pages, so they won't have influenced the result much.)

And why all this talk about trying to engineer a prompt so that in the end the result is good? Should an actual usable system not just handle "Please summarise [url/PDF]"? That is, I suspect, what people expect to be able to do.

Summarize clearly means something different to the author and the people who think the model results are good. Everyone expects different things. Most people are used to others knowing their preferences and adjusting over time. Models do not unless you tell them.
Exactly, 'ChatGPT can't do this or that' is way too generic. We can't even be sure GPT-5 is still an LLM architecture.
I also could not find any mention of the methodology.
Without this detail, the whole body of work is anecdotal.
To be fair, most of the commentary on both sides of the LLM conversation is pretty anecdotal, which is increasingly looking like a structural problem given that any solid evidence goes into the training set in about an hour.
Definitely. Otherwise it would have required a lot more than a single blog post. It is an observation, not anything rigorous with a large number of examples and decent statistics.
It doesn’t require a whole lot more. 1) The full transcript of the LLM exchange and 2) version numbers for the LLM would go a long way.

It is basically a long-winded way of saying “it doesn’t work” in a bug report.
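To make the bug-report comparison concrete, here is a minimal sketch of what a checkable report could record. The class and all field names are hypothetical, made up for illustration; they simply mirror the items people in this thread said were missing (model version, prompt, number of attempts, transcript, and which statements came out wrong).

```python
from dataclasses import dataclass, field


@dataclass
class SummaryTestReport:
    """Hypothetical record of one summarization test (illustrative names only)."""
    model_version: str = ""   # exact model/version string, as reported by the tester
    prompt: str = ""          # the prompt as actually sent
    attempts: int = 0         # how many times the test was run
    transcript: list[str] = field(default_factory=list)        # full exchange
    wrong_statements: list[str] = field(default_factory=list)  # what the summary got wrong

    def missing(self) -> list[str]:
        """Names of fields left empty, i.e. what a reader cannot check."""
        return [name for name, value in vars(self).items() if not value]


# A report that only gives the prompt and says it was tried once.
report = SummaryTestReport(prompt="Summarise this page", attempts=1)
print(report.missing())  # ['model_version', 'transcript', 'wrong_statements']
```

A report that claims the summary was wrong but leaves `wrong_statements` empty is exactly the "it doesn't work" case.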

In the comments, the author clarified that he used GPT-4 for the article.

> What the colleague used, I can ask, but I suspect standard ChatGPT based on GPT-4. But my test was with GPT-4 (current standard), so that would mean about 8000 tokens (or roughly 4000 words, I think?). That may have influenced the result.
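The context-size question raised here can be checked roughly before blaming the model. Below is a minimal sketch, not a real tokenizer: it estimates tokens from a whitespace word count. The default of 2 tokens per word only mirrors the author's own guess (8000 tokens for roughly 4000 words) and is an assumption, as are the function names and the 1000 tokens reserved for the reply. A real tokenizer would give exact counts.

```python
import math


def estimate_tokens(text: str, tokens_per_word: float = 2.0) -> int:
    """Rough token estimate from a whitespace word count.

    tokens_per_word is an assumption: 2.0 mirrors the author's guess of
    8000 tokens for roughly 4000 words; it is not a measured value.
    """
    return math.ceil(len(text.split()) * tokens_per_word)


def fits_in_context(text: str, context_tokens: int = 8000,
                    reserved_for_reply: int = 1000,
                    tokens_per_word: float = 2.0) -> bool:
    """True if the estimated input size still leaves room for the reply."""
    budget = context_tokens - reserved_for_reply
    return estimate_tokens(text, tokens_per_word) <= budget


article = "word " * 4000            # stand-in for a 4000-word post
print(estimate_tokens(article))     # 8000
print(fits_in_context(article))     # False: nothing left for the summary
```

Under these assumptions a 4000-word post already fills the whole window, so anything longer would be cut off before the model ever sees the conclusion.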

Unfortunately this piece seems to be an exercise in confirmation bias and not a legitimate scientific inquiry.