| HN Mirror


	by ModernMech 356 days ago
	I notice you don't bring any examples despite claiming the improvements are frequent and solid. It's likely because the improvements are actually hard to define and quantify. Which is why throughout this period of LLM development, there has been such an emphasis on synthetic benchmarks (which tell us nothing), rather than actual capabilities and real world results.

nonhaver 356 days ago

i didn't bring examples because i said personal experience. here's my "evidence" - gpt 4 took multiple shots and iterations and couldn't stay coherent with a prompt longer than 20k tokens (in my experience). then when 4o came out it improved on that (in my experience). o1 took 1-2 shots with fewer iterations (in my experience). o3 zero-shots most of the tasks i throw at it and stays coherent with very long prompts (in my experience).

here's something else to think about. try and tell everybody to go back to using gpt-4. then try and tell people to go back to using o1-full. you likely won't find any takers. it's almost like the newer models are improved and generally more useful

ModernMech 356 days ago

Why are your examples so vague?

I'm not saying they're not delivering better incremental results for people for specific tasks, I'm saying they're not improving as a technology in the way big tech is selling.

The technology itself is not really improving because all of the showstopping downsides from day one are still there: Hallucinations. Limited context window. Expensive to operate and train. Inability to recall simple information, inability to stay on task, support its output, or do long term planning. They don't self-improve or learn from their mistakes. They are credulous to a fault. There's been little progress on putting guardrails on them.

Little progress especially on the ethical questions that surround them, which seem to have gone out the window with all the dollar signs floating around. They've put waaaay more effort into the commoditization front. 0 concern for the impact of releasing these products to the world, 100% concern about how to make the most money off of them. These LLMs are becoming more than the model, they're now a full "service" with all the bullshit that entails like subscriptions, plans, limits, throttling, etc. The enshittification is firmly afoot.

nonhaver 356 days ago

not to offend - but it sounds like your response/worries are based more on an emotional reaction. and rightly so, this is by all means a very scary and uncertain time. and undeniably these companies have not taken into account the impact their products will cause and the safety surrounding that.

however, a lot of your claims are false - progress is being made in nearly all the areas you mentioned

> hallucinations

are reduced with GPT-5

https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb...

"gpt-5-thinking has a hallucination rate 65% smaller than OpenAI o3"
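to make the "65% smaller" wording concrete: it is a relative reduction, so the new rate is 35% of the old one, not 65 percentage points lower. a minimal sketch, assuming a made-up baseline rate (the 20% below is hypothetical and is not a figure from the linked document):

```python
def reduced_rate(baseline_rate: float, relative_reduction: float) -> float:
    """Return the rate left after a relative reduction (0.65 means '65% smaller')."""
    return baseline_rate * (1.0 - relative_reduction)


# Hypothetical baseline for illustration only, NOT taken from the system card.
baseline = 0.20
new_rate = reduced_rate(baseline, 0.65)
print(f"{baseline:.1%} -> {new_rate:.1%}")
```

so a 65% relative reduction still leaves a nonzero rate. how much that matters depends on the real baseline, which this sketch does not know.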

> limited context window

same deal. Gemini 2.5 Pro has a 1 million token context window and GPT-5 is at 400k, up from 200k with o3

https://blog.google/technology/google-deepmind/gemini-model-...

"native multimodality and a long context window. 2.5 Pro ships today with a 1 million token context window (2 million coming soon)"

> expensive to operate and train

we don't know the training and operating costs for certain, but GPT-5 provides the most intelligence for the cheapest price, at $10 per 1 million output tokens, which is unprecedented

https://platform.openai.com/docs/models/gpt-5
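for a sense of scale, here is a rough sketch of what the quoted $10 per 1 million output tokens works out to per response. it only uses the price quoted in this comment, ignores input-token pricing entirely, and the function name and token counts are made up for illustration:

```python
# Output-token price as quoted in the comment above (USD per 1M output tokens).
PRICE_PER_MILLION_OUTPUT_TOKENS_USD = 10.00


def output_cost_usd(
    output_tokens: int,
    price_per_million: float = PRICE_PER_MILLION_OUTPUT_TOKENS_USD,
) -> float:
    """Estimate output-only cost; input tokens are deliberately not modeled."""
    return output_tokens / 1_000_000 * price_per_million


# Hypothetical response sizes, for illustration only.
for tokens in (2_000, 50_000, 1_000_000):
    print(f"{tokens:>9,} output tokens ~ ${output_cost_usd(tokens):.2f}")
```

this says nothing about what the model costs the provider to train or run, only what the user pays per output token at the quoted price.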

> guardrails

are well implemented by some providers, e.g. google, whose gemini api offers multiple adjustable safety levels

https://ai.google.dev/gemini-api/docs/safety-settings

"You can use these filters to adjust what's appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as Dangerous due to the nature of the game. In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted."

now i'd like to ask you for evidence that none of these aspects have been improved - since you claim my examples are vague but make statements like

> Inability to recall simple information

> inability to stay on task

> (doesn't) support its output

> (no) long term planning

i've experienced the exact opposite. not 100% of the time, but compared to GPT-4 all of these areas have been massively improved. sorry i can't provide every single chat log i've ever had with these models to satisfy your vagueness-o-meter, or provide benchmarks, which i assume you will brush aside.

i've provided the examples above. meanwhile you seem to be making claims out of thin air and then claiming others are not providing examples up to your standard.

rustystump 355 days ago

Big claims of PRs and shipped code, then links to people who have a financial interest in the hype.

Not saying things are not getting better, but I have found that claims of amazing results tend to come from people who are not expert enough in the given domain to judge the actual quality of the output.

I love vibing out Rust, and it compiles and runs, but I have no idea if it is good Rust because, well, I barely understand Rust.

ModernMech 355 days ago

> now id like to ask you for evidence that none of these aspects have been improved

You're arguing against a strawman. I'm not saying there haven't been incremental improvements for the benchmarks they're targeting. I've said that several times now. I'm sure you're seeing improvements in the tasks you're doing.

But for me to say that there is more than a shell game going on, I will have to see tools that do not hallucinate. A (claimed, who knows if that's right, they can't even get the physics questions or the charts right) reduction of 65% is helpful but doesn't make these things useful tools in the way they're claiming they are.

> sorry i cant provide every single chat log ive ever had with these models to satisfy your vagueness-o-meter

I'm not asking for all of them, you didn't even share one!

Anyway, I just had this chat with the brand new state of the art GPT-5 in ChatGPT: https://chatgpt.com/share/68956bf0-4d74-8001-88fe-67d5160436...

Like I said, despite all the advances in the breathless press releases you're touting, the brand new model is just one bad roll away from behaving like the models from 3 years ago, and until that isn't the case, I'll continue to believe that the technology has hit a wall.

If it can't do this after how many years, then how is it supposed to be the smartest person I know in my pocket? How am I supposed to trust it, and build a foundation on it?

rustystump 355 days ago

Interesting thread. I think the key around hallucinations is analogous to compilers. In order for output to be implicitly trusted, it has to be as stable as a compiler's. Hallucinations mean I cannot yolo trust the output. Having to manually scan the code for issues defeats the fundamental benefit.

Compilers were not and are not always perfect, but I think AI has a long way to go before it passes that threshold. People act like it will in the next few years, but the current trajectory strongly suggests that is not the case.

nonhaver 355 days ago

i'll leave it at this: if “zero-hallucination omniscience” is your bar, you’ll stay disappointed - and that’s on your expectations, not the tech. personally i’ve been coding/researching faster and with fewer retries every time a new model drops - so my opinion is based on experience. you’re free to sit out the upgrade cycle
