by barrell 36 days ago
Azure recently discontinued the gpt-4.1 model. I had to move off of it, and every gpt-5* model I moved to was worse (more failures & lower accuracy) and more expensive. I had to rewrite the entire system from high-school-level prompts to lower-elementary-school-level prompts using non-gpt models.

I would say models hit a bottleneck a long time ago. My personal opinion is that they are now overfitting newer models on coding and "agentic" capabilities, at great expense to general abilities in other domains.


I am wondering if everyone is moving to an IPO and striking these bizarre circular deals because they’ve hit the ceiling on what can be done with more compute until a major architectural innovation happens.

Still amazing, but 5.5 does feel like incremental progress with a massive upcharge.

Ofc they have hit a ceiling. Why do you think OAI has shut down many of its projects, like the research one called Prism?

The reality is that both Anthropic and OAI have converged on LLMs as a tool for software production - that's where the majority of their revenue comes from.

Can you elaborate on what kind of system you built? I'm curious which specific prompts are getting worse responses with the newer models.
Linguistics, specifically as it pertains to language learning

Edit: Whoops, read your question wrong. I do a bunch of NLP on different languages, and use LLMs to pad out and interpret the data. Asking for things like translations, alternatives, transliterations; associating and validating data; transferring data from one language to another; segmentation and cross-lingual alignment; the list goes on.

I did manage to get higher quality in the end, so it’s not entirely a regression. But older LLMs were much more capable with less prompting at interpreting disparate data and tying it together.

Most of the work I do does not really have a “right answer,” just a lot of wrong ones, which I think is what trips up LLMs. If I turn on reasoning for any step in my pipeline, the token count goes up 100-fold and the quality gets cut in half.

Edit 2: I did have to move off of GPT though to get the improvements mentioned. Go mistral!

What kind of data are you interpreting? Do you mean document extraction from different languages? I have only used GPT5.5 for agentic coding, which did get significantly better in my experience, although that does align with your conjecture that their focus is on improving this. I haven't noticed a regression when interacting with it in different languages though (specifically German and Russian). I have done data extraction from documents in different languages, but only with locally hosted LLMs (mainly Qwen3.5-397b), as I cannot legally use cloud-based solutions. My local solution was more than sufficient, so I would be surprised if a frontier model failed at that.
I actually think it makes sense to hone models for coding and agentic capabilities. Those models will be specialized for those tasks, and the results will be cheaper and better. We can still have a general model alongside specialized models.