| HN Mirror


	by flexagoon 3 days ago
	I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying: I have to do some easy, boring task and think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

10 comments

throwaway67678 3 days ago

Agent mania setting in

It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour

smith7018 3 days ago

I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.

overgard 3 days ago

I tend to be cynical about AI companies, but I'm guessing the bad estimates more just come from a complete lack of actual data it could use for that so it's more or less a hallucination.

leodavi 3 days ago

I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.

Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.

I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.

AgentMasterRace 3 days ago

All the models have broken estimates. They're trained heavily on Jira and GitHub tasks and issues, that's why their estimates are human.

esperent 3 days ago

Even for humans the estimates are way off, unless it's based on data that has some serious padding.

That said, it'll often say "2 days of work" and then complete the coding in 30 minutes, and while that's amusing, afterwards, I'll need to manually test, or send to other people for review, or realize the agent only actually did half the work and I need to do a second pass (or a third etc.) and then often getting the feature in does genuinely take two days.

Terretta 3 days ago

> the estimates

It doesn't estimate.

It generates tokens that read like estimates associated with the context in its training material.

What would you expect the generator to output instead?

legulere 3 days ago

It generates tokens by estimating what the next token is going to be.

Sure, it cannot think like a human, but given its input, it should give a good statistical answer (approximating not how long it actually takes, but how long a human would say it takes).

mediaman 3 days ago

The funny thing about this comment is that neural networks are universal function approximators.

The most fundamental essence of what they do is exactly what you say they don't: estimate.

airstrike 3 days ago

Funny and ironic in a way, but the point still stands that they do not actually estimate the time it will take.

greenavocado 3 days ago

> they do not actually estimate the time it will take

You can't prove that )))

taneq 3 days ago

Therein lies the rub, no? To accurately predict the next token produced by a process, it’s necessary to model that process. If the process is a human attempting to estimate the duration of a task, then in some sense the LLM is modeling the estimation process. We’re well past the point where it’s credible to claim that LLMs just regurgitate their training data.

incr_me 3 days ago

Obviously there isn't a hidden corpus of logs of coding chatbot assistants that has been accumulating over the years, but these coding chatbot assistants output tokens that resemble how we all imagined a coding chatbot assistant would have operated had it existed in the first place to end up in a corpus. "Training material" includes supervised fine-tuning, preference training, RLHF, and so on, so that certain outputs (like these timeline estimates) may really have been decided (at some level of conscious awareness) by product teams.

carterschonwald 3 days ago

you might like the stuff in my work on oh my pi, it's a test bed for my ideas around making these tools more reliable. I'm hoping to have a native UI iteration of the real thing (which this is a test bed for) this summer.

https://github.com/cartazio/oh-punkin-pi/blob/main/scripts/b...

InterviewFrog 3 days ago

This is so 2023. The thought process.

At that time the predominant view was that LLMs were nothing but stochastic parrots, that they would plateau, and that hallucinations couldn't be fixed.

At this point I doubt there are any AI sceptics left. That ship has long sailed. The only thing that matters is whether the estimates are accurate, and AI can improve on that too.

Even humans only estimate based on neurons firing in prior patterns.

nl 3 days ago

Actually in this case they possibly are estimates.

It's been known for some years[1] that LLMs do regression in-context. Frontier models have been trained against many, many issue texts that include task breakdowns and estimates.

[1] https://arxiv.org/html/2409.04318v1

kube-system 3 days ago

Interesting. So it may have learned how to estimate as a human but doesn’t understand that it doesn’t operate at that speed :D

I wonder if there’s a reasonable way to give an llm parameters that give it a concept of its own execution speed. Seems that could be useful for multiple purposes

nl 2 days ago

Yes, it's entirely possible to do that via RL. It'd be a fun little project you could do for less than $100 on a small LLM actually.

ghshephard 3 days ago

I think people are continuing to view these systems as pure LLMs, when that ship sailed 6+ months ago. Between reviewing memory and using agent harnesses, sub-agents, and skills to go out and discover information, modern systems (Codex, Claude Code, Cursor) use LLMs, but the LLM is only a small component of them. Compare what you get from sending a request to a chatbot like ChatGPT to what you can get from a modern harness. The output is influenced by the LLM, but it's no longer a "model making a token prediction based on training material and RLHF"; that's a very 2025 way of looking at these systems.

Even Gary Marcus is starting to come around and realize that his priors are no longer as relevant as they once were.

irthomasthomas 3 days ago

No one is bitter lesson pilled anymore. Everyone is pivoting to neurosymbolic systems. It looks like Gary Marcus was right.

nl 3 days ago

> No one is bitter lesson pilled anymore.

Will the 10T parameter Mythos model be released this month or next month?

They'd better release it soon, because it is generally accepted that one of the reasons GPT 5.5 is better at hard tasks than Opus is its parameter size, and that Opus 4.8 remains competitive only by scaling test-time compute (see how many more tokens it uses than GPT 5.5)

https://www.reddit.com/r/LLM/comments/1sz8bjz/parameter_esti...

wild_egg 3 days ago

How is neurosymbolic not aligned with the bitter lesson? The bitter lesson is completely agnostic to architecture.

Terretta 3 days ago

You think someone is, or even should be, special-casing things like estimates? What else deserves that level of intervention so they look less dumb?

Logistics for getting to the car wash next door?

In the meantime, alas, no: we can see from actual prompts sent directly or through sub-agents, and from the actual replies, that estimates remain LLM generated.

Though, this discussion here could change that, because indeed there is a lot of special casing and context stuffing going on, one of the oldest being today's date for example.

• • •

I did read the Claude Code leak, and use pi, etc. So I disagree with your premise rather strongly. Today's "systems" remain, roughly, piles of markdown and context engineering wrapped in UI affordances, and behave very similarly today to how they did in 2024 for those already engineering context and delegating.

ghshephard 3 days ago

I do a lot of code bisecting with Claude Code - and it spends hours running experiments - looking at experiment results, making guesses as to what to try next for an experiment - until it eventually comes around to a working code pattern. I mean - maybe this is as much a reflection on me as anything else - but its pattern of logic isn't that much different from what I would do. It knows, in general, what tools and APIs it can call - it tries something - observes the result, and then comes back and tries different experiments based on success/failure - mostly efficiently bisecting to a solution.

I'm still lower down on the capability scale - as I'm still manually directing agents to do these wiggins loops - obviously the next step up is to direct the code-loops which control the agents. I just haven't got my tooling nailed in place to the point where I find that's more productive.

I actually might agree with you that this is mostly just "next token prediction" - if I can concede that's really all I do as well.

8note 3 days ago

rather than special casing, make real data based on chat logs for how long things took both in calendar and chat time
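
A minimal sketch of what such a dataset could look like, assuming each task is logged with the model's up-front estimate, the time the agent actually ran (chat time), and the time until the change was really done (calendar time). The `TaskRecord` fields, the `overestimate_factor` helper, and every number below are hypothetical and only illustrate the idea:

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Iterable


@dataclass
class TaskRecord:
    name: str
    estimated_hours: float  # what the model said up front
    agent_hours: float      # chat time: how long the agent actually ran
    calendar_hours: float   # calendar time: until reviewed, tested, merged


def overestimate_factor(
    records: Iterable[TaskRecord],
    actual: Callable[[TaskRecord], float],
) -> float:
    """Median of estimate / actual; `actual` picks which duration to compare against."""
    return median(r.estimated_hours / actual(r) for r in records)


# Hypothetical log entries, for illustration only.
records = [
    TaskRecord("add button", estimated_hours=16.0, agent_hours=0.5, calendar_hours=8.0),
    TaskRecord("refactor parser", estimated_hours=40.0, agent_hours=1.0, calendar_hours=20.0),
    TaskRecord("fix flaky test", estimated_hours=8.0, agent_hours=0.25, calendar_hours=2.0),
]

vs_agent = overestimate_factor(records, lambda r: r.agent_hours)
vs_calendar = overestimate_factor(records, lambda r: r.calendar_hours)

print(f"estimate vs chat time:     {vs_agent:.0f}x too high")
print(f"estimate vs calendar time: {vs_calendar:.0f}x too high")
```

Keeping the two durations separate matters: an estimate can look absurd against chat time and still be in the right ballpark against calendar time.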

dizhn 3 days ago

All models do it. It's their training. They didn't have "a person does this in a week but an LLM could in a minute" in their training yet. They also don't have the concept of elapsed time unless you ask them how long something has taken.

Narciss 3 days ago

Nah it’s all from the pretraining data

BobbyTables2 3 days ago

That’s right up there with Scotty in the classic Star Trek always multiplying time estimates by 4 so he looks like a “miracle worker”

KronisLV 3 days ago

I mean in general I'd rather take slightly inflated estimates than the odd sprint poker stuff where other devs and PMs negotiate hours down and before you know it you're also stuck fixing nitpicky reviewer comments on code that is already good enough and have to send a release at like 7 PM, ofc also without enough tests or even enough manual checks and testing, cause people repeatedly act against their self-interest and try to compress timelines, thinking that that's somehow good for them.

At least with AI that actually does things more quickly, there is a bit more breathing room (introducing AI is easier than changing a given environment).

Aside from that, I wonder how much variety there is in practice: between "Oh yeah, I added that new button while we were in the meeting" and "The new button feature will be ready in Q3 according to the roadmap, once we have sign-off from all the stakeholders."

andai 3 days ago

I heard an anecdote. Guy spent several days trying to convince his AI agent to build a feature. Kept saying it was crazy complicated, would take weeks.

Finally he convinced it to try. It one shotted it in 30 seconds.

Turns out the agents' idea of what is hard and easy also comes from Common Crawl.

wild_egg 3 days ago

Why on earth would you spend any time at all convincing an agent of anything? You say "just do it" and off it goes.

dr_dshiv 3 days ago

Ya, but “doit” is 2x more efficient

brianwawok 3 days ago

Uh, Claude tries real hard to dodge work. It talks about how it’s a really hard 10 PRs. I finally convince it to do it as 1. It stops 10% through and says OK, done with PR 1, we can work on the last 9 tomorrow. Ugh.

handfuloflight 2 days ago

Maybe we shouldn't have AI mimic humans too closely?

g8oz 2 days ago

You need to assert dominance.

znpy 2 days ago

> It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.)

those estimates are based on previous human estimates (the datasets it's been trained on).

unironically, when your comments become part of a dataset, LLMs will likely get much better at estimating.

now that i think about it, all these writings about LLMs will give LLMs something much like meta-cognition.

throw1234567891 3 days ago

It repeats what it has seen in the training data. Expecting it to reason about the complexity of a task is a pipe dream. The best is to tell it not to come back with estimates, and when it does, remove them anyway.

andai 3 days ago

I added "you can do anything, believe in yourself" to system prompt, and task completion increased significantly.

jimbokun 2 days ago

Well how else could I keep my reputation as a miracle worker Captain?

SwellJoe 3 days ago

DeepSeek is the fastest model in the benchmarks I've been doing (https://swelljoe.com/post/will-it-mythos/). Followed not so closely by Opus 4.8 and even less closely by Gemini 3.5 Flash and GPT 5.5. I've been really impressed with it, so far. It's also among the best at doing the work, though still trailing the frontier models from Anthropic and OpenAI.

anschl 2 days ago

Nice benchmark, thanks! Which quants did you choose for the self hosted models?

SwellJoe 2 days ago

8-bit on that one (unsloth 8_K_XL). But, the next post compares all common quantizations of Qwen 3.6.

I have another coming in a day or so for Gemma 4 with the 4-bit QAT version, which is very surprising (in a good way, Gemma 4 is impressive for this task).

RussianCow 3 days ago

Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.

https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...

SwellJoe 3 days ago

In recent benchmarking I've been doing, DeepSeek V4 Pro was the fastest of 21 models, by a comfortable margin (https://swelljoe.com/html/bench-report-final.html). Faster than Claude Opus 4.8, which was the second fastest (Mistral doesn't count because it seems to have refused to participate). But, it's a limited data set, just a few benchmark runs of a limited set of tasks. It's entirely possible I happened to be calling the API at its least busy time and maybe Claude got hit during a busy time.

sarjann 3 days ago

I don't think token speed matters as much when a lot of tokens are needed to achieve a task, e.g. in the Artificial Analysis benchmarks, where DeepSeek V4 is one of the biggest token burners.
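
A back-of-the-envelope sketch of that trade-off, assuming generation speed is the only cost. The `wall_clock_seconds` helper and all the numbers are made up for illustration and are not taken from any benchmark:

```python
def wall_clock_seconds(tokens_needed: int, tokens_per_second: float) -> float:
    """Time to finish a task if token generation were the only cost."""
    return tokens_needed / tokens_per_second


# Hypothetical models: one streams twice as fast but burns three times the tokens.
fast_but_verbose = wall_clock_seconds(tokens_needed=60_000, tokens_per_second=100.0)
slow_but_terse = wall_clock_seconds(tokens_needed=20_000, tokens_per_second=50.0)

print(f"fast but verbose: {fast_but_verbose:.0f} s")
print(f"slow but terse:   {slow_but_terse:.0f} s")
```

With these made-up figures the "slower" model finishes first, which is why tokens-per-task has to be read alongside tokens-per-second.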

brianwawok 3 days ago

Both matter.

flexagoon 3 days ago

No, I mean Pro. I use it through OpenCode Go so I don't know what provider it uses under the hood, but it's very fast in my experience.

thecopy 2 days ago

DS through OpenRouter is significantly slower than direct from DS platform in my experience

specproc 3 days ago

Yeah, flash is crazy fast, but I've found performance variable.

binary0010 3 days ago

Flash is amazing if you know the domain really well.

E.g. occasionally it makes the dumbest mistakes you've ever seen and can't correct them. However, it's fairly rare, and if you know the domain really well, occasionally popping into the code and pushing it towards the correct solution takes like 20 seconds or whatever.

So the speed you can move with flash + high domain knowledge beats opus by a mile in my experience.

I tried to switch back to 4.8 for a bit when it came out. It feels so bad waiting 20 mins for a mediocre solution when I could have had everything complete - with multiple iteration cycles - in flash in like 3-5 mins.

addozhang 3 days ago

Yes, you don't need much domain knowledge to use Opus, but it's just way too expensive.

59nadir 2 days ago

For losers who can't put together a program to save their life, have no real skills and were always not really interested in programming (hence their poor skills), renting a robot buddy to do it for them is a good deal, until the buddy cuts in materially into their salary, and until their bosses realize that they really just have robot operators on staff instead of people who can actually do things.

Induane 2 days ago

It's nice when I want to be lazy though.

Or when I'm working two contract gigs. I can spec things out for one and turn it loose and trust it. Then work more closely with deepseek on the other project.

binary0010 3 days ago

I exclusively use deepseek v4 flash now, completely stopped using slow models like Claude.

Basically I never have to wait - yes I have to tell it little corrections occasionally (but I know the domain really well so that's not an issue), but it's so much faster than anything else it's kinda crazy. I love the super fast speeds with high involvement development cycle.

I actually enjoy using agentic development flows for the first time now - whereas with Claude I absolutely hated it. That 5 to 20 min wait after every prompt absolutely killed my desire to even want to work at all.

znpy 2 days ago

> I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer

the way software engineering works these days reminds me a lot of factory workers on production lines that just sit in front of a production line all day and take out faulty items and/or perform a single step in the production of goods.

abustamam 2 days ago

Take the nap anyway, just say it took all afternoon :)

throw-the-towel 3 days ago

FWIW, for me just today it got itself into silly rabbit holes twice, and both times I had to fix things myself. Scarily, this is something I catch myself doing as well.

tmaly 3 days ago

This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.

behnamoh 3 days ago

Same. How can DeepSeek serve V4-Pro at such high speeds despite the sanctions?

rubyn00bie 3 days ago

The sanctions only “prevent” them from directly buying NVidia’s latest and greatest in the sense that NVidia can’t sell directly to them. Essentially, there are companies now who are in a country without the sanctions, they buy from NVidia (or a partner), and then ship them off to China. For the orgs in China doing this, there’s zero legal risk besides having foreign customs service intercept the shipment and losing the goods. For NVidia there is zero incentive to care, as long as they look like they do, because sales are sales. You can bet Jensen ain’t losing sleep over it.

GamersNexus had a really good investigative piece (~3hrs long) on this where they went to China and met with grey market sellers. That piece absolutely pissed off NVidia and resulted in a fight with Bloomberg too.

Deepseek may also be running inference on oodles of Chinese hardware but it wouldn’t surprise me for a second if they just acquired Blackwell chips through the grey market. The original Deepseek models were all trained using NVidia chips if I remember right.

seewhydee 3 days ago

That wouldn't explain why Deepseek is fast relative to other Chinese providers, especially considering that they're reportedly ahead of the curve among Chinese companies in moving off Nvidia. I think their quant fund background has more to do with it. Their models are clearly designed with performant inference in mind.

ljosifov 2 days ago

Yes, it's performant, and esp. performant at non-trivial context depths. DeepSeek-V4 (DS4) and Flash (DS4F) lose tok/s much more slowly than the rest. On my M2 Max it took a context depth of 768K to drop to ~10 tok/s.

https://x.com/ljupc0/status/2062457314414587996

Other local models I've checked drop to unusable speeds way sooner. The only other model I've tried with a similarly favourable curve is nemotron-cascade-2-30b-a3b. But it's a small model, way dumber than DS4F.

Coding agent use cases have large context depths. The rate of decline is as important as the headline number.
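
One way to compare decline curves, sketched under the assumption that you have (context depth, tok/s) measurements per model. The `retained_fraction` helper and both sets of numbers are hypothetical, not my measurements:

```python
def retained_fraction(samples: list[tuple[int, float]]) -> dict[int, float]:
    """Map each context depth to speed as a fraction of the shallowest-depth speed."""
    ordered = sorted(samples)      # sort by context depth
    baseline = ordered[0][1]       # tok/s at the shallowest depth measured
    return {depth: speed / baseline for depth, speed in ordered}


# Hypothetical (context_tokens, tok_per_s) measurements, for illustration only.
model_a = [(1_000, 40.0), (100_000, 30.0), (500_000, 20.0)]
model_b = [(1_000, 80.0), (100_000, 20.0), (500_000, 4.0)]

for name, samples in (("model A", model_a), ("model B", model_b)):
    curve = retained_fraction(samples)
    print(name, {depth: f"{frac:.0%}" for depth, frac in curve.items()})
```

In this made-up data model B has twice the headline speed, yet at 500K tokens of context it is five times slower than model A, which is the situation that matters for a coding agent.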

andai 3 days ago

With Flash it's basically instant for smaller tasks, yeah.
