	by diggan 309 days ago
	Indeed, I've also found that various models are good at various tasks, but I have yet to be able to categorize it as "Model X is good at Y-class of bugs". So I end up using N models for a first pass ("Find the root cause of this issue"), then once it's found, I pass it along to the same N models for them to attempt to solve it. So far, which model can find or solve what is really scattered all over the place.
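	A rough sketch of that two-pass workflow, for illustration only. The model callables are hypothetical stand-ins (plain functions from prompt to text), not any real client API, and the names `find_root_causes` and `attempt_fixes` are made up here:

```python
from typing import Callable, Dict

# Hypothetical model interface: takes a prompt string, returns a text answer.
Model = Callable[[str], str]


def find_root_causes(models: Dict[str, Model], issue: str) -> Dict[str, str]:
    """First pass: ask every model for the root cause of the same issue."""
    prompt = f"Find the root cause of this issue:\n{issue}"
    return {name: ask(prompt) for name, ask in models.items()}


def attempt_fixes(models: Dict[str, Model], issue: str, root_cause: str) -> Dict[str, str]:
    """Second pass: hand the chosen root cause back to the same models for a fix."""
    prompt = f"Root cause:\n{root_cause}\n\nPropose a fix for this issue:\n{issue}"
    return {name: ask(prompt) for name, ask in models.items()}
```

	Picking which root cause to carry into the second pass is left to the human here, since there is no known rule for which model gets which class of bug right.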

2 comments

irthomasthomas 309 days ago

You are experiencing the jagged skills frontier. All models have these weird skill gaps and prompt phrasing sensitivity. This is the main problem solved by an llm-consortium. It's expensive running multiple models in parallel for the same prompt, but the time saved is worth it for gnarly problems. It fills in the gaps between models to tame the jagged frontier.

My very first use of the llm-consortium saw me feeding in its own source code to look for bugs. It surfaced a serious bug which only one out of the three models had spotted. Lots of problems are NP-ish (hard to solve, easy to check), so parallel sampling works really well. Google's IMO gold and OpenAI's IOI gold both used parallel reasoning of some sort.
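To make the parallel-sampling idea concrete, here is a minimal sketch. It is not how llm-consortium is actually implemented; the model callables are hypothetical stand-ins that return a list of findings, and `fan_out` and `minority_findings` are names invented for this example:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set

# Hypothetical model interface: takes a prompt, returns a list of findings.
Model = Callable[[str], List[str]]


def fan_out(models: Dict[str, Model], prompt: str) -> Dict[str, Set[str]]:
    """Send the same prompt to every model in parallel and collect the findings."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(ask, prompt) for name, ask in models.items()}
        return {name: set(future.result()) for name, future in futures.items()}


def minority_findings(results: Dict[str, Set[str]]) -> Set[str]:
    """Return the findings that exactly one model reported."""
    counts: Dict[str, int] = {}
    for findings in results.values():
        for finding in findings:
            counts[finding] = counts.get(finding, 0) + 1
    return {finding for finding, count in counts.items() if count == 1}
```

The findings only one model reports are the ones a single-model run would most likely miss, like the bug mentioned above, so they are worth checking by hand rather than voting away.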

energy123 309 days ago

This is so true. Another thing: a model might be better at something in general, but worse if the context is too long. Looking at how GLM-4.5 was trained, on lots of short context, this may be the case for it.

GPT-5: Exceptional at abstract reasoning, planning and following the intention behind instructions. Concise and intentional. Not great at manipulating text or generating Python code.

Gemini 2.5 Pro: Exceptional at manipulating text and Python, not great at abstract reasoning. Verbose. Doesn't follow instructions well.

Another thing I've learned is that models work better when they work on code that they themselves generated. It's "in distribution" and more comprehensible to them.