HN Mirror

	by swapnilt 247 days ago
	Opus 4.5's scaling is impressive on benchmarks, but the usual caveats apply: benchmark saturation is real, and we're seeing diminishing returns on evals that test pattern-matching rather than genuine reasoning. The more relevant question: has anyone stress-tested this on novel problems or complex multi-step reasoning outside the training data distribution? Marketing often showcases 'advanced math' and 'code generation' on problems whose solutions already exist in the training data. The claim of 'reasoning improvement' needs validation on genuinely unfamiliar problem classes.