HN Mirror



	by CSMastermind 15 days ago
	DeepSWE is the benchmark you actually want to watch. It's the only one that aligns with what users report from actually trying the models.


ryeguy 15 days ago

Did you read the blog post? They compare against DeepSWE and call it out as the worst one for false positives (the solution failed, but the benchmark assessed it as correct). It also has less language variety.


CSMastermind 15 days ago

I mean, yes, that is what you'd say if you were writing a blog post about your new benchmark.


ryeguy 14 days ago

Sure, but they at least quantified it with data. It's not like they just dropped a sentence asserting it; they showed numbers.
