Hacker News
by lemax 20 days ago
Would love to see this benchmark run on frameworks/ORMs that are perceived as more LLM-friendly (e.g. do NestJS or Drizzle / Kysely score better than their choice of Sequelize?) and on more frontier models, not just GPT 5.2.

Did anyone read whether these tests include any validation loops? What happens if the models get back test failures, for instance? It would also be interesting to know how many turns it takes to get the full behavior suite passing. Great methodology in the study, though.