Hacker News
by ryeguy_24 329 days ago
Doesn’t a lot here depend on prompting and methodology, in ways that would significantly affect overall performance? My gut instinct is that there are many ways to architect a system around the LLMs, and each might yield a different level of accuracy. What do others think?

Edit: After reading more, I guess this is meant to be a dumb benchmark to monitor over time. Maybe that’s the aim here, rather than viability as an auto-close tool.