
by Zababa 6 hours ago

My criterion is: "do they measure performance, or at least try to?" Caveman [1], RTK [2] and, more recently, ponytail [3] either don't, or rely on a few trivial tests. None of them report results on widely used benchmarks (SWE Pro and the like), which have their issues but would at least give some indication. They also don't measure "big model + caveman" against a smaller model.

A few times, removing all the custom instructions I had started using with model N-2 made model N perform way better, so I'm very suspicious of anything that changes how the model works. It's easy to degrade performance silently, and suddenly you're paying latest-Opus costs for six-month-old Sonnet performance.
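
A minimal sketch of what such a comparison could look like, assuming you already have per-configuration benchmark results in hand. `RunResult`, `rank`, and every field name here are hypothetical and do not come from any of the linked projects; the point is only that pass rate and cost per solved task should be reported side by side for each configuration (big model, big model + tool, smaller model).

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one configuration on one benchmark (all fields hypothetical)."""
    config: str      # e.g. "big model", "big model + tool", "small model"
    passed: int      # number of tasks solved
    total: int       # number of tasks attempted
    cost_usd: float  # total spend for the run

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total

    @property
    def cost_per_solved(self) -> float:
        # A run that solves nothing has no meaningful cost per solved task.
        return self.cost_usd / self.passed if self.passed else float("inf")


def rank(results: list[RunResult]) -> list[RunResult]:
    """Order by pass rate (highest first), breaking ties by cost per solved task."""
    return sorted(results, key=lambda r: (-r.pass_rate, r.cost_per_solved))


def report(results: list[RunResult]) -> None:
    """Print one line per configuration, best first."""
    for r in rank(results):
        print(f"{r.config}: {r.pass_rate:.1%} solved, ${r.cost_per_solved:.2f} per solved task")
```

Feed it one `RunResult` per configuration from the same benchmark run; if "big model + tool" ranks below the plain big model, or no better than the smaller model at a higher cost per solved task, the tool is not paying for itself.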

[1]: https://github.com/JuliusBrussee/caveman

[2]: https://github.com/rtk-ai/rtk

[3]: https://github.com/DietrichGebert/ponytail