by ai-christianson
450 days ago
> The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks.

OMFG thank you for saying this. As a core contributor to RA.Aid, I think optimizing it for SWE-bench would actively go against perf on real-world tasks.

RA.Aid came about in the first place as a pragmatic programming tool (I created it while building another software startup, Fictie). It works well because it was literally made and tested by making other software, and these days it mostly creates its own code.

Do you have any tips or suggestions on how to do more formalized evals, but on tasks that resemble real-world tasks?
And before going to crowd-workers (maybe you can skip them entirely), try LLMs.