by awoimbee 11 days ago
The benchmarks aren't great; they're super specific to sem's output. Why would I ask Claude how many "entities" were modified by a commit, and do I need a tool specifically for this request? Note that an "entity" is a sem-specific concept...

Thanks for pointing it out. I agree with you here: my testing process was quite specific to sem's output. I'd also love any suggestions on how you would design the whole testing process for this kind of tool.

I can also share my thought process: I was more interested in figuring out what the model's own search results and understanding look like without sem.