by marv1nnnnn
403 days ago
I totally agree with your criticism. To be honest, it's hard even for me to evaluate.
What I do is select several packages that current LLMs fail to handle (they are in the sample folder: `crawl4ai`, `google-genai`, and `svelte`) and try some tricky prompts to see if it works.
But even that evaluation is hard. LLMs can hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver.
I actually prepared a comparison: Cursor vs. Cursor + internet vs. Cursor + context7 vs. Cursor + llm-min.txt. But I thought the results were too stochastic, so I didn't put it here. I will consider adding it to the repo as well.
|
You can report the success rate (%) over N runs for a set of problems, which gives you a number you can compare against other systems. A separate model does the evaluation. Existing frameworks like DeepEval facilitate this.
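
As a rough illustration of the success-rate idea, here is a minimal Python sketch. It does not use DeepEval's API; `System`, `Judge`, `success_rate`, and `compare` are hypothetical names made up for this example, and the judge is assumed to be any callable (for instance a wrapper around a separate model) that returns True when an answer is acceptable.

```python
from typing import Callable, Dict, List

# Hypothetical signatures (assumptions, not from any library):
# a system maps a prompt to an answer, and a judge decides
# whether that answer is acceptable for the prompt.
System = Callable[[str], str]
Judge = Callable[[str, str], bool]


def success_rate(system: System, judge: Judge,
                 problems: List[str], n_runs: int) -> float:
    """Percentage of runs the judge accepts, over all problems and repetitions."""
    if not problems or n_runs <= 0:
        raise ValueError("need at least one problem and one run")
    passed = 0
    for prompt in problems:
        # Repeat each prompt because individual runs are stochastic.
        for _ in range(n_runs):
            answer = system(prompt)
            if judge(prompt, answer):
                passed += 1
    return 100.0 * passed / (len(problems) * n_runs)


def compare(systems: Dict[str, System], judge: Judge,
            problems: List[str], n_runs: int) -> Dict[str, float]:
    """Success rate (%) per named system, on the same problems with the same judge."""
    return {name: success_rate(fn, judge, problems, n_runs)
            for name, fn in systems.items()}
```

With something like this, each setup in the comparison (plain Cursor, Cursor + llm-min.txt, and so on) would be one entry in `systems`, and a larger `n_runs` smooths out the run-to-run noise that makes a single trial hard to interpret.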