HN Mirror



	by danoandco 112 days ago
	Yes, this is the pass@k metric from code generation research. The relevant paper is "Evaluating Large Language Models Trained on Code" (Chen et al., 2021), which introduced the metric.
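
	For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled attempts solves the task. A minimal sketch of the unbiased estimator described in that paper, assuming n attempts were generated and c of them passed the tests (the function name and argument checks here are illustrative, not taken from the paper's code):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated attempts, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # With fewer than k incorrect attempts, any k drawn must include a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn attempts are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

	For example, with 10 attempts of which 3 pass, `pass_at_k(10, 3, 1)` is about 0.3 and `pass_at_k(10, 3, 5)` is about 0.917, which is why running several attempts and keeping the best one helps.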

1 comments

hmokiguess 112 days ago

Interesting, and how does Twill use it in that feature?


danoandco 111 days ago

On the Twill web app, you can run the same task across different agents and multiple attempts (each in its own sandbox). Then you pick the best result. This is super handy for UI work where you can open the live preview for each attempt and compare. Next step for us is adding a final pass where an agent evaluates the results and combines the best parts into one PR.
