| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by pacjam 225 days ago
	Not sure why Cursor CLI isn't on the leaderboard... I'm guessing it's because Cursor is focused primarily on their IDE agent, not their CLI agent, and Terminal-Bench is an eval/benchmark for CLI agents exclusively. If you're asking about why Letta Code isn't on the leaderboard, the TBench maintainers said it should be up later today (so probably refresh in a few hours!). The results are already public, you can see them on our blog (graphs linked in the OP). They are also verifiable, all data is available for the runs + Letta Code is open source, so you can replicate the results yourself.

1 comments

koakuma-chan 225 days ago

I mean, I understand that this is a terminal benchmark, but the point is to benchmark LLM harnesses, and whether the output is printed to the terminal, or displayed in the UI shouldn't matter. Are there alternative benchmarks where I can see how Letta Code performs compared to cursor?

link

pacjam 225 days ago

Ah gotcha! In that case, I think Terminal-Bench is currently the best proxy for "how good is this harness+agent combo at coding (quantitatively)" question. I think it used to be SWE-Bench, but I think T-Bench is a better proxy for this now. Like you said though, unfortunately Cursor isn't listed (probably their choice to not list it, maybe because it doesn't place highly).

link

koakuma-chan 224 days ago

Alright, I will try out Letta Code manually later then.

link

pacjam 224 days ago

Cool, let us know what you think! Would recommend trying w/ Sonnet/Opus 4.5 or GPT-5.2 (those are the daily drivers we use internally w/ Letta Code)

link