by adinagoerres 327 days ago
Hey HN, I'm Adina, Stefan's co-founder at superglue. When we started working on LLM-powered integrations about a year ago, the models were barely good enough to handle simple mappings. We started benchmarking our performance as an internal evals project and thought it would be fun to open source it, to create more transparency around LLM performance. Our goal here is to understand how we can make agents production-ready and improve reliability across the board.
1 comment

Love the benchmarks. Is it better to use a single LLM for performance, or would you always advise adding a self-reflection step?
Self-reflection is very important for both humans and LLMs, indeed.