by lumax15 1189 days ago
We don't have any rigorous benchmarks against Copilot yet, but we're building an evaluation framework so we can run them. We've experimented with traditional academic metrics for codegen (e.g. pass@k) and found they don't correlate well with real-world performance. Also, I want to mention that we're not competing directly with Copilot. A benchmark against Copilot is useful as we improve our product, but our main value add is not that we perform better than Copilot; it's that we serve a customer segment that can't use Copilot. Would love to hear any thoughts you have.

For training, we start with a capable open-source base model, augment it with a bunch of permissively-licensed repos, and then fine-tune on the customer codebase.

We currently support C/C++, Go, Gosu, Java, JavaScript, Python, Ruby, and TypeScript, and we're continuously adding new languages.

1 comment

What would a benchmark even look like?

Does it make useful code? Does it make the same code?

Or more strictly on something like latency and cost?