by bfogelman
1027 days ago
Glad this work is happening! That said, treating HumanEval as the current gold standard for benchmarking models is a crime. The dataset is tiny (around 150 examples), and the problems aren’t really indicative of actual software engineering work. Also, we’ve been able to get around 85% pass@1 with GPT-4 internally as of a couple of weeks ago. It’s hard to say whether they’ve contaminated the models with RLHF, though. It’s still exciting how close open source models are getting, but we’ve still got a decent amount of work to go!
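
For anyone unfamiliar with the metric: pass@1 is the chance that a single sampled completion passes a problem's unit tests, averaged over the benchmark. Below is a minimal sketch of the unbiased pass@k estimator described in the paper that introduced HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is how many of them passed. The function names and the example counts are made up for illustration, and this is not necessarily how the 85% figure above was computed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 10 samples each.
example_results = [(10, 10), (10, 3), (10, 0)]
print(benchmark_pass_at_k(example_results, k=1))
```

With k=1 the estimator reduces to c/n per problem, which is also why a benchmark with so few problems gives a noisy number: each problem moves the overall score by well over half a percentage point.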
We're working hard to use these advances to build models that are production ready. One idea is to run a mixture of experts over several fine-tuned CodeLlamas.