For pass@1, HumanEval tells you how well the model solves the tasks in the benchmark set when it gets only one chance at each. It's not a perfect metric; there are others like DS-1000 and MBPP (we have included them on the HuggingFace model card). HumanEval is good for comparing against other models because it gives a quick idea of how powerful the model is.
my understanding is that there are two usages of the pass@{number} syntax. the HumanEval/Codex paper interprets the {number} as the number of attempts[0]. however, language modelers seem to use it to denote the number of few-shot example demonstrations given in the context. these are starkly different and i wish the syntax wasn't overloaded
> Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
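
To make the "number of attempts" reading concrete, here is a minimal sketch of how pass@k can be computed. It assumes you generated n samples per problem and c of them passed the unit tests, and it uses the combinatorial estimator 1 - C(n-c, k) / C(n, k) described in the Codex paper. The function names and the example counts are made up for illustration; this is not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that k samples drawn without replacement are all failures,
    # subtracted from 1. comb() returns 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems: (samples generated, samples passing).
results = [(10, 3), (10, 0), (10, 10)]
print("pass@1:", benchmark_pass_at_k(results, 1))
print("pass@5:", benchmark_pass_at_k(results, 5))
```

With k = 1 the per-problem estimate reduces to c / n, which is the "one chance to solve it" number discussed above. Nothing in this calculation involves few-shot demonstrations in the prompt.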