Hacker News
by amasad 1137 days ago
I can't find it now, but I'm pretty sure BigCode said somewhere that they explicitly looked for it and removed it. Also, the subjective measure does match up with the benchmark: our finetuned model scored +50% on HumanEval, and when using it, it felt at least that much improved.

You can view the prompts, solutions, and checks here[0]. See my sibling comment (to yours), where I quote the HumanEval paper and do some more analysis. But I think if you look at [0] you'll see that these aren't really unique problems and are likely to be repeated many times in the dataset. I should have added the dataset paper[1] to that comment (too late to edit), where they mention that they just scrape all of GitHub (Jan 1 2015 - Mar 31 2022). They do exact and near-deduplication, but near-deduplication is messy.

> We implement near-deduplication in our pre-processing pipeline on top of exact deduplication. We first split the files into words/tokens based on non-alphanumeric characters and remove files with fewer than 10 tokens. Next, we compute the MinHash with 256 permutations of all documents, and use Locality Sensitive Hashing to find clusters of duplicates. We further reduce these clusters by ensuring that each file in the original cluster is similar to at least one other file in the reduced cluster. We consider two files similar when their Jaccard similarity exceeds 0.85.
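
To make the quoted pipeline concrete, here is a minimal sketch of the tokenize / MinHash / Jaccard steps. The 256 permutations, the 10-token minimum, and the 0.85 threshold come from the quote; everything else (the function names, the universal-hashing scheme, the brute-force pairwise comparison standing in for LSH and the cluster-reduction step) is my own assumption and not the authors' actual code.

```python
import hashlib
import random
import re

NUM_PERM = 256          # from the quoted paper
THRESHOLD = 0.85        # from the quoted paper
MIN_TOKENS = 10         # from the quoted paper
_PRIME = (1 << 61) - 1  # Mersenne prime for universal hashing (my choice)


def tokenize(text):
    """Split on non-alphanumeric characters, dropping empty strings."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def jaccard(a, b):
    """Exact Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _token_hash(token):
    # Stable across runs, unlike the built-in hash() for strings.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % _PRIME


def make_permutations(num_perm=NUM_PERM, seed=0):
    """Hypothetical helper: (a, b) pairs for h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_perm)]


def minhash_signature(tokens, perms):
    """One minimum per simulated permutation over the unique tokens."""
    hashes = [_token_hash(t) for t in set(tokens)]
    return [min((a * h + b) % _PRIME for h in hashes) for a, b in perms]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching slots, which estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def near_duplicates(files, threshold=THRESHOLD):
    """files: dict of name -> text. Returns [(name_a, name_b, estimate)].

    Brute-force O(n^2) comparison; the paper uses LSH to avoid this.
    """
    perms = make_permutations()
    sigs = {}
    for name, text in files.items():
        tokens = tokenize(text)
        if len(tokens) < MIN_TOKENS:
            continue  # files with fewer than 10 tokens are removed
        sigs[name] = minhash_signature(tokens, perms)
    names = sorted(sigs)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            est = estimate_similarity(sigs[a], sigs[b])
            if est > threshold:
                pairs.append((a, b, est))
    return pairs
```

This also shows why short solutions are hard to catch: two files with 10 unique tokens each that share 9 of them have a Jaccard similarity of 9/11 ≈ 0.82, which is already under the 0.85 cutoff. Renaming a single identifier in a small function can be enough.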

Near-duplicates are still difficult to measure, so we should expect some duplication to remain, and it should be proportional to the number of samples we have (even at the same variance, though I'd wager the variance is higher with larger amounts of duplication).

[0] https://github.com/openai/code-align-evals-data/tree/97446d9...

[1] https://arxiv.org/abs/2211.15533