by godelski
870 days ago
Does anyone know how much spoilage is in these datasets? Common Crawl has a lot of websites in it, including Reddit and Stack*. I'm certain there are lots of questions in those datasets, and we want to differentiate recall from problem solving (the two are often confused). I have a deep distrust of results on large datasets like this, given that a commonly used one, with 60 authors, assumed that writing LeetCode-style programs by hand would mean they wouldn't appear in the training data (GitHub), and the authors didn't even bother to check. It's really hard to sanitize datasets of this size, and deduplication is a much harder task than many realize. https://arxiv.org/abs/2107.03374 https://arxiv.org/abs/2303.09540
>To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023). The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.
However, detecting benchmark contamination is difficult, and n-gram matching is often insufficient. See https://arxiv.org/pdf/2311.04850.pdf for some examples of how this approach can fail.
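To make the failure mode concrete, here is a rough sketch of the kind of filter the quoted passage describes, not the paper's actual code. The tokenizer (lowercased word tokens), the helper names, and the example problems are all my own assumptions, and I'm reading "exact matching" for short benchmark texts as "the whole benchmark text appears verbatim in the segment".

```python
import re

def tokens(text):
    # Assumption: lowercased word tokens. The quoted passage does not say
    # how the text is tokenized.
    return re.findall(r"\w+", text.lower())

def ngrams(toks, n):
    # All contiguous n-token windows, as a set of tuples.
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(benchmark_texts, n=10, min_n=3):
    # Long benchmark texts contribute all of their n-grams; texts with
    # fewer than n but at least min_n tokens are kept whole.
    long_grams, short_texts = set(), []
    for text in benchmark_texts:
        toks = tokens(text)
        if len(toks) >= n:
            long_grams |= ngrams(toks, n)
        elif len(toks) >= min_n:
            short_texts.append(tuple(toks))
    return long_grams, short_texts

def is_contaminated(segment, index, n=10):
    long_grams, short_texts = index
    toks = tokens(segment)
    # Rule 1: any shared n-gram flags the segment.
    if ngrams(toks, n) & long_grams:
        return True
    # Rule 2: a short benchmark text appearing verbatim flags the segment.
    for short in short_texts:
        if short in ngrams(toks, len(short)):
            return True
    return False

# Made-up benchmark items, for illustration only.
index = build_index([
    "A farmer has 12 cows and buys 7 more cows at the market. "
    "How many cows does the farmer have now?",
    "Compute 7 times 8",
])

verbatim = ("Homework help: A farmer has 12 cows and buys 7 more cows at "
            "the market. How many cows does the farmer have now? Answer: 19")
paraphrase = ("A rancher owns 12 cattle, then purchases 7 additional ones "
              "at a fair. What is the total number of cattle?")

print(is_contaminated(verbatim, index))    # verbatim copy is caught
print(is_contaminated(paraphrase, index))  # same problem reworded slips through
```

The paraphrase is the same problem with the same answer, but it shares no 10-gram with the benchmark item, so an exact-match filter keeps it in the training set.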
In general, if a benchmark is available online before a model's dataset is collected, I put very little stock in that model's performance on that benchmark. It's just too hard to know what's a true improvement and what's contamination. That's especially true for a paper like this that specifically hunts down MATH-like data.