by zhangce 969 days ago
There are actually a few ways to do this, and we have four:

- `rps_doc_ml_wikiref_score`: a classifier that separates random webpages from pages used as Wikipedia references (used in Llama-1)

- `ccnet_perplexity`: perplexity of an LM trained on Wikipedia (used in CCNet)

- `rps_doc_ml_wikipedia_score`: classifier prediction for the document being a Wikipedia article

- `rps_doc_wikipedia_importance`: used in https://arxiv.org/abs/2302.03169

You can see the full table here: https://together.ai/blog/redpajama-data-v2
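
For anyone wondering how these might be combined in practice, here is a rough sketch rather than anything we ship. It assumes each document's signals have already been flattened into a plain dict of floats keyed by the names above; the thresholds, `MAX_PERPLEXITY`, and the `looks_wikipedia_like` helper are all hypothetical and only there for illustration.

```python
from typing import Mapping

# Hypothetical cutoffs, picked only for illustration; tune them on your own data.
CLASSIFIER_THRESHOLDS = {
    "rps_doc_ml_wikiref_score": 0.25,
    "rps_doc_ml_wikipedia_score": 0.25,
}
MAX_PERPLEXITY = 500.0  # assumed cutoff; lower perplexity means more Wikipedia-like text


def looks_wikipedia_like(signals: Mapping[str, float]) -> bool:
    """Keep a document if its perplexity is low enough and at least one
    classifier score clears its (hypothetical) threshold."""
    perplexity = signals.get("ccnet_perplexity")
    # A missing or too-high perplexity rejects the document outright.
    if perplexity is None or perplexity > MAX_PERPLEXITY:
        return False
    # Missing classifier scores are treated as 0.0, i.e. they never pass.
    return any(
        signals.get(name, 0.0) >= cutoff
        for name, cutoff in CLASSIFIER_THRESHOLDS.items()
    )


if __name__ == "__main__":
    doc = {"ccnet_perplexity": 310.0, "rps_doc_ml_wikiref_score": 0.8}
    print(looks_wikipedia_like(doc))
```

The released files may store each signal in a different shape than a flat float, so check the table linked above before adapting this. Requiring low perplexity plus any one classifier is just one possible rule; a stricter filter could require all of them.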