by natch 970 days ago
Thanks.

> how similar this document is to Wikipedia

So that’s a measure of how similar it is to the background vector of all (language in focus) Wikipedia data?

1 comment

There are actually a few ways to do this, and we provide four:

- `rps_doc_ml_wikiref_score`: a classifier that separates random webpages from pages cited as Wikipedia references (used in Llama-1)

- `ccnet_perplexity`: perplexity of an LM trained on Wikipedia (used in CCNet)

- `rps_doc_ml_wikipedia_score`: classifier prediction for the document being a Wikipedia article

- `rps_doc_wikipedia_importance`: importance weights, as used in https://arxiv.org/abs/2302.03169

You can see the full table here: https://together.ai/blog/redpajama-data-v2
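
To make the perplexity and importance ideas a bit more concrete, here is a toy sketch. It is not the actual RedPajama or CCNet code: it assumes a whitespace tokenizer and an add-one-smoothed unigram model, and every name in it (`train_unigram`, `perplexity`, `log_importance`, the tiny `wiki`/`web` corpora) is made up for illustration. The real signals use much stronger models trained on real Wikipedia data.

```python
import math
from collections import Counter

def train_unigram(texts):
    """Count whitespace tokens; returns (counts, total_token_count)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    return counts, sum(counts.values())

def log_prob(doc, model, vocab_size):
    """Log-probability of doc under an add-one-smoothed unigram model."""
    counts, total = model
    return sum(
        math.log((counts[tok] + 1) / (total + vocab_size))
        for tok in doc.lower().split()
    )

def perplexity(doc, model, vocab_size):
    """Per-token perplexity; lower means the doc looks more like the training data."""
    n = len(doc.split())
    if n == 0:
        raise ValueError("empty document")
    return math.exp(-log_prob(doc, model, vocab_size) / n)

def log_importance(doc, target, source, vocab_size):
    """log p_target(doc) - log p_source(doc); higher means more target-like."""
    return log_prob(doc, target, vocab_size) - log_prob(doc, source, vocab_size)

# Toy stand-ins for "Wikipedia" (target) and "random web" (source) text.
wiki = ["the city is the capital of the region",
        "the river flows through the city"]
web = ["buy cheap shoes now click here",
       "click here to win free shoes"]

wiki_model = train_unigram(wiki)
web_model = train_unigram(web)
# Shared vocabulary size, plus one slot for unseen tokens.
vocab_size = len(set(wiki_model[0]) | set(web_model[0])) + 1

wiki_like = "the capital of the city"
web_like = "click here to buy shoes"

ppl_wiki_like = perplexity(wiki_like, wiki_model, vocab_size)
ppl_web_like = perplexity(web_like, wiki_model, vocab_size)
imp_wiki_like = log_importance(wiki_like, wiki_model, web_model, vocab_size)
imp_web_like = log_importance(web_like, wiki_model, web_model, vocab_size)
```

So to the question above: for the perplexity and importance signals, "similar to Wikipedia" means "likely under a model fit to Wikipedia text" rather than distance to a single background vector, while the two classifier scores are predictions from a trained classifier.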