by div72 409 days ago
> The largest portion of all languages in Common Crawl

https://commoncrawl.github.io/cc-crawl-statistics/plots/lang...


Thanks!

I wonder where this discrepancy comes from

Probably under-indexing of non-English sources by these crawlers.

Would be interesting if Yandex opened some data sets!

And lots of people write on the web using English as a second language, which both reduces the presence of their native language and increases the presence of English.
Yep, not a native English speaker here, and yet my online footprint is mostly English due to software pushing me to learn it.
My guess is that reference counting at depth=1 only captures non-$LANG content whose text parts don't matter much, e.g. photo galleries.