More important to me is how you identify news sites, let alone 200k of them. Is there an online source that lists them? Or do you cherry-pick them one by one?
It's a whole thing... I run a project called websitelaunches, so I have an index of basically the whole internet (500M+ sites). From there I took the top ~200k news-related sites that had an RSS feed.
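For anyone curious what the "had an RSS feed" filter could look like, here is a minimal sketch of one common approach: feed autodiscovery via `<link rel="alternate">` tags in a page's HTML. This is only an illustration, not necessarily how the index above was built; `FeedLinkFinder` and `find_feeds` are hypothetical names, and it assumes you already have the homepage HTML as a string.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise feeds in <link> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> tags that point to feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute names arrive lowercased; values may be None if omitted.
        a = {k: (v or "") for k, v in attrs}
        rel = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and a.get("href"):
            self.feeds.append(a["href"])


def find_feeds(html):
    """Return the feed URLs (possibly relative) advertised in the given HTML."""
    parser = FeedLinkFinder()
    parser.feed(html)
    return parser.feeds


page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '<link rel="stylesheet" href="/style.css">'
    '</head></html>'
)
print(find_feeds(page))
```

Sites that serve a feed without advertising it in the page head would be missed by this check, so a real pipeline would likely need extra heuristics.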
And to add to the above, is there a list of the websites you use and any information on sampling methodology? Is it perfectly random or weighted? Do you trust the timestamp from an RSS feed?