If you're using schema.org, there's a nice filtered data dump from the Common Crawl available here:
http://webdatacommons.org/structureddata/2018-12/stats/schem...