I doubt such a thing, especially considering that some services may require an account in order to read messages, and some are unknown (for various reasons). Also, not all communication is done over the web; some is done using IRC, NNTP, etc.
I think the "web" here is used to refer to the World Wide Web, which I believe is limited to `text/html` [1] — i.e. excludes IRC, NNTP, etc., which would probably just be classified as "internet communication" (as you mention).
It would probably be better if the service were more explicit about what it scanned (after all, 'the web' means very different things to different people), but I think it's safe to say that "scrapable" HTML served over HTTP(S) is the intended meaning.
Yes, I did expect they meant HTML served over HTTP(S), although not all documents are HTML and not all internet communications are "the web", and even among those that are HTML served over HTTP(S) and available on the internet, many might not be found so easily. It does need to be stated more precisely, because "the entire web" still seems not specific enough; even if it is only HTML over HTTP(S), exactly which documents are found? (I also think it strange they say they monitor "the entire web plus social networks". Most (maybe all?) social networks are web-based, and even if some aren't, the service doesn't say specifically which ones it monitors.) (Still, also note that IRC logs are sometimes available over HTTP (in which case presumably they are still going to be found), but sometimes they are plain text and not HTML. Plain text documents are used for other reasons too.)
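To make the "HTML vs. plain text over HTTP" distinction concrete, here is a minimal sketch of how a crawler might decide whether a response counts as HTML by looking at its `Content-Type` header. The function names and the default set of accepted media types are my own assumptions for illustration, not anything the service has said it does.

```python
def media_type(content_type_header: str) -> str:
    """Return the bare media type from a Content-Type header value.

    Parameters such as "; charset=utf-8" are dropped and the result is
    lowercased, since media types are case-insensitive.
    """
    return content_type_header.split(";", 1)[0].strip().lower()


def is_html(content_type_header: str, accepted=frozenset({"text/html"})) -> bool:
    """Hypothetical filter: True if the response would be treated as HTML.

    The default `accepted` set is an assumption that mirrors the
    "limited to text/html" reading discussed in this thread.
    """
    return media_type(content_type_header) in accepted
```

Under this reading, an IRC log served as `text/plain` over HTTP would be skipped even though it is reachable by URL, which is exactly the ambiguity in "the entire web".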
Data ingress is usually free, which really cuts down on costs when scraping. If you can do everything in-memory, it's surprisingly cheap. The important bit is being respectful of robots.txt files and not overloading small sites with too many requests.
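For what "being respectful" could look like in practice, here is a rough sketch using Python's standard `urllib.robotparser` plus a simple per-host delay. The user agent string, the one-second default delay, and the `HostThrottle` class are all assumptions of mine for illustration; a real crawler would also want to honor any `Crawl-delay` a site specifies and back off on errors.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-bot"  # hypothetical crawler name


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class HostThrottle:
    """Spaces out requests so each host sees at most one per `min_delay` seconds."""

    def __init__(self, min_delay: float = 1.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed politeness interval, in seconds
        self.clock = clock          # injectable so the logic can be tested
        self._next_allowed = {}     # host -> earliest time the next request may start

    def wait_time(self, url: str) -> float:
        """Reserve the next slot for this URL's host and return how long to sleep first."""
        host = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now
```

Usage would be roughly: check `robots.can_fetch(USER_AGENT, url)`, skip the URL if it returns `False`, otherwise `time.sleep(throttle.wait_time(url))` before making the request. Keeping the throttle per host means small sites are protected without slowing down the crawl as a whole.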