| HN Mirror



	by john111 2386 days ago
	You say you monitor the whole web. How is that possible?

1 comments

zzo38computer 2386 days ago

I doubt that is possible, especially considering that some services require an account in order to read messages, and some are unknown (for various reasons). Also, not all communication is done over the web; some is done using IRC, NNTP, etc.


epoch_100 2386 days ago

I think the "web" here is used to refer to the World Wide Web, which I believe is limited to `text/html` [1] — i.e. excludes IRC, NNTP, etc., which would probably just be classified as "internet communication" (as you mention).

It would probably be better if the service were more explicit about what it scanned (after all, 'the web' means very different things to different people), but I think it's safe to say that "scrapable" HTML served over HTTP(S) is the intended meaning.

[1] https://en.wikipedia.org/wiki/World_Wide_Web
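To make that reading concrete, here is a minimal sketch of the filter I have in mind. It assumes "the web" means an `http`/`https` URL whose `Content-Type` is `text/html`; that is my guess at the intended meaning, not anything the service has documented. The function name `is_scrapable_html` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def is_scrapable_html(url: str, content_type_header: str) -> bool:
    """Return True if the URL is http(s) and the response is labelled text/html.

    Assumption: this narrow definition of "the web" is only an illustration.
    """
    scheme = urlsplit(url).scheme.lower()
    # Drop parameters such as "; charset=utf-8" before comparing the media type.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return scheme in {"http", "https"} and media_type == "text/html"
```

Under this definition an `irc://` URL is excluded by its scheme, and a plain-text IRC log served over HTTP is excluded by its `text/plain` media type.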


zzo38computer 2386 days ago

Yes, I did expect they meant HTML served over HTTP(S), although not all documents are HTML and not all internet communications are "the web", and even among those that are HTML served over HTTP(S) and available on the internet, many might not be found so easily. It does need to be stated more clearly, because "the entire web" still seems not specific enough; even if it is only HTML over HTTP(S), exactly which documents are found? (I also think it strange that they say they monitor "the entire web plus social networks". Most (maybe all?) social networks are web based, and even if some aren't, they don't specifically say which ones.) (Still, also note that IRC logs are sometimes available over HTTP (in which case presumably they are still going to be found), but sometimes they are plain text and not HTML. Plain text documents are used for other reasons too.)


corentin88 2386 days ago

Even if it’s just HTTP(S) requests, that’s a lot of data to find and crawl. The bandwidth costs are probably insane.


JamesGreene 2385 days ago

I have a background in scraping from prior projects over the last decade.

Bandwidth is not a concern for projects like this at a lot of hosting/VPS providers.


epoch_100 2385 days ago

Data ingress is usually free, which really cuts down on costs when scraping. If you can do everything in-memory, it's surprisingly cheap. The important bit is being respectful of robots.txt files and not overloading small sites with too many requests.
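As a rough sketch of what "respectful" could look like, the snippet below checks robots.txt rules with Python's standard `urllib.robotparser` and spaces out requests per host. The user agent string `example-crawler`, the one-second default delay, and the `PoliteScheduler` class are all assumptions made up for illustration, not how any particular service does it.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical crawler name
DEFAULT_DELAY = 1.0             # assumed seconds between hits to one host


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may start."""

    def __init__(self, default_delay: float = DEFAULT_DELAY):
        self.default_delay = default_delay
        self._next_allowed = {}  # host -> earliest allowed start time

    def wait_time(self, url: str, rules: RobotFileParser, now: float):
        """Return seconds to wait before fetching url, or None if disallowed."""
        if not rules.can_fetch(USER_AGENT, url):
            return None
        host = urlsplit(url).netloc
        # Honour Crawl-delay when the site sets one, else use our default.
        delay = rules.crawl_delay(USER_AGENT) or self.default_delay
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + delay
        return start - now
```

A crawler loop would call `wait_time(url, rules, time.monotonic())`, skip the URL on `None`, and otherwise sleep for the returned number of seconds before fetching. Keeping only a small per-host timestamp table also fits the "do everything in-memory" approach.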
