Hacker News new | ask | show | jobs
by nothis 2114 days ago
>Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

The scale of web stuff sometimes surprises me. 1B web pages sounds like just about the daily web output of humanity? How can you handle this with 4 (fast) computers?

2 comments

Computers are very fast. We just tend to not notice because today's software is obese.
Yes, let's all run separate web browsers as the application and run our own JavaScript inside our browser. Who cares if there's 5 other "apps" doing exactly the same!

Insanity.

Multiple tabs/browser windows is similar and generally not an issue.
I think they were referring more to apps like Slack and other similar JS browser/ JS based apps which run separate from the browser. Maybe I'm being generous? Slack is certainly itself a beast.
Yes this is precisely what I meant. eg. Electron apps, Slack, VS Code, Skype etc. etc. ad nauseum
Doesn't it depend on a lot of things? For example you can only do head requests to see if a page changed since a given timestamp. If not then there is no need to process it.
For anybody wondering how:

The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method. For example, if a URL might produce a large download, a HEAD request could read its Content-Length header to check the filesize without actually downloading the file.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HE...

Typically you wouldn’t bother with a HEAD request, you’d do a conditional GET request.

When you request a page, the response includes metadata, usually including a Last-Modified timestamp and often including an ETag (entity tag). Then when you make subsequent requests, you can include these in If-Modified-Since and If-None-Match request headers.

If the resource hasn’t changed, then the server responds with 304 Not Modified instead of sending the resource all over again. If the resource has changed, then the server sends it straight away.

Doing it this way means that in the case where the resource has changed, you make one request instead of two, and it also avoids a race condition where the resource changes between the HEAD and the GET requests.

Do a lot of random pages return etags? I've only ever seen them in the AWS docs for boto3
nginx sends it by default for static files (example: hacker news [0]), I assume other web servers do too.

[0] https://news.ycombinator.com/y18.gif