by wheels 5325 days ago
Have you considered putting a caching reverse proxy in front of the arc app to keep the backend from having to render all of the old pages?

It seems like the only dynamic element of old articles is the "$x days ago" bit, and that'd be pretty easy to make static: put the timestamps in the actual HTML and use JavaScript to transform them into how many hours / days ago they were. Then the crawlers would just be pulling out cached, pre-rendered HTML.
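
To make the idea concrete, here's a rough sketch of the conversion. It's written in Python only for illustration; in the browser the same arithmetic would be done in JavaScript against the timestamp embedded in the page. The function name `time_ago` and the minute/hour/day cutoffs are my own assumptions, not anything from the arc app.

```python
from datetime import datetime, timezone


def time_ago(posted: datetime, now: datetime) -> str:
    """Turn an absolute timestamp into an "N units ago" label.

    Hypothetical helper: the cutoffs below are assumed, not HN's.
    """
    seconds = int((now - posted).total_seconds())
    if seconds < 3600:
        n, unit = seconds // 60, "minute"
    elif seconds < 86400:
        n, unit = seconds // 3600, "hour"
    else:
        n, unit = seconds // 86400, "day"
    # Pluralize unless the count is exactly one.
    return f"{n} {unit}{'' if n == 1 else 's'} ago"


posted = datetime(2010, 1, 1, tzinfo=timezone.utc)
print(time_ago(posted, datetime(2010, 1, 21, tzinfo=timezone.utc)))
```

Since the page then carries only the absolute timestamp, the HTML never changes and can be cached indefinitely.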

There's an example of doing this with nginx here:

http://serverfault.com/questions/30705/how-to-set-up-nginx-a...

With that you'd just have to send out the HTTP header from the arc app saying that current articles expire immediately, and old ones don't.
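
Something like this on the app side would be enough, sketched in Python rather than Arc. The 14-day cutoff, the one-year max-age, and the name `cache_control` are all assumptions for illustration; the only part taken from the suggestion above is "current articles expire immediately, old ones don't".

```python
from datetime import datetime, timedelta, timezone

# Assumed values, not anything HN actually uses.
ARCHIVE_AFTER = timedelta(days=14)      # when an article counts as "old"
ARCHIVE_MAX_AGE = 365 * 24 * 3600       # one year, in seconds


def cache_control(posted: datetime, now: datetime) -> str:
    """Pick a Cache-Control value the reverse proxy can act on."""
    if now - posted >= ARCHIVE_AFTER:
        # Old article: let the shared cache keep it for a long time.
        return f"public, max-age={ARCHIVE_MAX_AGE}"
    # Current article: expires immediately, so the proxy asks the backend.
    return "max-age=0"


now = datetime(2010, 1, 21, tzinfo=timezone.utc)
print(cache_control(now - timedelta(days=20), now))
print(cache_control(now - timedelta(hours=3), now))
```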

I believe Rtm has already set one up.
The conspicuous lack of a "Server:" header inclines me to believe that that's probably not the case (most web servers set one indicating the server software and version). Here are the headers that HN sends out from an old post (20 days ago):

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=utf-8
  Cache-Control: private
  Connection: close
  Cache-Control: max-age=0
My favorite part of HN's headers: the lines are separated by naked LFs instead of CRLF, in violation of the HTTP spec.
This is a common violation that everyone accepts. It's definitely done by 'bad' clients - not sure how often servers send bare LF.

(I used to telnet to port 80 for testing, and type GET / HTTP/1.0 <enter> <enter>, and that should be LF on Linux & Mac)
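
For what it's worth, tolerating bare LF on the receiving side is cheap. A minimal sketch in Python, using a made-up helper `split_header_lines` (not taken from any real client or server), that accepts either line ending:

```python
def split_header_lines(raw: bytes) -> list:
    """Split a raw HTTP header block into lines, accepting CRLF or bare LF.

    Hypothetical helper for illustration; stops at the blank line that
    ends the header block and ignores anything after it.
    """
    lines = []
    for line in raw.split(b"\n"):
        # Drop the CR if the sender used a proper CRLF.
        if line.endswith(b"\r"):
            line = line[:-1]
        if line == b"":
            break  # blank line: end of headers
        lines.append(line)
    return lines


strict = b"HTTP/1.1 200 OK\r\nConnection: close\r\n\r\nbody"
lenient = b"HTTP/1.1 200 OK\nConnection: close\n\nbody"
print(split_header_lines(strict) == split_header_lines(lenient))
```

A parser that splits only on CRLF would see the second response as one long line, which is the case the complaint above is about.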

You don't have a problem with one of the most trafficked sites for programming/web startup-related news implementing HTTP incorrectly?

Do you ignore whether your HTML is valid just because the browser rendered it correctly?

Yup.

I've got real work to do. Making a validator happy is fake work.

You don't know that everyone accepts it. Even if they did, it doesn't make it right.
I submitted a patch for this in the pecl_http PHP library:

https://bugs.php.net/bug.php?id=58442

We use varnish for caching and check the User-Agent for requests.

If the cache has a copy of an article that is a few hours old, it will just give that version to Googlebot, while if it thinks a human is requesting the page it will go to the backend and fetch the latest version.

https://www.varnish-cache.org/lists/pipermail/varnish-misc/2...
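
The decision itself is roughly the following. This is plain Python to show the logic, not Varnish VCL and not our actual config; the crawler tokens, the 6-hour limit standing in for "a few hours", and the function names are all assumptions.

```python
from typing import Optional

# Assumed values for illustration only.
CRAWLER_TOKENS = ("googlebot", "bingbot")
MAX_STALE_FOR_CRAWLERS = 6 * 3600  # "a few hours", in seconds


def is_crawler(user_agent: str) -> bool:
    """Crude User-Agent check: substring match against known crawlers."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def should_serve_cached(user_agent: str, cached_age: Optional[int]) -> bool:
    """Return True to answer from cache, False to go to the backend.

    cached_age is the age of the cached copy in seconds, or None if
    there is no copy in the cache.
    """
    if cached_age is None:
        return False
    # Crawlers get the slightly stale copy; humans always hit the backend.
    return is_crawler(user_agent) and cached_age <= MAX_STALE_FOR_CRAWLERS


print(should_serve_cached("Mozilla/5.0 (compatible; Googlebot/2.1)", 2 * 3600))
print(should_serve_cached("Mozilla/5.0 (X11; Linux x86_64) Firefox/4.0", 2 * 3600))
```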

+1 for varnish. It's stupidly[1] fast and there shouldn't be much trickery required to deflect most of HN's traffic (e.g. ~10 sec expiry for "live" pages, infinite expiry for archived pages).

[1] 15k reqs/sec on a moderate box