by nwh 4490 days ago
More likely archive.is is pretending to be Googlebot. The author uses a lot of server-side tricks, such as fake Facebook accounts, so that the site can archive Facebook pages properly.

I'm attempting to replicate this. Searching for the last sentence (which is behind the paywall) brings it up in Google right away, so I think you're right. However, using Googlebot's user agent doesn't work, so it must be slightly more sophisticated. The result in Google is also not paywalled, though going directly to the link is. So maybe they use a simpler strategy and just mess with the parameters. This is the result from Google: http://online.wsj.com/news/articles/SB1000142405270230388060...
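
For anyone else trying to replicate the user-agent test, here is a minimal sketch of what "using Googlebot's user agent" amounts to. It only builds the request and does not send it. The helper name `build_googlebot_request` and the `example.com` URL are made up for illustration; the user-agent string is the one Googlebot publicly identifies itself with.

```python
import urllib.request

# The user-agent string Googlebot announces itself with.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_googlebot_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that claims to be Googlebot.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


# Placeholder URL, not the article discussed in this thread.
req = build_googlebot_request("http://example.com/article")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would actually send it. One possible reason the header alone fails, as reported above, is that a site can check the requesting IP with a reverse DNS lookup rather than trusting the header; whether this site does that is a guess.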
Searching Google for

    "cache:http://online.wsj.com/news/articles/SB10001424052702303880604579405852448992982?"
gets me the full-text article. They could just be stripping the header from the cached page and displaying that? I know it does detection of cached Google pages in some circumstances.
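
A small sketch of how that query could be built programmatically, assuming the `cache:` operator behaves as described above. The function name `google_cache_search_url` is hypothetical, and the search endpoint with its `q` parameter is just Google's ordinary search URL; nothing here is known to be what archive.is actually does.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ARTICLE_URL = (
    "http://online.wsj.com/news/articles/"
    "SB10001424052702303880604579405852448992982?"
)


def google_cache_search_url(page_url: str) -> str:
    """Return a Google search URL whose query is `cache:<page_url>`.

    Hypothetical helper; urlencode percent-escapes the embedded URL so it
    survives as a single `q` parameter.
    """
    return "https://www.google.com/search?" + urlencode({"q": "cache:" + page_url})


search_url = google_cache_search_url(ARTICLE_URL)
print(search_url)

# Round-trip check: decoding the query string gives back the original query.
decoded_query = parse_qs(urlsplit(search_url).query)["q"][0]
print(decoded_query)
```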