Hacker News
by nwh 4490 days ago
Article without the paywall — http://archive.is/JWikk

It's pretty embarrassing that you can read most of the article with the paywall intact.
That's amazing. If I had subscribed to see that article, I would be pretty pissed.
I think it's more intended for other news outlets.
I've never understood how WSJ operate their paywall. They seem to allow it to be bypassed on any article by simply using Google as a referrer.

For any given paywalled article, if you Google the article headline and click on the WSJ link in the results, the full article is displayed.
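For anyone who wants to see what that click looks like at the HTTP level, here is a minimal sketch. It assumes the paywall decision keys off the Referer header, which is a guess based on the behavior described above and not something WSJ has confirmed. The function name and the example.com URL are placeholders, and the request is only built, never sent.

```python
from urllib.request import Request


def build_google_referred_request(article_url: str) -> Request:
    """Build (but do not send) a request that looks like a click from Google results."""
    req = Request(article_url)
    # Assumption: the site inspects the Referer header to decide whether
    # to serve the full article text.
    req.add_header("Referer", "https://www.google.com/")
    return req


# Placeholder URL, not a real article.
req = build_google_referred_request("http://example.com/news/articles/some-article")
print(req.get_header("Referer"))
```

Passing the result to `urllib.request.urlopen` would actually send it; whether the full text comes back depends entirely on how the site treats the header.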

If they activated the paywall for Googlebot, they'd get no visitors from Google searches.

If they deactivated the paywall for Googlebot, but activated it for visitors, they'd get banned from Google for Experts Exchange-style cloaking.

They must figure that random Google visitors are worth more than the people who are clever enough to Google the article titles they want to read to get past the paywall.
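The three options above can be written out as a toy model. This is only a sketch of the trade-off, not WSJ's actual logic: the policy names, the substring check for the crawler, and the referrer check are all made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical policy names, one per option discussed above.
WALL_EVERYONE = "wall_everyone"                # paywall for Googlebot too
CRAWLER_ONLY = "crawler_only"                  # full text for Googlebot only (cloaking)
CRAWLER_AND_CLICKS = "crawler_and_clicks"      # full text for Googlebot and Google referrals


def came_from_google(referrer: str) -> bool:
    """True if the referrer's host is google.com or a subdomain of it."""
    host = urlparse(referrer).hostname or ""
    return host == "google.com" or host.endswith(".google.com")


def shows_full_article(user_agent: str, referrer: str, policy: str) -> bool:
    """Decide whether a non-subscriber sees the full article under a given policy."""
    is_crawler = "Googlebot" in user_agent  # naive check, trivially spoofable
    if policy == WALL_EVERYONE:
        return False
    if policy == CRAWLER_ONLY:
        return is_crawler
    if policy == CRAWLER_AND_CLICKS:
        return is_crawler or came_from_google(referrer)
    raise ValueError(f"unknown policy: {policy}")


# A visitor arriving from a Google results page, under each policy.
for policy in (WALL_EVERYONE, CRAWLER_ONLY, CRAWLER_AND_CLICKS):
    print(policy, shows_full_article("Mozilla/5.0", "https://www.google.com/search?q=x", policy))
```

Only the third policy gives crawler and searcher the same page, which is why it avoids the cloaking problem while leaving the Google-the-headline loophole open.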

I was wondering if they were deliberately allowing archive sites past the paywall.
More likely archive.is is pretending to be Googlebot. The author has a lot of server-side tricks, like fake Facebook accounts so that it can archive Facebook pages properly.
I'm attempting to replicate this. Searching for the last sentence (which is behind the paywall) brings it up in Google right away, so I think you're right. However, using Googlebot's user agent doesn't work, so it must be slightly more sophisticated. The result in Google is also not paywalled, though going directly to the link is. So maybe they use a simpler strategy and just mess with the URL parameters. This is the result from Google: http://online.wsj.com/news/articles/SB1000142405270230388060...
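One possible reason a spoofed user agent fails, and this is purely an assumption about WSJ, is that the server verifies the crawler by DNS. Google documents a reverse lookup followed by a forward lookup as the way to confirm a real Googlebot. A rough sketch of that check is below; the function names are made up, and the resolvers are injectable so the example runs without touching the network.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Forward DNS: hostname to list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Check that an IP reverse-resolves to a Google crawler host and back again."""
    try:
        hostname = reverse(ip)
    except OSError:
        return False
    # Google's crawler hosts sit under these domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The forward lookup must map back to the same IP, otherwise
        # anyone controlling their own reverse DNS could fake the hostname.
        return ip in forward(hostname)
    except OSError:
        return False


# Offline demo with stub resolvers (documentation-range IP, invented hostname).
spoofer = is_verified_googlebot(
    "203.0.113.5",
    reverse=lambda ip: "host.example.net",
    forward=lambda host: ["203.0.113.5"],
)
print(spoofer)
```

A client that only changes its User-Agent string fails this check no matter what the header says, which would match what I'm seeing.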
Searching at Google for

    "cache:http://online.wsj.com/news/articles/SB10001424052702303880604579405852448992982?"
gets me the full-text article. Could they just be stripping the header from the cached page and displaying that? I know it does detection of cached Google pages in some circumstances.
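If you want to script that lookup, a small sketch of building the search URL is below. It assumes the `cache:` operator still behaves the way it does for me, and the helper name is made up. It only constructs the URL; nothing is fetched.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def google_cache_search_url(page_url: str) -> str:
    """Return a Google search URL whose query is 'cache:' plus the page URL."""
    # urlencode percent-escapes the colon, slashes and '?' inside the query value.
    return "https://www.google.com/search?" + urlencode({"q": "cache:" + page_url})


article = "http://online.wsj.com/news/articles/SB10001424052702303880604579405852448992982?"
search_url = google_cache_search_url(article)

# Decoding the query string should give back the original search term.
decoded_query = parse_qs(urlsplit(search_url).query)["q"][0]
print(decoded_query)
```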