Hacker News
by city41 881 days ago
One of my websites has links to nytimes.com. They work fine if clicked on manually. Bernard reports them as a 403. I wonder if NYT is classifying Bernard as a scraper?

I'm also seeing valid links to reuters.com coming up as 401 Unauthorized.
The Internet is a wild place, and I reckon 90% of the complexity of a crawler is dealing with workarounds and non-compliant servers (cough www.apple.com cough).

I'll have a look, thanks for the heads up.

Are you setting the headers to make the sites think it's a browser?

Edit: User-Agent: bernard/1.0

I bet that's going to cause issues.

I'd fake a browser user agent for off-domain sites.
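
Something like this is what I mean, as a rough sketch only. I don't know how Bernard actually builds its requests, so the function names, the `own_host` parameter, and the browser string below are all made up for illustration; only `bernard/1.0` comes from this thread.

```python
from urllib.parse import urlsplit
from urllib.request import Request

CRAWLER_UA = "bernard/1.0"  # the user agent quoted in this thread
# Assumption: an example browser-style string, not anything Bernard ships with.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def is_off_domain(url: str, own_host: str) -> bool:
    """Return True when the URL's host is neither own_host nor a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    own = own_host.lower()
    return not (host == own or host.endswith("." + own))


def build_request(url: str, own_host: str) -> Request:
    """Build a HEAD request, using the browser-like user agent only off-domain."""
    ua = BROWSER_UA if is_off_domain(url, own_host) else CRAWLER_UA
    headers = {"User-Agent": ua, "Accept": "text/html,*/*;q=0.8"}
    return Request(url, headers=headers, method="HEAD")


# Hypothetical usage: the site being crawled is example.com.
req = build_request("https://www.nytimes.com/", "example.com")
print(req.get_method(), req.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would send it. No promise this gets past every 403 or 401, since some servers look at more than the user agent, and some reject HEAD outright, so a fallback to GET may still be needed.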