HN Mirror



	by city41 881 days ago
	One of my websites has links to nytimes.com. They work fine when clicked manually, but Bernard reports them as 403s. I wonder if NYT is classifying Bernard as a scraper?


floodle 881 days ago

I'm also seeing valid links to reuters.com coming up as 401 Unauthorized.


sph 881 days ago

The Internet is a wild place, and I reckon 90% of the complexity of a crawler is dealing with workarounds and non-compliant servers (cough www.apple.com cough).

I'll have a look, thanks for the heads up.
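
One possible workaround for the reports above, purely as a sketch: treat 401/403/429 as "the server refused the crawler" instead of "the link is dead", so they get flagged for a manual check. The function name and the bucket labels are made up for illustration and are not how Bernard actually works.

```python
def classify_status(status: int) -> str:
    """Bucket an HTTP status code for a link checker (hypothetical labels)."""
    if 200 <= status < 400:
        return "ok"
    # Servers often answer crawlers with these even when the page is
    # perfectly reachable from a browser.
    if status in (401, 403, 429):
        return "needs-review"
    # The resource really is gone.
    if status in (404, 410):
        return "broken"
    return "unknown"
```

With this, the nytimes.com 403 and the reuters.com 401 would both land in "needs-review" rather than being reported as broken.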


hasty_pudding 881 days ago

Are you setting the headers to make the sites think it's a browser?

Edit: "User-agent: bernard/1.0"

I bet that's going to cause issues.

I'd fake a browser user agent for off-domain sites.
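
Roughly what I mean, as a sketch only: keep the honest crawler user agent on the site being checked and send a browser-like one everywhere else. The browser string, the Accept value, and the `headers_for` helper are my own placeholders, not anything from Bernard.

```python
from urllib.parse import urlparse

CRAWLER_UA = "bernard/1.0"
# Placeholder browser string; any current browser UA would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def headers_for(link: str, site_host: str) -> dict[str, str]:
    """Pick request headers depending on whether the link leaves the site."""
    host = (urlparse(link).hostname or "").lower()
    # Relative links have no host, so they count as same-site.
    same_site = (
        not host
        or host == site_host
        or host.endswith("." + site_host)
    )
    user_agent = CRAWLER_UA if same_site else BROWSER_UA
    return {"User-Agent": user_agent, "Accept": "text/html,*/*;q=0.8"}
```

The returned dict can be passed as the headers of whatever HTTP client does the fetching. Whether a given site still blocks the request after this is another matter, since some also look at IP ranges or request rate.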
