


	by mysteryleo 5184 days ago
	I wish Google would let me crawl Blogger pages for my index. I don't think it's fair that Google can index the web but has created a walled garden around the content it hosts. Basically, if you start indexing the web and some of the sites are hosted on Blogger, Google will block your crawler, if I remember correctly.

2 comments

abraham 5184 days ago

What are you talking about? The only part of Blogger that is restricted for robots is the search pages (as they should be).

http://googleblog.blogspot.com/robots.txt


mysteryleo 5184 days ago

I'm talking about this:

http://support.google.com/websearch/bin/answer.py?hl=en&...

There are ways of blocking crawlers other than robots.txt.


forgotusername 5184 days ago

    User-agent: Mediapartners-Google
    Disallow: 

    User-agent: *
    Disallow: /search
    Disallow: /

    User-Agent: googlebot
    Disallow: /search
    Allow: /

Whoa, that is surprising. I note Bing has Blogspot in its index anyway. Perhaps they use the Atom API when they see a Blogspot URL? (Technically that's not 'crawling'.)
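
For anyone who wants to check how those rules read to a parser, here is a rough sketch using Python's standard `urllib.robotparser`. The rules are the ones quoted above; the `MyCrawler` user agent and the `allowed` helper are made up for illustration, and the paths are just examples. Python's parser applies the first matching rule in a group, which may differ from how a given search engine resolves conflicts, though for these rules the ordering should not change the answer.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the comment above.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Disallow: /

User-Agent: googlebot
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Host taken from the thread; any host works since only the path is matched.
BASE = "http://weliveyoung.blogspot.co.uk"


def allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` under these rules?"""
    return parser.can_fetch(agent, BASE + path)


# "MyCrawler" is an invented name standing in for any third-party crawler.
for agent in ("Googlebot", "Mediapartners-Google", "MyCrawler"):
    for path in ("/", "/search/label/news"):
        print(f"{agent:22} {path:20} {allowed(agent, path)}")
```

Under these rules a generic crawler falls into the `*` group and is shut out of everything, while the `googlebot` group only loses `/search`.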


abraham 5184 days ago

Where are you getting that? It doesn't match what I'm seeing. http://googleblog.blogspot.com/robots.txt


forgotusername 5184 days ago

Googled for "blogspot", picked first random domain I saw, "weliveyoung.blogspot.com", fetched "weliveyoung.blogspot.com/robots.txt" with curl, got a redirect to "weliveyoung.blogspot.co.uk/robots.txt", fetched that, voila.

Perhaps there is a user setting that controls it.
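
If you want to repeat the fetch without curl, something like the following should do roughly the same thing. It is only a sketch: `robots_url` and `fetch_robots` are names I made up, and it assumes the host still answers over plain HTTP. `urllib.request.urlopen` follows redirects on its own, so the returned URL shows where you ended up (e.g. a country-specific domain).

```python
import urllib.request
from typing import Tuple


def robots_url(host: str, scheme: str = "http") -> str:
    """Build the robots.txt URL for a bare hostname (hypothetical helper)."""
    return f"{scheme}://{host}/robots.txt"


def fetch_robots(host: str, timeout: float = 10.0) -> Tuple[str, str]:
    """Fetch robots.txt and return (final URL after redirects, body text)."""
    # urlopen follows HTTP redirects by default, much like `curl -L`.
    with urllib.request.urlopen(robots_url(host), timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        return resp.geturl(), body
```

Calling `fetch_robots("weliveyoung.blogspot.com")` performs a real network request; what it returns today may well differ from what is quoted in this thread.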


abraham 5184 days ago

Looks like you can use whatever you want. The one I linked to is the default.

https://support.google.com/blogger/bin/answer.py?hl=en&a...
