Hacker News
by mysteryleo 5137 days ago
I wish Google would let me crawl Blogger pages for my index. I don't think it is fair that Google can index the web but has created a walled garden for the content it hosts.

Basically, if you start indexing the web and some of the sites are hosted on Blogger, Google will block your crawler, if I remember correctly.

2 comments

What are you talking about? The only part of Blogger that is restricted for robots is the search pages (as they should be).

http://googleblog.blogspot.com/robots.txt

Talking about this

http://support.google.com/websearch/bin/answer.py?hl=en&...

There are ways of blocking crawlers other than robots.txt.

    User-agent: Mediapartners-Google
    Disallow: 

    User-agent: *
    Disallow: /search
    Disallow: /

    User-Agent: googlebot
    Disallow: /search
    Allow: /
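
To make the effect of the rules quoted above concrete, here is a minimal sketch that feeds them to Python's standard `urllib.robotparser` and asks which paths each crawler may fetch. The host `example.blogspot.co.uk` and the crawler name `MyCrawler` are made up for illustration; only the rules themselves come from the file above.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted in the comment above.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Disallow: /

User-Agent: googlebot
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical host, used only to build full URLs for the checks.
BASE = "http://example.blogspot.co.uk"


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the rules above."""
    return parser.can_fetch(agent, BASE + path)


if __name__ == "__main__":
    # "MyCrawler" is a hypothetical third-party crawler name.
    for agent in ("googlebot", "Mediapartners-Google", "MyCrawler"):
        for path in ("/", "/search", "/2012/05/some-post.html"):
            verdict = "allowed" if allowed(agent, path) else "blocked"
            print(f"{agent:22} {path:26} {verdict}")
```

Under these rules any crawler that falls into the `*` group is blocked from every path, while googlebot is blocked only from `/search`. Python's parser takes the first matching rule within a group, whereas Google documents a most-specific-match rule; for this particular file both readings should give the same answers.
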

Woah, that is surprising. I note Bing has blogspot in its index anyway. Perhaps they use the ATOM API when they see a Blogspot URL? (technically not 'crawling')

Where are you getting that? It doesn't match what I'm seeing. http://googleblog.blogspot.com/robots.txt

Googled for "blogspot", picked first random domain I saw, "weliveyoung.blogspot.com", fetched "weliveyoung.blogspot.com/robots.txt" with curl, got a redirect to "weliveyoung.blogspot.co.uk/robots.txt", fetched that, voila.

Perhaps there is a user setting that controls it.
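
For anyone who wants to repeat the curl check described above on another blog, this is a rough sketch using only Python's standard library. `urlopen` follows HTTP redirects on its own, so the country-domain hop is handled automatically and the final URL is reported back. The function names and the `robots-check-sketch` User-Agent string are invented for this example, and what a given blog serves today may differ from what is quoted in this thread.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def robots_url(site: str) -> str:
    """Build the robots.txt URL for a bare host name or a full page URL."""
    if "://" not in site:
        site = "http://" + site
    parts = urlsplit(site)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(site: str, user_agent: str = "robots-check-sketch") -> tuple[str, str]:
    """Fetch robots.txt, following redirects; return (final_url, body)."""
    req = Request(robots_url(site), headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        # geturl() reports where we ended up after any redirects.
        return resp.geturl(), body
```

Calling `fetch_robots("weliveyoung.blogspot.com")` would return the URL that was finally served along with the file's text, which makes it easy to compare a handful of blogs and see whether they all carry the same rules.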

Looks like you can use whatever you want. The one I linked to is the default.

https://support.google.com/blogger/bin/answer.py?hl=en&a...