Hacker News
by budgi3 5453 days ago
So what should his robots.txt look like? At the moment it is:

   User-Agent: *
   Disallow: /21000/

2 comments

It's mostly sufficient. /21000/ will not match "http://picolisp.com/21000", which is the first URL in the sequence, but the remaining URLs look like "http://picolisp.com/21000/!start?*Page=+2", so once it has re-read the robots.txt, Googlebot will likely continue to download only that one page.

Which is what you deserve for using non-standard URL formats.

Hold on, slash at the end is not standard?
No, I'm saying /21000/ will match a path with a directory named /21000 but not a file named /21000.
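
To make the directory-versus-file point concrete, here is a minimal sketch that treats a Disallow rule as a plain prefix match against the URL's path and query. The helper name `is_disallowed` is made up for illustration, and real crawlers such as Googlebot also handle wildcards and percent-encoding, so take this as a simplification rather than how any particular bot works.

```python
from urllib.parse import urlsplit


def is_disallowed(url, rule):
    """Return True if the URL's path (plus query) starts with the Disallow rule.

    Hypothetical helper: a plain prefix match, with no wildcard support.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target.startswith(rule)


first_page = "http://picolisp.com/21000"
later_page = "http://picolisp.com/21000/!start?*Page=+2"

# "/21000" does not start with "/21000/", so the first page is still crawlable.
print(is_disallowed(first_page, "/21000/"))  # False
# "/21000/!start?*Page=+2" does start with "/21000/", so it is blocked.
print(is_disallowed(later_page, "/21000/"))  # True
```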

When I say "non-standard", I am saying that if the website's URLs looked like "/21000/foo" and "/21000/foo?page=2", it would have been easier to craft a "Disallow" rule that successfully blocked all of the desired pages.

   User-Agent: *
   Disallow: /21000
or

   User-Agent: *
   Disallow: /
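
If you want to sanity-check these before deploying, Python's standard `urllib.robotparser` can evaluate a rule set against sample URLs. This is only a rough check: the `blocked` helper and the "Googlebot" agent string are my own choices for the example, and Python's parser is not guaranteed to behave identically to Google's.

```python
from urllib.robotparser import RobotFileParser

FIRST_PAGE = "http://picolisp.com/21000"
LATER_PAGE = "http://picolisp.com/21000/!start?*Page=+2"
HOME_PAGE = "http://picolisp.com/"


def blocked(disallow_path, url, agent="Googlebot"):
    """Return True if a one-rule robots.txt would stop `agent` fetching `url`.

    Hypothetical helper wrapping the standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(["User-Agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch(agent, url)


for rule in ("/21000/", "/21000", "/"):
    results = [blocked(rule, url) for url in (FIRST_PAGE, LATER_PAGE, HOME_PAGE)]
    print(rule, results)
```

Under this parser, the current /21000/ rule leaves the first page fetchable, /21000 blocks both application URLs while leaving the home page alone, and / blocks everything.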