| HN Mirror

by budgi3 5453 days ago

so what should his robots.txt look like? at the moment it is:

   User-Agent: *
   Disallow: /21000/

saalweachter 5453 days ago

It's mostly sufficient. /21000/ will not match "http://picolisp.com/21000", which is the first URL in the sequence, but the remaining URLs look like "http://picolisp.com/21000/!start?*Page=+2", so once Googlebot has re-read the robots.txt it will likely keep downloading only that single first page.

Which is what you deserve for using non-standard URL formats.

Florin_Andrei 5452 days ago

Hold on, slash at the end is not standard?

saalweachter 5452 days ago

No, I'm saying /21000/ will match a path with a directory named /21000 but not a file named /21000.

When I say "non-standard", I am saying that if the website's URLs had looked like "/21000/foo" and "/21000/foo?page=2", it would have been easier to craft a "Disallow" rule that successfully blocked all of the desired pages.

bauchidgw 5453 days ago

   User-Agent: *
   Disallow: /21000

   User-Agent: *
   Disallow: /
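
For anyone who wants to check the difference between "Disallow: /21000/" and "Disallow: /21000" discussed in this thread, here is a minimal sketch of plain prefix matching. The helper name is_blocked is hypothetical, and the sketch assumes a crawler that does simple "starts with" comparison against the path plus query string; it ignores wildcards, Allow lines, and percent-encoding, which real crawlers may handle differently.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, disallow_prefixes: list[str]) -> bool:
    """Return True if the URL's path (plus query) starts with any Disallow prefix.

    Hypothetical helper: simple prefix matching only, no wildcard or Allow support.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # An empty Disallow value blocks nothing, so skip empty prefixes.
    return any(target.startswith(prefix) for prefix in disallow_prefixes if prefix)


first_url = "http://picolisp.com/21000"
later_url = "http://picolisp.com/21000/!start?*Page=+2"

for rule in ("/21000/", "/21000"):
    print(rule, is_blocked(first_url, [rule]), is_blocked(later_url, [rule]))
# /21000/ False True
# /21000 True True
```

Under these assumptions the rule with the trailing slash lets the first URL through, while the rule without it covers both. Note that a bare "/21000" prefix would also match unrelated paths such as "/210001".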
