I wrote a webserver myself, albeit a specialised one, and out of curiosity I also created a few pages that were in no way accessible unless you knew their web addresses. There were no links to these pages from the home page or anywhere else, and I didn't even tell anyone about them, yet my logs showed those pages being spidered!
My robots.txt was set up as an instruction to proceed no further, so I think there are other feedback mechanisms guiding the spiders, but I haven't worked out whether they come from the web browser or from actual infrastructure like switches or routers.
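For anyone wondering what a "proceed no further" robots.txt looks like to a crawler that honours it, here is a minimal sketch using Python's standard `urllib.robotparser`. The file contents, the `ExampleBot` user agent, and the `example.com` URL are assumptions made up for illustration, not the actual setup described above.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents: a blanket "do not crawl anything" robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that obeys robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical bot name and URL, used only to exercise the check.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/hidden.html"))
```

Note that robots.txt is purely advisory: it tells a well-behaved crawler what not to fetch, but it does nothing to stop a crawler that ignores it, and it does not explain how a crawler learned the address in the first place.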
On an eCommerce site I'm responsible for, I changed some links from a GET to a POST. "BingPreview" continued hitting those links with GET requests, polluting my logs with hundreds of "method not allowed" entries. So I blocked that UA from those links; nothing changed. I banned the bot altogether; it kept hitting my site. This went on for well over a year.
What does that mean exactly? An actual user can't be involved because the links that trigger a GET simply aren't there anymore. Therefore I assume it's a bot hitting faulty links it finds in its cache.
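To make the setup concrete, here is a rough sketch of the kind of server-side rule involved, written as a plain Python function rather than any particular server's config. The path `/cart/add` and the function name `status_for` are hypothetical; only the "BingPreview" UA string and the "method not allowed" response come from the situation above.

```python
# Substrings of User-Agent headers to refuse outright.
BLOCKED_UA_SUBSTRINGS = ("BingPreview",)

# Hypothetical endpoints that used to be GET links and now accept POST only.
POST_ONLY_PATHS = {"/cart/add"}


def status_for(method: str, path: str, user_agent: str) -> int:
    """Pick the HTTP status code for a request under the rules above."""
    ua = user_agent.lower()
    if any(s.lower() in ua for s in BLOCKED_UA_SUBSTRINGS):
        return 403  # Forbidden: the UA is banned
    if path in POST_ONLY_PATHS and method.upper() != "POST":
        return 405  # Method Not Allowed: a stale GET against a POST-only link
    return 200
```

Either rule only changes the response the bot receives. The request still arrives and still gets logged, so a bot that keeps replaying old URLs will keep showing up regardless of which status code it is sent.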
Admittedly, my experiment with the unlinked pages on my own webserver was before HTTPS became common.