by aaarrm 24 days ago
Is it possible to host your website in such a way that it couldn't be found via search engines (and thus, I hope, wouldn't be crawlable)?

I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.

5 comments

If you're really interested in doing this, and are happy with just text and basic styling, I recommend trying other protocols, such as hosting a Gemini or Gopher site. I don't think scraping happens there on anywhere near the same scale as on conventional websites.

That being said, you would require your users to download a Gemini- or Gopher-compatible browser.

Sure, depends on how accessible to people you want it to be.

Most legit search engines are going to honor robots.txt and you can disallow access.
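
As a rough illustration of what "honoring robots.txt" means, here is a minimal sketch using Python's standard `urllib.robotparser`. The blanket `Disallow: /` rule, the `ExampleBot` user agent, and the example.com URL are assumptions made up for the example; this is only the check a polite crawler would run, and nothing forces a crawler to run it.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a hypothetical crawler name.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/page"))
```

A crawler that skips this check sees exactly the same pages as one that performs it, which is why the later steps exist.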

Next level would be using something like rate limiting controls and/or Cloudflare's bot fight mode to start blocking the bad bots. You start to annoy some people here.
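
To make "rate limiting controls" a bit more concrete, here is a small token-bucket sketch with one bucket per client IP. It is only one of several ways to rate limit, and it is not how Cloudflare's bot fight mode works; the class name, the limits (bursts of 5, refilled at 1 request per second), and the IP address are all assumptions chosen for illustration.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float,
                 now: Optional[float] = None) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Spend one token if available; return False when the client is over the limit."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client IP (hypothetical limits: burst of 5, 1 request/second).
buckets: Dict[str, TokenBucket] = {}


def allow_request(client_ip: str, now: Optional[float] = None) -> bool:
    bucket = buckets.get(client_ip)
    if bucket is None:
        bucket = buckets[client_ip] = TokenBucket(5, 1.0, now=now)
    return bucket.allow(now)
```

Limits like these also hit humans behind shared IPs or heavy readers, which is the "annoy some people" trade-off.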

Next would be putting the content behind some form of auth.
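
As one possible form of auth, here is a sketch of checking an HTTP Basic `Authorization` header before serving content. The username and password are hypothetical placeholders, and a real deployment would more likely rely on its web server's or framework's auth support and store a password hash rather than a plaintext constant.

```python
import base64
import hmac
from typing import Optional

# Hypothetical credentials, for illustration only.
EXPECTED_USER = "reader"
EXPECTED_PASSWORD = "correct horse battery staple"


def is_authorized(authorization_header: Optional[str]) -> bool:
    """Check an HTTP Basic Authorization header against the expected credentials."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparisons avoid leaking how much of a guess matched.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    password_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and password_ok
```

When the check fails, the server would answer with status 401 and a `WWW-Authenticate: Basic` header, so a crawler without credentials never receives the page body.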

I don't know why we are trusting Cloudflare when they are the ones creating crawlers.

https://developers.cloudflare.com/browser-run/quick-actions/...

Possible, yes; probable, not likely. The moment you're issued a certificate, your domain will show up in the Certificate Transparency logs, which are constantly monitored by anyone who wants to find new sites.
...Yet another vector through which "security experts" have caused a waterbed problem. Let's secure the Internet, oh no! We made a centralized list of operating domains for hostile actors to guide attacks with!
Sure, let's hide everything behind obscure schemes which will definitely serve the spirit of openness of the web.
The point is that you can't escape side-channel applications of security metadata being weaponized the more you try to force ubiquity of "security" everywhere. As long as there are motivated, profit-seeking attackers, you have to take into account the toxic nature of metadata. This is another example of "A System Is What It Does" proving the pointlessness of "POSIWID". Intent doesn't matter. Certificate transparency was intended to clue us into bad cert issuing, but it is also a list of potential targets where AI crawlers can be directed to scrape new data. Intent doesn't change what it is. Cert transparency is certainly transparency + a "training data might end up here" list.
robots.txt is a way of leaving the door unlocked but kindly asking bots to stay outside.
Which in a law-abiding society should be enough. It's also how we do things in the real world in many cases, e.g. here you can just write on your mailbox "no ads" and companies have to respect that.

Even when we do actually put physical locks on things, they are mostly there to show that someone breaking in did so intentionally, and are not at all designed to stop motivated attackers.

> here you can just write on your mailbox "no ads" and companies have to respect that

Where do you live? In the US it’s actually illegal for anyone except the USPS to deliver to a mailbox.

You might be interested to know that entering an unlocked door into a space you do not have permission to be in is still illegal.
You might be interested to know that the “illegality” depends on the intent. If I rest on your unlocked door handle, it opens, I enter, it’s an accident.
Sorry, what? In this scenario are you claiming that you accidentally fell inside the restricted area because you were leaning on the door? Or are you claiming that you accidentally opened the door and then walked through intentionally? In the former case, you are guilty of breaking and entering in most US jurisdictions if you don’t promptly get out. Any sane court would likely agree an accidental trespass is probably not a criminal act, but it’s not an accident if you stay. In the latter case, you’re clearly trespassing illegally.

Also this has gotten pretty far away from the web scraping scenario. There’s no door accidentally opening here.

Oops, I just accidentally fell into every website. Don't know how that happened ...
Which works when you live in normal civil times. When you live in jungle times, people and robots will do whatever they want and the most powerful will get their way.
You could just put your website content behind its own chat interface. The crawler would just see a form input for a prompt.