by tptacek 2536 days ago
There's a running joke among web pentesters about robots.txt being the first place you look when hitting a new site.
8 comments

Meanwhile over in .gov I’ve had to explain to a pentester that it wasn’t a security problem that robots.txt was accessible without authentication, based on a very big vendor’s scanner having badly regurgitated the OWASP advice.
The "security" world has an unusually high level of total incompetence. It is scary.
This is common any time there’s so much demand: in the late 90s it was not uncommon to be in a room full of people who were ostensibly web developers and didn’t understand how the web or their backend servers worked but were certain they were about to become rich.

Security is especially bad because so many large organizations are under pressure to improve but the market is tight and the pool of experts is limited. Also, many places have outsourced to large contracting companies who don’t want to admit they don’t have enough qualified staff and will hope that you’ll be satisfied with whoever they deliver.

Yeah no doubt it is a phase.

It's just a really nasty phase right now.

I always think of this:

https://medium.com/@djhoulihan/no-panera-bread-doesnt-take-s...

A few years ago I purposefully put a couple of "interesting" paths in the robots.txt as a honeypot to test/capture bot conformance and malicious actors. Not one hit ever.
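For anyone wanting to try the same experiment, here is a rough sketch of how hits on honeypot paths could be pulled out of an access log. It assumes the server writes Common Log Format lines; the function name `honeypot_hits`, the regex, and the example path prefix are all hypothetical and not taken from the setup described above.

```python
import re

# Assumption: Common Log Format, e.g.
# 203.0.113.7 - - [10/Oct/2020:13:55:36 +0000] "GET /secret-admin/ HTTP/1.1" 404 153
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*"')


def honeypot_hits(log_lines, honeypot_prefixes):
    """Return (client, path) pairs for requests that touched a honeypot path.

    honeypot_prefixes are the decoy paths listed under Disallow in robots.txt
    (hypothetical values; nothing legitimate should ever link to them).
    """
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        client, _method, path = match.groups()
        if any(path.startswith(prefix) for prefix in honeypot_prefixes):
            hits.append((client, path))
    return hits
```

Since a well-behaved crawler should never request a disallowed path, any pair returned here is worth a closer look.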
They just found a path further up and compromised you via that instead of bothering with the rest of the robots.txt :D
A while back I wrote a Python script to watch for links posted on Twitter and then scrape their /robots.txt file [1]. The requests are routed through Tor for privacy purposes.

It's been incredibly enlightening. One thing that sticks out immediately is that you can identify the underlying HTTP framework in many cases due to the defaults. Sometimes even the exact version.

And, yes, people do use the robots file to "protect" or "hide" endpoints and they can effectively be used to enumerate potential endpoints worth investigating further (from a pentesting perspective).

[1] https://gist.github.com/wybiral/20c20ccf00b6c93506b8acdc6ccb...
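To illustrate the enumeration step, here is a minimal sketch that pulls the Disallow paths out of a robots.txt body. It is not the linked script: the function name `disallowed_paths` is made up for this example, and it works on text you have already fetched, so there is no networking or Tor routing.

```python
from __future__ import annotations


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the unique Disallow paths in a robots.txt body, in file order."""
    paths: list[str] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # blank line or something that is not a field
        field, value = line.split(":", 1)
        value = value.strip()
        # Field names are matched case-insensitively; an empty Disallow
        # means "nothing is disallowed", so it is skipped.
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths
```

The resulting list is the set of candidate endpoints to look at, and the paths themselves often hint at the stack. A `/wp-admin/` entry, for instance, suggests WordPress.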

Silly old me always starts with / in a browser. Then I click on links. Not all sites leak information like a sieve with the wire bit removed but many do. There is sometimes no need to do anything clever like look for robots.txt.
It’s like walking through an office and seeing an unlocked door with a “Do not enter” sign.
In addition to the obvious point that it is literally a list of places where admins don't want you to look, it is also often useful for backend technology enumeration.
It’s very literally the second bullet point on my enumeration list for web apps, right behind looking at the DNS records for the domain.
It's far from a joke.