by internetter 839 days ago
Dorking is the technique of using public search engine indexes to uncover information that is presumed to be private. It has been used to uncover webcams, credit card numbers, confidential documents, and even spies.

The problem is the website administrators who encode authentication tokens into URL state, not the naive crawlers that find them.

4 comments

That isn't an inherent problem with having a secret in the url. The problem is the url was leaked somewhere where it could get indexed.

And sometimes it isn't practical to require a POST request or a cookie.

And the risk of a url leaking can be greatly mitigated if the url is only valid for a short period of time.

> That isn't an inherent problem with having a secret in the url. The problem is the url was leaked somewhere where it could get indexed.

Technically you're right -- after all, sending an authentication token as a separate header doesn't make any difference.

    GET /endpoint/?Auth=token
or

    GET /endpoint
    Auth: token
Sends the same data over the wire.

However, software treats URLs differently from headers. URLs sit in browser histories and server logs, get parsed by MITM firewalls, are mined by browser extensions, etc.
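To make the difference concrete, here is a minimal Python sketch that builds both requests as raw bytes and then looks only at the request line, which is roughly what a typical access log records. The host `site.example`, the token value, and the function names are made up for illustration, and real log formats vary.

```python
def request_with_query_token(host: str, path: str, token: str) -> bytes:
    """Raw HTTP/1.1 request carrying the token in the query string."""
    return (
        f"GET {path}?Auth={token} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "\r\n"
    ).encode("ascii")


def request_with_header_token(host: str, path: str, token: str) -> bytes:
    """Raw HTTP/1.1 request carrying the same token in a header."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Auth: {token}\r\n"
        "\r\n"
    ).encode("ascii")


def request_line(raw: bytes) -> str:
    """First line of the request (assumption: this is what gets logged)."""
    return raw.split(b"\r\n", 1)[0].decode("ascii")


if __name__ == "__main__":
    q = request_with_query_token("site.example", "/endpoint/", "s3cr3t")
    h = request_with_header_token("site.example", "/endpoint", "s3cr3t")
    print(request_line(q))  # token is part of the logged line
    print(request_line(h))  # token is not part of the logged line
```

Both byte strings contain the token, so the wire exposure is the same. Only the query-string form puts it where logs, histories, and scanners routinely look.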

using https://user:pass@site.com/endpoint or https://auth:token@site.com/endpoint

Would be better than

https://site.com/endpoint/user/pass or https://site.com/endpoint/?auth=token

As the former is less likely to be stored, either on the client or on the server. I don't do front end (or backend authentication -- I just rely on X.509 client certs or OIDC, and the web server passes the validated username).

For better or worse, basic auth in the URL isn't really an option any more (e.g. see https://stackoverflow.com/a/57193064). I think the issue was that it reveals the secret to anyone who can see the URL bar, but the alternative we got still has that problem and also has the problem that the secret is no longer separable from the resource identifier.
The browser could hide the secret after it is entered.
Yeah, and it would still be useful for queries that never appear in the URL bar (like EventSource and WebSocket, where setting an Authorization header is not something that’s exposed by the browser)
It can be OK to put authentication tokens in urls, but those tokens need to (at a bare minimum) have short expirations.
>It can be OK to put authentication tokens in urls

When would this ever be necessary? URL session tokens have been a bad idea ever since they first appeared.

The only things even near to auth tokens I can reasonably see stuffed into a URL are password reset and email confirmation tokens sent to email for one time short expiration use.

Outside of that, I don't see any reason for it.

"presigned" URLs[1] are a pretty standard and recommended way of providing users access to upload/download content to Amazon S3 buckets without needing other forms of authentication like IAM credential pair, or STS token, etc

Web Applications do utilize this pattern very frequently

But as noted in a previous comment, these do have short (configurable) expiry times, so there is no permanent or long-term risk along the lines of the OP article

[1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...

You are right about short expiry times but another catch here is that if pre-signed URLs are being leaked in an automated fashion, these services also keep the downloaded content from these URLs around. I found various such examples where links no longer work, but PDFs downloaded from pre-signed URLs were still stored by scanning services.

From https://urlscan.io/blog/2022/07/11/urlscan-pro-product-updat...

> In the process of scanning websites, urlscan.io will sometimes encounter file downloads triggered by the website. If we are able to successfully download the file, we will store it, hash it and make it available for downloading by our customers.

Indeed, the only valid operation with the magic URL is exchanging the URL-based token with something else (your PDF, a session token, etc.) and then expiring the URL, so by the time the scanner gets around to it the original URL is invalid.
That seems ripe for race condition class problems.
Aha. That is an interesting issue indeed.
Interesting. I haven't built on s3, and if I did my first instinct would probably have been to gate things through a website.

Thanks for sharing your knowledge in that area.

They're useful for images when you can't use cookies and want the client to easily be able to embed them.
I wonder if there would be a way to tag such URLs in a machine-recognizable, but not text-searchable way. (E.g. take every fifth byte in the URL from after the authority part, and have those bytes be a particular form of hash of the remaining bytes.) Meaning that crawlers and tools in TFA would have a standardized way to recognize when a URL is meant to be private, and thus could filter them out from public searches. Of course, being recognizable in that way may add new risks.
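A rough Python sketch of that tagging idea, purely to show it is mechanically possible. Everything here is an assumption: the use of SHA-256 hex digits as check characters, the minimum of 8 check characters before a URL counts as tagged, and the function names. It works on the part of the URL after the authority and treats it as ASCII text.

```python
import hashlib

GROUP = 4       # payload characters between check characters (every fifth is a check)
MIN_CHECKS = 8  # hypothetical threshold; fewer checks would give many false positives


def _check_chars(payload: str, count: int) -> str:
    """Derive `count` check characters from a hash of the payload."""
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return "".join(digest[i % len(digest)] for i in range(count))


def tag_private(payload: str) -> str:
    """Insert one check character after every GROUP payload characters."""
    chunks = [payload[i:i + GROUP] for i in range(0, len(payload), GROUP)]
    checks = _check_chars(payload, len(payload) // GROUP)
    out = []
    for k, chunk in enumerate(chunks):
        out.append(chunk)
        if len(chunk) == GROUP:  # a trailing partial chunk gets no check character
            out.append(checks[k])
    return "".join(out)


def strip_tag(tagged: str) -> str:
    """Drop every fifth character to recover the original payload."""
    return "".join(ch for i, ch in enumerate(tagged) if i % (GROUP + 1) != GROUP)


def looks_private(tagged: str) -> bool:
    """True if every fifth character matches the hash of the remaining characters."""
    checks = tagged[GROUP::GROUP + 1]
    if len(checks) < MIN_CHECKS:
        return False
    return checks == _check_chars(strip_tag(tagged), len(checks))
```

A crawler could call `looks_private` on the path and query of each URL it sees and skip matches, while the server would call `strip_tag` before routing. As the comment says, anything a well-behaved crawler can recognize, a hostile one can recognize too.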
We already have a solution to this. It’s called not including authentication information within URLs

Even if search engines knew to include it, would every insecure place a user put a link know it? Bad actors with their own indexes certainly wouldn’t care

How do you implement password-reset links otherwise? I mean, those should be short-lived, but still.
You could send the user a code that they must copy paste onto the page rather than sending them a link.
Hopefully using POST, not GET. GET links get logged by the HTTP server most of the time. Just another great way to store your 'security credential' in plain text. Logs get zipped and archived. Good luck with any security measure.
I mean of course the idea was to put it in a form that is sent using POST, but even then, it's a single-use reset code so once it shows in the log it's worthless.
As you said, short lived codes. And the codes don’t contain any PII. So even if the link does get indexed, it’s meaningless and useless.
A short-lived link that's locked down to their user agent/IP would work as well.
Actually, there are cases where this is more or less unavoidable.

For example, if you want a web socket server that is accessible from a browser, you need authentication, and can't rely on cookies, the only option is to encode the Auth information in the URL (since browsers don't allow custom headers in the initial HTTP request for negotiating a web socket).

Authentication: Identify yourself

Authorization: Can you use this service.

Access Control/Tokenization: How long can this service be used for.

I swipe my badge on the card reader. The lock unlocks.

Should we leave a handy door stopper or 2x4 there, so you can just leave it propped open? Or should we have tokens that expire in a reasonable time frame.. say a block of ice (in our door metaphor) so it disappears at some point in future? Nonce tokens have been a well understood pattern for a long time...

It's not that these things are unavoidable; it's that security isn't a first principle, or isn't easy to embed due to design issues.

> Or should we have tokens that expire in a reasonable time frame.

And that are single-use.

(Your password reset "magic link" should expire quickly, but needs a long enough window to allow for slow mail transport. But once it's used the first time, it should be revoked so it cannot be used again even inside that timeout window.)
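A minimal in-memory Python sketch of a link token that both expires and is revoked on first use. The class name, the `clock` parameter, and the 15-minute window in the usage note are assumptions for illustration. A real service would keep this in a database and need an atomic "delete and return" there to avoid the race conditions mentioned upthread.

```python
import secrets
import time


class OneTimeTokens:
    """Single-use tokens with an expiry window (a sketch, not production code)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending = {}  # token -> (user_id, expiry time)

    def issue(self, user_id: str) -> str:
        """Create an unguessable token that can be embedded in a reset link."""
        token = secrets.token_urlsafe(32)
        self._pending[token] = (user_id, self._clock() + self._ttl)
        return token

    def redeem(self, token: str):
        """Return the user id on first use inside the window, otherwise None."""
        # pop() removes the token whether or not it is still valid,
        # so a second attempt always fails.
        entry = self._pending.pop(token, None)
        if entry is None:
            return None
        user_id, expires_at = entry
        if self._clock() > expires_at:
            return None
        return user_id
```

For example, `OneTimeTokens(ttl_seconds=900)` gives links a 15-minute window for slow mail transport, and a scanner that fetches the link after the user has clicked it gets nothing. A scanner that fetches it first burns the token, which is one reason to redeem only on an explicit POST.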

> the only option is to encode the Auth information in the URL (since browsers don't allow custom headers in the initial HTTP request for negotiating a web socket).

Put a timestamp in the token and sign it with a private key, so that the token expires after a defined time period.

If the URL is only valid for the next five minutes, the odds that the URL will leak and be exploited in that five-minute window are very low.
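A minimal Python sketch of such an expiring token. Note the swap: instead of an asymmetric private key it uses HMAC-SHA256 with a shared server-side secret, which is enough when the same server issues and verifies the token. The token layout `subject.expires.signature` and the function names are assumptions for illustration.

```python
import hashlib
import hmac
import time


def _sign(secret: bytes, payload: str) -> str:
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def make_token(secret: bytes, subject: str, ttl_seconds: int, now=None) -> str:
    """Return 'subject.expires.signature', valid for ttl_seconds."""
    expires = int(time.time() if now is None else now) + ttl_seconds
    payload = f"{subject}.{expires}"
    return f"{payload}.{_sign(secret, payload)}"


def verify_token(secret: bytes, token: str, now=None):
    """Return the subject if the signature is valid and unexpired, else None."""
    try:
        subject, expires_str, sig = token.rsplit(".", 2)
        expires = int(expires_str)
    except ValueError:
        return None  # wrong number of parts or non-numeric expiry
    expected = _sign(secret, f"{subject}.{expires_str}")
    # Constant-time comparison on bytes to avoid leaking how much matched.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("ascii")):
        return None
    current = time.time() if now is None else now
    if current > expires:
        return None
    return subject
```

The token could then ride in something like `wss://site.example/socket?token=...` (a made-up URL). Anyone who captures the URL can still replay it until it expires, so this limits the damage from a leak but does not make the token single-use.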

Also, it would allow bad actors to just opt out of malware scans - the main vector whereby these insecure URLs were leaked.
So there was an interesting vector a while back where some email firewalls would reliably click on any link sent to them that was abused by spammers.

Spammers would sign up for services that required a click on a link using blabla@domainusingsuchservice

The service's phishing-check bots would reliably click on the link, rendering the account creation valid.

One particularly exploitable vendor for getting such links clicked was one that shares the name with a predatory fish that also has a song about it :)

SharkGate?

Why coy about naming them?

Barracuda. And for plausible deniability so they don’t have as much of a chance of catching a libel suit. Not sure how necessary or effective that is, but I do understand the motivation.
Yeah - that's just red-flagging "interesting" urls to people running greyhat and blackhat crawlers.
We already have robots.txt in theory.
I didn’t think robots.txt would be applicable to URLs being copied around, but actually it might be, good point. Though again, collecting that robots.txt information could make it easier to search for such URLs.
"public search engine indexes"

Then it should be the search engine at fault.

Leaving your house unlocked is one thing.

If there is a company trying everyone's doors, then posting a sign in the yard saying "this house is unlocked", that has to count for something.

A plain URL is an open door not a closed one. Most websites are public and expected to be public.
Isn't that the point of the post?

There are URLs that are out there 'as-if' public, but really should be private.

And some people argue they should be treated as private, even if it is just a plain URL and public.

You can't blame the search engine for indexing plain URLs. Listing a closed-but-unlocked door is a bad analogy.
Well. You also can't charge Joe Blow with a crime for browsing URLs that happen to be private but were accidentally made public.

Just by looking, you are guilty. That is wrong.

You've been appropriately downvoted for a terrible take.

Imagine if you left your house unlocked it would be broken into seconds later. Even worse, the people that broke into it live in a different country with no extradition law and you'd never figure out who they are anyway.

In this case your insurance company would tell you to lock your damned doors and the police may even charge you under public nuisance laws.

Yeah, it is a terrible take. It's a bad situation.

Just like charging people for a crime for accessing private material, simply by browsing a public URL.

Maybe Better take:

It is like someone being charged with breaking and entering simply for looking at a house from the street when the door was left open. You're guilty simply by looking and seeing inside. But you were just walking by, you saw inside before realizing it was a crime, and now you're guilty.

If you are going to charge people for accessing private sites, potentially by accident, simply because a search engine handed them a public URL, then shouldn't the search engine have some culpability?

Or. Better. Change the law so the onus is on the site to protect itself.

"Imagine a journalist finds a folder on a park bench, opens it, and sees a telephone number inside. She dials the number. A famous rapper answers and spews a racist rant. If no one gave her permission to open the folder and the rapper’s telephone number was unlisted, should the reporter go to jail for publishing what she heard?

If that sounds ridiculous, it’s because it is. And yet, add in a computer and the Internet, and that’s basically what a newly unsealed federal indictment accuses Florida journalist Tim Burke of doing when he found and disseminated outtakes of Tucker Carlson’s Fox News interview with Ye, the artist formerly known as Kanye West, going on the first of many antisemitic diatribes."

https://arstechnica.com/tech-policy/2024/03/charges-against-...

"According to Burke, the video of Carlson’s interview with Ye was streamed via a publicly available, unencrypted URL that anyone could access by typing the address into your browser."