by dna_polymerase 3057 days ago
It is the fucking internet, if you put something on there you should expect someone to find it, be it a crawler or an attacker.

> 1. they are crawling potentially sensitive information granted by links with tokens

If tokens in GET params are your security concept: please leave the entire field.

> 2. they are triggering potentially harmful and/or confusing actions in your website by repeating links

So you built something that can be triggered by a simple HTTP request and may have a harmful potential? Wow.

> 3. they are repeating requests in a broken way by not encoding url-parameters correctly

You are kidding, right? That's a problem for you? Either your web server drops these or your routes don't match, end of story.
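For anyone unsure what "not encoding url-parameters correctly" does in practice, here is a rough sketch using only Python's standard library. The host `shop.example`, the path, and the parameter values are made up for illustration; nothing here is taken from the crawler in question.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical parameters; the token contains characters that must be escaped.
params = {"token": "a+b/c=d&e", "next": "/orders?id=7"}

# urlencode percent-encodes each value so the query string survives a round trip.
query = urlencode(params)
url = "https://shop.example/reset?" + query

# A client that replays `url` verbatim gets the same parameters back.
decoded = parse_qs(urlsplit(url).query)

# Replaying the raw, unencoded token instead corrupts it:
# "+" is read as a space and "&" starts a new parameter.
broken = "https://shop.example/reset?token=a+b/c=d&e"
broken_decoded = parse_qs(urlsplit(broken).query, keep_blank_values=True)
```

With the broken URL the server sees a different token plus a stray parameter, so a sane route simply fails to match or the token lookup fails.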

> 4. I could not find a warning or note on their tracking-pixel documentation that pages tracked would be crawled later

Not a problem, you put it on the web and it will be crawled. Did you ever use Chrome? They report every URL you type to the Google Crawler. Read that anywhere lately?

6 comments

Everything you said was technically correct, yet the message will probably be lost due to the manner in which you decided to deliver it.
You're correct, but there's no need to be a dick about it
Can we make a minor exception for this case? Please? Let's trust the OP has a good sense of humor and can interpret critique apersonally.
While I certainly don't disagree with what you said, I think you need to look at his arguments as a way to protect user data. Not all users that use your "mediocre" technical solution are aware of how "mediocre" it is, whether tokens are sent with GET requests or whatever other stupid thing.
> Not a problem, you put it on the web and it will be crawled. Did you ever use Chrome? They report every URL you type to the Google Crawler. Read that anywhere lately?

Do you have a source for this? I Googled (!) and found this: https://www.stonetemple.com/google-chrome-discover-pages , which implies the opposite.

I don't use Chrome personally, but I do occasionally dump [none-too-critical] preview files on open but otherwise 'hidden' URLs on a domain for clients to view. I just find it easier for clients to deal with than inevitably lost passwords, etc., and tend to ask them to let me know when they're done so I can delete the folder.

I'd be interested to know whether their likely use of Chrome means that Google has a pattern of understanding of my domain space!

to clarify:

- marketing wants some tracking, some developer adds it

- ecommerce websites in the real world tend to "need" these tracking/conversion codes

- you do have legitimate get-requests like password-reset links with tokens, also we do use payment providers who send the customers back to us with get links which include payment tokens, newsletter-unsubscribe links are also often simple token links

- and yes, normally a get-request should not change anything (at least not when it's just repeated), but the sheer fact that they have access to it _and_ are crawling it is bad

my point being that I find it that they would just crawl everything they recorded instead of just crawling pages which are linked publicly or which are targeted in ad-campaigns combined with the fact that they don't warn you about it
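As an illustration of the "a get-request should not change anything" point for token links, here is a minimal sketch in plain Python. Everything in it is hypothetical: the `/unsubscribe` endpoint, the in-memory `_pending` store, and the `issue_token` and `handle` helpers are invented for the example, and a real site would use its framework's routing plus a database with token expiry.

```python
import secrets

# Hypothetical in-memory store mapping token -> email address.
_pending = {}

def issue_token(email):
    """Create an unguessable single-use token for an unsubscribe link."""
    token = secrets.token_urlsafe(32)
    _pending[token] = email
    return token

def handle(method, token):
    """Return (status, body) for a request to the hypothetical /unsubscribe endpoint."""
    if token not in _pending:
        return 404, "unknown or already used token"
    if method == "GET":
        # Safe and repeatable: only renders a confirmation form, changes nothing.
        return 200, "confirm form"
    if method == "POST":
        # The state change happens only here, and the token is consumed.
        email = _pending.pop(token)
        return 200, f"unsubscribed {email}"
    return 405, "method not allowed"
```

With this shape, a crawler that replays the GET link any number of times only ever sees the confirmation form; the action itself needs a POST, and once that happened the link is dead.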

> my point being that I find it that they would just crawl everything they recorded instead of just crawling pages which are linked publicly or which are targeted in ad-campaigns

There's no way to know which pages are linked publicly without crawling every page for links. So you're right back at square one.

Ultimately, if it's on an Internet-facing web server and not hidden behind an IP whitelist or secure login function, then you have to assume it is public. All you are arguing is about different degrees of "public", which somewhat misses the real issue of website security.
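To make the "IP whitelist" option concrete, a rough sketch with Python's standard `ipaddress` module follows. The networks are reserved documentation ranges used as placeholders, and `ALLOWED` and `is_allowed` are names invented here; in practice this check usually lives in the web server or firewall config rather than in application code.

```python
from ipaddress import ip_address, ip_network

# Placeholder networks (reserved documentation ranges), not real client addresses.
ALLOWED = [ip_network("203.0.113.0/24"), ip_network("2001:db8::/32")]

def is_allowed(remote_addr):
    """Return True only if remote_addr parses and falls inside an allowed network."""
    try:
        addr = ip_address(remote_addr)
    except ValueError:
        # Unparseable input is treated as not allowed.
        return False
    # Membership is False when address and network IP versions differ.
    return any(addr in net for net in ALLOWED)
```

Anything that fails this check gets a 403 before it ever reaches a "hidden" URL, which is the difference between actually restricted and merely unlinked.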

Some crawlers do deliberately hit random URLs to check how you're handling 404s. Other crawlers are entirely dishonest and will try to find content that wasn't intended to be made public. How are you going to handle them if you're stumped by the Facebook crawlers that you invited onto your site?

> ...combined with the fact that they don't warn you about it

It's pretty obvious behavior in my opinion but maybe they could have been more explicit. However going back to my previous point, no other crawler advertises what it's going to crawl beforehand. So where do you draw the line? Ranting that Google indexed your site? What about visitors buying stuff on your ecommerce package without prior communication requesting access to the site?

You wouldn't ask customers in a bricks-and-mortar store to state their intentions the moment they walked through the shop door, so why should every HTTP user agent have to do the same? While web security can be both complex and maddening, the responsibility for hardening the site is still yours, not Facebook's.

Stuff like this currently exists in the real world. Therefore, I can understand the complaints of the OP.