by jdorfman 1745 days ago
Saw this on Twitter yesterday, and it looks interesting. With that said, the one concern my team has is around privacy. The blog post says:

"All without ever having access to personally identifiable information or invading the privacy of your users."

Can you elaborate on how you go about that?


In short, using Scarf does not provide personally identifiable information about who is downloading your artifacts because we don't have that data ourselves.

The main way this is achieved is by purging any personally identifiable information from our system, mainly the IP address of a download request. Scarf uses the IP to look up metadata like company affiliation, cloud provider, coarse-grained location, etc., and surfaces that to you. Once that metadata is looked up, the original IP address is discarded. All information stored long term is fully anonymized.
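
For illustration only, here is a rough sketch of the "look up, then discard" flow described above. It is not Scarf's actual code: the lookup table, the field names, and the `record_download` function are all hypothetical stand-ins, and a real system would query an IP-intelligence database instead of a hard-coded dict.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical lookup table standing in for a real IP-intelligence database.
# 203.0.113.0/24 is a documentation-only range (RFC 5737).
_FAKE_IP_DB = {
    ipaddress.ip_network("203.0.113.0/24"): ("Example Corp", "example-cloud", "US-West"),
}


@dataclass(frozen=True)
class DownloadEvent:
    """The only record kept long term: derived metadata, no IP field."""
    package: str
    company: str
    cloud_provider: str
    region: str


def record_download(package: str, ip: str) -> DownloadEvent:
    """Derive coarse metadata from the IP, then let the IP go out of scope."""
    addr = ipaddress.ip_address(ip)
    company, cloud, region = "unknown", "unknown", "unknown"
    for network, meta in _FAKE_IP_DB.items():
        if addr in network:
            company, cloud, region = meta
            break
    # The raw address is never copied into the stored event.
    return DownloadEvent(package, company, cloud, region)


event = record_download("my-package", "203.0.113.7")
print(event)
```

The point of the sketch is only that the stored record has no field that could hold the address; whether that is enough to call the data "fully anonymized" is the question debated in the rest of the thread.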

This is impressive, but seems like a dark pattern to me a la tracking pixels in emails. An annoying use case I could see this used for is targeted spam. Say a company selling a software tool publishes a PDF of industry insights and then reaches out to everyone who's downloaded it. Or they publish an OCI image, and then try to sell everyone who uses it a support package.
Well, Scarf offers free pixel tracking too, so you definitely have the correct model for what we do, though I'm sorry to hear you dislike the approach.

Our goal is to help enable OSS developers to financially support their work. Do you think it's still wrong when it's OSS developers trying to sell their services or premium offerings to the companies that already rely on their work? If so - companies are tracking people all the time at a very granular, personally identifiable level. Why should we hold OSS developers to an even higher standard than what we tolerate from large companies?

> Why should we hold OSS developers to an even higher standard than what we tolerate from large companies?

The problem here is that I DON'T tolerate this from large companies either. I find the pixel tracking thing outrageous and disable images by default in my email client to avoid it.

I understand your argument, I just find it personally strongly disagreeable, and I'm willing to bet the poster above did as well.

> The problem here is that I DON'T tolerate this from large companies either. I find the pixel tracking thing outrageous and disable images by default in my email client to avoid it.

If you are already using OSS today and grabbing that software over the internet, you are tolerating it even if you claim otherwise. If you pull something down from GitHub, Microsoft has all the data that we're talking about here.

> I understand your argument, I just find it personally strongly disagreeable

Fair! And I understand yours too. I also think your argument is more idealistic than practical for the current state of the ecosystem, especially considering how many parties already have access to this web traffic data. Maintainers having this data too is a very benign additional party to have access to it. Furthermore, it's a concrete way we can all chip in to help OSS maintainers and make their jobs a little bit easier, short of reaching for your credit card (which we should all be doing too).

You're going to be walking a thin (and difficult) line if you're trying to find open source developers also interested in introducing involuntary tracking to their software. Open Source software, since its inception, has been about creating respectful software for the commons.

> Do you think it's still wrong when it's OSS developers trying to sell their services or premium offerings to the companies that already rely on their work?

No, but I shouldn't have to worry about that as a user. The onus is on corporations to disclose the software that they use in accordance with their respective licenses, the regular user doesn't deserve to suffer for the incompetence of funded organizations.

> Why should we hold OSS developers to an even higher standard than what we tolerate from large companies?

You don't, they do. That's the point of open source licensing in the first place: defining what you're comfortable with other people using your software for. By choosing an Open Source license, you're assuming one of the most difficult and thankless positions in the world of software. That's how it's intended to be though, because that kind of transparency is imperative when we're distributing free software. You wouldn't poison the rations being donated to the homeless, so why are you comfortable poisoning the CDN of my download? This all seems pretty cut and dried to me.

Sigh. Time to start dropping Scarf URLs in my hosts file...

This argument conflates the licensing of a piece of software with the distribution channel that distributes artifacts of that software. The service being discussed here is purely part of the distribution layer and has no footprint on the artifacts themselves. It's merely a passthrough layer sitting in front of the current stack.

If you are using open source today, you're already hitting servers that have access to all of the same information Scarf sees. Visiting a URL is by definition asking a server on the other side to process your request. That data can be very helpful to all of the great open source maintainers out there, but has historically been difficult or impossible to access. The result will be better informed maintainers, and better OSS for everyone.

Take your cut, pick-and-choose your criticism, but remember you're working in a privacy-conscious sector.
That means you can't distinguish between one user downloading many times and many users on one ISP, right?
User agent and other headers can be used to provide more differentiation, but you're correct to point out that limitation (assuming you meant IP not ISP).
Do you mean IP, not ISP?
No, I meant ISP, although IP could work as a special case - if this isn't recording the user's actual IP but just information derived from it (rough location, residential/commercial/datacenter, whatever), I would expect many addresses under that ISP to have the same recorded details. Granted, CGNAT with the same exact public IP would be even more like that, but if you don't record the actual IP then you probably can't deduplicate close "neighbors".

(I should add that your sibling comment says they're using browser headers, which probably reduces this issue a lot)
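
To make the limitation concrete, here is a minimal sketch. The field names and values are made up for illustration and are not Scarf's actual schema. If only IP-derived details are stored, every download from the same ISP and region collapses into one key; adding a header like the user agent separates some of them, but identical clients behind the same ISP still look the same.

```python
def count_distinct_sources(events, keys):
    """Count unique combinations of the given fields across events."""
    return len({tuple(event[k] for k in keys) for event in events})


# Hypothetical stored records: no raw IP, only derived details plus a header.
events = [
    {"isp": "ExampleNet", "region": "US-West", "user_agent": "curl/8.0"},
    {"isp": "ExampleNet", "region": "US-West", "user_agent": "curl/8.0"},
    {"isp": "ExampleNet", "region": "US-West", "user_agent": "pip/23.1"},
]

# IP-derived fields alone: all three downloads are indistinguishable.
by_ip_metadata = count_distinct_sources(events, ("isp", "region"))

# Adding the user agent helps, but the two curl downloads still merge.
with_user_agent = count_distinct_sources(events, ("isp", "region", "user_agent"))

print(by_ip_metadata, with_user_agent)
```

The two `curl/8.0` rows could be one person downloading twice or two neighbors downloading once each; nothing in the stored data can tell them apart.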

Information derived from IP can be much more granular than just ISP level