by mncharity 2274 days ago
Zotero phones home with perhaps more information than you expect.[1]

By default, there's a persistent connection whenever Zotero is running, a request when you visit a site with a translator (eg NYTimes) the first time since a browser start, when you download a PDF, etc.

I enjoyed using it, but their approach to privacy felt creepy. That [1] is somewhat improved... but only somewhat.

[1] https://www.zotero.org/support/privacy#disabling_automatic_r...


(Disclosure: Zotero developer)

Criticizing Zotero for privacy, of all things, is a bit bizarre. Zotero is an open-source project from a nonprofit organization with no financial interest in people's research data. It's designed as a local tool specifically to give people complete control over their data, and it's developed in the open. Most similar tools are proprietary programs owned by major publishers or analytics companies with voracious appetites for data.

The page you linked to explains the reasons for every single network connection that Zotero makes and how to disable it. Every one enables a specific Zotero feature — push-based auto-sync, fast translator updates as sites change to minimize save failures, open-access PDF retrieval. When we implemented retraction notifications, we even did it using k-anonymity to avoid sending up library data from people who don't use syncing.
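As a rough illustration of the k-anonymity idea mentioned above: the client sends only a short prefix of a hashed identifier, the server returns every known entry sharing that prefix, and the final match happens locally. This is a hypothetical sketch, not Zotero's actual implementation; the hash choice, `PREFIX_LEN`, and all function names are assumptions made for the example.

```python
import hashlib
from typing import Set

PREFIX_LEN = 5  # assumed prefix length; shorter prefixes put more items in each bucket


def hash_id(identifier: str) -> str:
    """Hash a normalized identifier (e.g. a DOI) to a hex string."""
    normalized = identifier.strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def server_lookup(prefix: str, retracted_hashes: Set[str]) -> Set[str]:
    """Stand-in for the server: it only ever sees the short prefix."""
    return {h for h in retracted_hashes if h.startswith(prefix)}


def check_retracted(identifier: str, retracted_hashes: Set[str]) -> bool:
    """Client side: send the prefix, then match the full hash locally."""
    full_hash = hash_id(identifier)
    candidates = server_lookup(full_hash[:PREFIX_LEN], retracted_hashes)
    return full_hash in candidates


if __name__ == "__main__":
    # Made-up identifiers, used only to exercise the functions.
    retracted = {hash_id("10.1000/retracted.1"), hash_id("10.1000/retracted.2")}
    print(check_retracted("10.1000/retracted.1", retracted))
    print(check_retracted("10.1000/fine.3", retracted))
```

The server learns only that the client holds some item whose hash falls in a given bucket, not which item it is.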

We're always happy to discuss design decisions in our forums, but I'd argue pretty strongly that privacy is one of the main reasons one should use Zotero, not the other way around.

Zotero's privacy features even extend to where data is stored: you can bring your own WebDAV server [1] and have Zotero store data there.

[1] https://www.zotero.org/support/sync#webdav

You do realize that (unless something changed recently) all your metadata is still on their servers? Last time I checked, a Docker image of the sync server was in the works though.
...and you don't automatically upload pirated scihub papers to Elsevier cloud storage, like Mendeley does. That feature alone makes you a winner on the privacy front as far as I'm concerned!
Thanks for your work! It's nice to hear that Zotero seems to have privacy as a feature and not a side thought.
This conversation is now days old, and people have moved on, but here's one quick belated thought. No reply needed, just fyi, fwiw. I was at an embassy party years ago, and apparently a friendly European embassy, in a US city, had a couple of people keeping track of local research, and giving heads-ups to their national industry and academia. That's just one embassy, in one city. I suggest Zotero has very different threat profiles across different fields and research topics. And for some, the possibility of state actors should be included. Which suggests a need for users to easily notice and adjust their exposure profiles. And is a reminder that the user data Zotero's servers are least likely to compromise is data they never see at all. Fwiw. Thanks again for your work.
Love Zotero. I'd love to install it on my own server instance, although, I think by renting the Zotero storage space now, I'm helping support you and the whole Zotero project.
Thank you for your work.

> designed as a local tool

Nod, a local tool. I have various expectations of my local tools. And if I, say, start Zotero in the morning to read a paper, then exit it for a meeting, then return to it afterward, and then exit for lunch, then at least my own expectations for a local tool are, for example, in tension with those four centralized timestamps. As are the varying tcp routes as I move my laptop among buildings. As is the request when I surf to the NYTimes during lunch.

So what does privacy best practice look like? One comment here suggests the ability to fork and edit the code. Another notes the linked documentation, and being more ethical than Elsevier. The linked page notes the existence of scattered opt-out options. And also "You can avoid these requests by keeping Zotero open while you browse the web."

My own understanding of privacy best practices, includes data exposure being opt-in rather than opt-out, and those privacy preferences being easily seen and changed in one place. My impression is Zotero doesn't do these.

And that's just Microsoft-style privacy practice. It would be even nicer to have knobs, like "check for updates every <start/day/week/...>".

> Criticizing Zotero for privacy, of all things, is a bit bizarre.

I'd be fine with "we have limited resources; know privacy is important; are improving; know we have work to do to implement best practices, are working towards it".

But my own fuzzy long-term impression has been, that such recognition has not been proportional to the potential degree of privacy exposure.

I think it's important to look at these things in the context of the features they're enabling and user expectations. The fact that Zotero is a local, configurable, open-source tool is what gives you complete control over it, but it's not just a local database. It's deeply connected to a world of constantly changing websites, metadata sources, and services, and using most of Zotero's features implies relying on those things. If you want to save metadata from a website, Zotero might need to retrieve metadata from Crossref. If you want it to find an open-access PDF, it needs to connect to an online database to check for one. And if you want saving to continue working as sites change, it needs up-to-date translators. From a normal user's perspective, the alternative is just Zotero not doing the things they downloaded it to do.

> My own understanding of privacy best practices, includes data exposure being opt-in rather than opt-out

Surely you don't expect software to default to not receiving updates automatically? As the linked section says, if you disable translator/style updates and don't use auto-sync, there won't be a persistent connection. But if a high-profile site breaks and we roll out a fix, the longer the delay the more people will just get an error trying to save.

> those privacy preferences being easily seen and changed in one place

We document every single network request that Zotero makes. Expecting them to all be configurable in one place in the software just isn't reasonable. Normal users think of features, not HTTP requests, and auto-sync doesn't have anything to do with translator update checks.

> I'd be fine with "we have limited resources; know privacy is important; are improving; know we have work to do to implement best practices, are working towards it".

OK, but I'm not saying that. I'm saying we consider privacy in all our decisions and believe we've made the right calls (and, for what it's worth, I can't recall a single complaint about our approach to privacy in many years). If you disagree with a specific decision, that's fine — come to the forums and we can discuss. But let's be clear about the features that would break for users as a result.

Thanks for your thoughtful replies. I see one clear disagreement, and speculate about a more-root divergence.

> configurable in one place in the software just isn't reasonable [...] auto-sync doesn't have anything to do with translator update checks

Microsoft has in one place (something very vaguely like) toggles to control the uploading of web history, handwriting, voice commands, and more. Different features of different apps. With explanations of the functionality lost if the user doesn't opt in to each. One place, for privacy preferences.

The Zotero privacy documentation page similarly gathers in one place, recipes for opting-out of network-based features, with descriptions of use.

Software preferences having a privacy section is a thing. Firefox, chromium, etc.

I'm unclear on why it isn't reasonable for Zotero software to have similar.

> we consider privacy in all our decisions and believe we've made the right calls [...] If you disagree with a specific decision, that's fine — come to the forums and we can discuss

I suggest there's currently a shift in privacy best practices, from one-size-fits-all "make the right calls", to having user preferences for privacy.

So that's the sort-of clear disagreement.

But part of it may be a deeper difference in perspectives... perhaps call it network minimalism.

When using Zotero, I'd spend more time grovelling over previously collected papers, than collecting new ones. A task that could be done, without loss of functionality, with the net disconnected. My expectation then is, that this local tool, working with local data, will not then start using the net merely because it becomes available. Or rather, that I can easily dissuade such behavior.

Now perhaps that expectation is becoming "old fashioned", as we switch from desktop, to phone apps with only lightly bridled communication lives of their own.

Which might be an underlying issue. Zotero might be thought of as a phone app which just happens to run on desktop-local data. Or it might be thought of as a traditionally desktop application. Design decisions appropriate to the former, might feel a bit odd in the latter. "Local tool" might mean different things.

> I can't recall a single complaint about our approach

In this thread, there was someone suggesting my short paraphrasing of the linked docs was getting it totally wrong. I'm not sure how widely your users are even aware of the approach. It seems users generally aren't. Which, tying things back around, is one of the motivations for having clearly explained privacy preference options.

Thanks for an interesting conversation. Just in case you haven't seen it, the subthread with jmiserez might also be of interest.

> Software preferences having a privacy section is a thing. Firefox, chromium, etc.

Yes, and Firefox's Privacy & Security section doesn't cover Firefox Sync, the default search engine, search bar suggestions, the new tab pane, the default homepage, or app update checks. Those all make network requests to various services, and they're all controlled in their own sections in the preferences where they make more sense. And you can't turn off loading a website when you enter a URL.

Grouping a few more prefs together in Zotero might make sense, but in a modern, web-connected tool, there's just a lot of functionality where the network connectivity is implicit. The main difference in Zotero is that we document it all and tell you how to turn it off.

Specifically switched over to Zotero because of the non-profit status. The only privacy feature request I'd make is to allow some kind of self-hosted sync, i.e. a deployable TLS sync server + preferences entry to specify a sync server ip/port. I imagine it would take load off you guys for syncing, and people would end up hosting sync servers for small groups on university networks.
To be fair, that's a pretty nice link, not many companies provide such an overview. And most (all) of those features are useful things that you'd probably actually want to enable in daily use.

And Zotero can run completely offline, without an account.

Now compare that to Zotero's biggest competitor: https://www.mendeley.com/terms/privacy

Indeed. There does seem to be an issue of what baseline to use. Elsevier isn't usually thought of as a useful ethical baseline, but here it's a displaced competitor. Companies embed a variety of telemetry. And while much open source doesn't, some does, and this is perhaps increasing. And yet, if say Firefox always maintained an open but empty connection to Mozilla, would it be adequate to suggest "well, no, that's not under privacy preferences, but it's documented on our privacy web page, and can be disabled by editing about.config.mumble"? Perhaps with the shift from desktop to phone, expectations of what it means to be local are changing? "Your core data is local, not hostage", but now "of course the app chats on the web... doesn't everything?"?
Well Firefox actually does (Web Push API): https://support.mozilla.org/en-US/kb/push-notifications-fire...

> Firefox maintains an active connection to a push service in order to receive push messages as long as it is open. The connection ends when Firefox is closed.

I see what you're getting at, but I think the harshest criticism should be reserved for the worst services: those that actually hold your data hostage, don't provide an export functionality and use your data in all sorts of unethical ways.

Organisations that actually honestly do value privacy and try to make an effort to get it "right" should be given the benefit of the doubt and constructive criticism, as they might actually listen. In many cases the feature may simply be driven by convenience and the competition (e.g. cloud storage, accounts, sync), and having a toggle for those is the best you can do if they want to stay relevant. In other cases the privacy issue may have simply been overlooked and the feature is improved (IIRC Mozilla has had a few of these).

Maybe a big red "offline-only" toggle would be great, but the absence of that button does not in my eyes disqualify Zotero from being a great offline solution.

Wow. And agreed. One question, re "being a great offline solution", in what sense offline? Able to work without net?

Thinking about dstillman's reply, I was thought-crawling towards a "local/remote vs online/offline" distinction. So Zotero would be local but always online (if net is available). Versus my expectation that when using only local resources, a local tool will be offline.

I think dstillman put it best, the features that make Zotero actually useful need to connect to the internet, like fetching metadata. You wouldn't want to enter everything by hand (but you can).

I think your expectation is reasonable (a local tool will be offline when using only local resources), and it's definitely possible with Zotero (disable the automatic translator/style updates) but just not the default setting.

On a technical level, I don't think there is a huge difference between polling and maintaining a persistent connection, if the polling interval is short or the keepalives are long. The real question that I'd find interesting is why the translator/style updates must be "instant". For my use, once a day or once every few hours would probably be more than sufficient.
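To make the polling alternative concrete, here is a minimal sketch of a user-configurable check interval. It is purely illustrative: the setting names (`"never"`, `"hourly"`, `"daily"`, `"weekly"`) and the function `update_check_due` are hypothetical and do not correspond to any actual Zotero preference.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical preference values mapped to polling intervals.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def update_check_due(last_check: Optional[datetime], now: datetime, interval: str) -> bool:
    """Return True if a translator/style update check should run now."""
    if interval == "never":
        return False  # user opted out of automatic checks entirely
    if last_check is None:
        return True  # no check recorded yet, so one is due
    return now - last_check >= INTERVALS[interval]


if __name__ == "__main__":
    now = datetime(2020, 1, 2, 12, 0)
    last = datetime(2020, 1, 2, 9, 0)  # three hours earlier
    print(update_check_due(last, now, "hourly"))
    print(update_check_due(last, now, "daily"))
```

With a short interval this approaches the behavior of a persistent connection; with a long one it trades freshness of fixes for fewer requests.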

> The real question that I'd find interesting is why the translator/style updates must be "instant". For my use, once a day or once every few hours would probably be more than sufficient.

Before Zotero had push-based auto-sync, translator/style updates were indeed once a day, but that meant that, if a high-profile site changed and we pushed out an updated translator, we'd continue to get reports of the site being broken for 24 hours. We could say to update translators manually, but that would only help the people who made it to the forums.

When we added WebSocket support for syncing, we decided to send translator/style update notifications over the same connection. For anyone using auto-sync anyway (many/most users), there's no difference. If you don't use syncing/auto-sync, it's more debatable, but it's a choice between trying not to expose IP addresses that are likely already making at least some other anonymous requests (app updates, retraction checks, OA PDF checks) and decreasing the amount of breakage that users encounter after we've already fixed something.

Nod. One way to address conflicting desires for both opt-in and defaults is an onboarding step: "want to use net for features... can customize under Preferences/Privacy... [ok]". Informing and consent are then somewhat spread out in time, and split between that step and the preferences, but... it seems to be current best practice.
It's an open source project. Feel free to fork it and remove parts you dislike.
OP's criticisms were fair enough to not deserve a dismissive response.

I for one am thankful for their warnings.

See the response by the developer and the actual page linked. It seems OP didn't actually read the link he posted, and the critique was fully incorrect.
Their warnings are wrong.