by guites 1185 days ago
Hey! Glad to see flower getting attention on hn.

I've been working on a project for over a year that uses flower to train cv models on medical data.

One aspect that we see being brought up again and again is how we can prove to our clients that no unnecessary data is being shared over the network.

Do you have any tips on solving that particular problem? I.e. proving that no data apart from model weights is being transferred to the centralized server?

Thanks a lot for the project.

edit: Just to clarify I am aware of differential privacy, I'm talking more on a "how to convince a medical institution that we are not sending its images over the network" level.

3 comments

If you're concerned about data leakage, it's worth noting that model weights can very easily be used to reconstruct the original data that the model was trained on, so it could be misleading to claim that user data isn't being shared over the network. To avoid this, you'd need to look into techniques like Secure Aggregation or local differential privacy. Flower does provide some of this, FWIW.
This doesn’t sound right. If they don’t know the structure of the NN, how can they reconstruct the data from the weights alone? (Perhaps the structure is communicated within the weights?)
Every agent training the model on their proprietary data has to have access to the model form in some way (otherwise how would they train it?)

For this reason, one must assume that the model form is known to the adversary.

With this, the question becomes: is it possible to reconstruct training data from a trained model? We already know that, at least for some image models, the answer to that question is "yes": https://arxiv.org/pdf/2301.13188.pdf

That must only be true if there isn’t a one way compression step occurring, or any approximation in the whole model.
I don't think lossy compression is sufficient. The very first example in the paper I linked to is clearly not identical to the original image (=lossily compressed) yet leaks a training image in a way that would be highly problematic in certain domains, e.g. medical imaging.
I see what you are saying. Agree. Seems we need some set patterns in NN models that will reliably remove reversibility without affecting loss too drastically.
Hi guites, Thank you! That is undoubtedly something relatable. We have it on our radar and plan to provide material and presentations to help convince stakeholders. If you are up for a call to share the specific challenges, we could ideate with you.
Would love to! You can grab my email on my profile. Could you ping me over there? Thanks
Thanks, glad you like it!

One approach to increase transparency on the client side (and build trust with the organization where the Flower client is deployed) is to integrate a review step that asks someone to confirm the update that gets sent back to the server.

On top of that, you should definitely use differential privacy. To quote Andrew Trask here: "friends don't let friends use FL without DP". Other approaches like Secure Aggregation can also help, depending on what kind of exposure your clients are concerned about.

My general take is that the best way to solve for transparency and trust is to tackle it on multiple layers of the stack.
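To make the DP layer a bit more concrete, here is a minimal Python sketch of what a client could do to an update before it leaves the machine: clip it to a maximum L2 norm, then add Gaussian noise. This is an illustration only and not Flower's API. `clip_and_noise`, `clip_norm`, and `noise_multiplier` are hypothetical names, the update is assumed to be a flat list of floats, and no privacy budget (epsilon, delta) is computed here.

```python
import math
import random


def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip a flat update to an L2 norm of clip_norm, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm.
    All names here are hypothetical and chosen for illustration.
    """
    norm = math.sqrt(sum(v * v for v in update))
    # Scale down only if the update exceeds the clipping threshold.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [v * scale for v in update]
    sigma = noise_multiplier * clip_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]


rng = random.Random(0)           # fixed seed only to make the demo repeatable
raw_update = [3.0, 4.0]          # L2 norm is 5.0, so it gets clipped to 1.0
private_update = clip_and_noise(raw_update, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
print(private_update)
```

The actual privacy guarantee depends on the noise level, the number of rounds, and how clients are sampled, so a real deployment would also need proper privacy accounting.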

A review step sounds like a good idea. Our implementation involves very little interaction on the client side, besides setting up the datasets etc., so maybe a way to log information sent for later inspection would help.
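A rough sketch of what such a log entry could look like, in plain Python: for every outgoing update, record the per-layer parameter count, byte size, and a SHA-256 hash, and check that the payload has exactly the size the model architecture predicts. Everything here is an assumption for illustration (the `audit_record` helper, the dict-of-flat-lists update format, the float32 encoding); it is not how Flower serializes parameters.

```python
import hashlib
import json
import struct


def serialize_layer(values):
    """Pack a flat list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def audit_record(update, expected_counts):
    """Summarize an outgoing update and compare it to the known model shape.

    update: dict of layer name -> flat list of floats (hypothetical format)
    expected_counts: dict of layer name -> number of parameters in that layer
    """
    layers = {}
    total_bytes = 0
    for name, values in update.items():
        payload = serialize_layer(values)
        layers[name] = {
            "count": len(values),
            "bytes": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
        total_bytes += len(payload)

    # The update matches only if it has the same layers with the same sizes.
    matches = (
        update.keys() == expected_counts.keys()
        and all(layers[n]["count"] == expected_counts[n] for n in layers)
    )
    return {
        "layers": layers,
        "total_bytes": total_bytes,
        "expected_bytes": 4 * sum(expected_counts.values()),
        "matches_model": matches,
    }


expected = {"conv1": 6, "fc": 4}
update = {"conv1": [0.1] * 6, "fc": [0.5, -0.5, 0.25, 0.0]}
record = audit_record(update, expected)
print(json.dumps(record, indent=2))
```

A log like this only shows that nothing beyond the expected number of parameters was sent. It says nothing about what could be inferred from the weights themselves, which is the separate concern raised elsewhere in this thread.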

I'll be looking into secure aggregation as I'm not fully aware of how it works. As of now we rely on differential privacy only.

Thanks!

Cool. I saw a proposal to use TEEs for secure aggregation. OpenFL uses Gramine for that. Not sure if that provides sufficient protection, really, but worth having on the radar.

https://arxiv.org/abs/2105.06413 https://openfl.readthedocs.io/en/latest/index.html https://gramineproject.io/

Flower has an agreement to develop interoperable components with OpenFL. This is part of the broader plan by Intel to work with a consortium of players (that includes Flower Labs) and have the output code sit with the Linux Foundation. Making TEE support within OpenFL for SA accessible to Flower users is precisely the type of opportunity we seek to make possible by working with Intel on this.

This is the official press release for those who are interested: https://www.intel.com/content/www/us/en/newsroom/news/transi...

More broadly, in regard to your comment -- our current SA support does not require hardware support, which is what we targeted first, so that it can be broadly adopted across many potential hosts of FL aggregation servers. It is suitable for most applications in need of privacy, although it still requires certain assumptions to be met, such as the number of nodes within a round, and other factors.
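For readers unfamiliar with software-only secure aggregation, the core idea can be sketched with pairwise masks: each pair of clients derives the same random mask, one adds it and the other subtracts it, so the masks cancel in the server's sum while each individual update looks random. The Python below is a toy illustration under loud assumptions, not Flower's SA implementation. The `round_seed` string stands in for a secret that each pair of clients would normally agree on via key exchange (here the server could recompute it), updates are already quantized to integers, and client dropouts are not handled.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value


def pairwise_mask(client_id, other_id, length, round_seed):
    """Mask shared by one pair of clients (seed derivation is a stand-in)."""
    low, high = sorted((client_id, other_id))
    rng = random.Random(f"{round_seed}:{low}:{high}")
    return [rng.randrange(MODULUS) for _ in range(length)]


def mask_update(client_id, all_ids, update, round_seed):
    """Add the mask for each higher id and subtract it for each lower id."""
    masked = list(update)
    for other in all_ids:
        if other == client_id:
            continue
        mask = pairwise_mask(client_id, other, len(update), round_seed)
        sign = 1 if client_id < other else -1
        masked = [(m + sign * r) % MODULUS for m, r in zip(masked, mask)]
    return masked


def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel out modulo MODULUS."""
    length = len(masked_updates[0])
    return [sum(u[i] for u in masked_updates) % MODULUS for i in range(length)]


ids = [0, 1, 2]
updates = {0: [1, 2, 3], 1: [10, 20, 30], 2: [100, 200, 300]}
masked = [mask_update(i, ids, updates[i], round_seed=42) for i in ids]
total = aggregate(masked)
print(total)  # [111, 222, 333]
```

This also hints at why the number of nodes in a round matters: with very few participants, the sum alone reveals a lot about each individual contribution.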

What about MPC + DP? Are you planning to integrate any SMPC algorithms into Flower, or do you see any limitations that prevent it?

I'm trying to apply federated learning to the medical domain too and I'm trying to define the best "stack" that guarantees privacy and compliance with regulations like the GDPR.

I can’t speak for Flower’s core dev roadmap, but PySyft is in the process of integrating Flower and some Secure Enclave options which would let you do this.

Congrats on the launch Flower team!

Thanks! We're huge fans of the work that PySyft is doing, and we're very supportive of the Flower PySyft integration.
Agreed that this is an interesting direction. The core Flower abstractions are "federated learning agnostic", which means that they can be used for different kinds of distributed/federated workloads, not just federated learning. We'll add examples for more approaches (like SMPC) in the future, we just don't have the bandwidth to do it immediately.