This looks very interesting. I'd like to see a model trained on the complete body of scientific research literature from the past 100 years or so. I wonder if this approach could facilitate that?
Yes, this would be exciting to see. One approach wouldn't require federated learning, however. If you had direct access to the data, you could build a conventionally trained large language model (i.e., with all the data collected in one data center). However, given the context of this discussion, you are probably asking whether we could use Flower to train in a federated manner. I believe so. Although again, we'd probably be training an LLM, which brings added complications due to its size (and other factors). Internally at Flower we have been testing methods to overcome this and are confident we can pull this off. One could imagine someone hosting a pre-trained LLM and contributing institutions acting as nodes in the network, each performing some small part of the training based on the fraction of the literature they have access to. We plan to release LLM-based federated technology in the coming months.
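To make the "institutions as nodes" idea concrete, here is a minimal sketch of federated averaging in plain Python. It is not Flower's API and not the method being tested internally. The function names (`local_update`, `fed_avg`, `run_rounds`), the toy two-parameter linear model, and the three synthetic "institutions" are all assumptions made for illustration. Each node trains on its own private data, and only the resulting weights are sent back and averaged, weighted by how much data each node holds.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """One node's share of training: a few epochs of SGD fitting y ~ w*x + b
    on data that never leaves the node."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return [w, b]

def fed_avg(updates, sizes):
    """Average the nodes' weights, weighted by each node's number of examples."""
    total = sum(sizes)
    n_params = len(updates[0])
    return [sum(u[i] * n for u, n in zip(updates, sizes)) / total
            for i in range(n_params)]

def run_rounds(node_data, rounds=50):
    """Server loop: broadcast the global weights, collect local updates, average."""
    global_w = [0.0, 0.0]  # stand-in for the hosted pre-trained model
    for _ in range(rounds):
        updates = [local_update(global_w, d) for d in node_data]
        global_w = fed_avg(updates, [len(d) for d in node_data])
    return global_w

# Hypothetical setup: three institutions holding different amounts of data
# drawn from the same underlying relationship y = 2x + 1.
random.seed(0)
node_data = [
    [(x, 2 * x + 1) for x in (random.uniform(-1, 1) for _ in range(n))]
    for n in (20, 50, 100)
]
model = run_rounds(node_data)
print(model)
```

With an actual LLM the weights would be billions of parameters rather than two, which is where the size-related complications mentioned above come in.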
For those who are interested: the best work I've seen so far on training very large models under federated learning, and one that also makes very realistic assumptions about the likely underlying participating hardware, is this: https://arxiv.org/abs/2206.11239 -- although I expect more in this direction to come soon.
I'm not sure that this would be as useful as one might think at face value. When you stretch out the training corpus like that, you're going to have more noise/inaccuracies/refuted facts than you will have correct information.
It's also unclear how useful full scientific articles are. Microsoft's PubMedBERT interestingly showed that pretraining on PubMed abstracts worked better than using PMC full text.