by mryab 1979 days ago
Not directly related, but the Learning@home [1] project aims to achieve precisely that goal of public, volunteer-trained neural networks. The idea is that you can host separate "experts," or parts of your model (akin to Google's recent Switch Transformers paper) on separate computers.

This way, you never have to synchronize the weights of the entire model across the participants — you only need to send the gradients/activations to a set of peers. Slow connections are mitigated with asynchronous SGD and unreliable/disconnected experts can be discarded, which makes it more suitable for Internet-like networks.
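To make the "discard unreliable experts" part concrete, here is a minimal sketch, not the project's actual code. It assumes a top-k gating scheme as in mixture-of-experts models, and that an expert that timed out or disconnected is represented by `None`. The function name `combine_expert_outputs` and its parameters are hypothetical.

```python
import math

def combine_expert_outputs(gate_scores, expert_outputs, k=2):
    """Weighted average of the top-k experts that actually responded.

    gate_scores:    dict expert_id -> float score from a gating function
    expert_outputs: dict expert_id -> list of floats, or None if the peer
                    timed out or disconnected (assumption of this sketch)
    """
    # Pick the k experts with the highest gating scores.
    selected = sorted(gate_scores, key=gate_scores.get, reverse=True)[:k]
    # Drop experts that failed to respond instead of waiting for them.
    alive = [e for e in selected if expert_outputs.get(e) is not None]
    if not alive:
        raise RuntimeError("no selected expert responded")
    # Softmax over the surviving experts only, so the weights sum to 1.
    top = max(gate_scores[e] for e in alive)
    exps = {e: math.exp(gate_scores[e] - top) for e in alive}
    total = sum(exps.values())
    out = [0.0] * len(expert_outputs[alive[0]])
    for e in alive:
        weight = exps[e] / total
        for i, value in enumerate(expert_outputs[e]):
            out[i] += weight * value
    return out
```

Only the selected peers ever see the activations, and a dead peer just changes the normalization rather than stalling the step.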

Disclaimer: I work on this project. We're currently implementing a prototype, but it's not yet GPT-3 sized. Some issues like LR scheduling (crucial for Transformer convergence) and shared parameter averaging (for gating etc.) are tricky to implement for decentralized training over the Internet.

[1] https://learning-at-home.github.io/

3 comments

Your project looks so interesting. Have you thought of putting the experts on a distributed market where their expertise and work can be exchanged for some token (obviously using a blockchain)?

This would encourage people to host experts in your network and would create value.

Thank you! This is definitely something we should look into in the future (hopefully with community help); as of now, training infrastructure and model convergence are the highest priorities. That said, we welcome all ideas on how to motivate more volunteers to join the experiments, because the Learning@home team comes from a distributed DL background with limited volunteer computing expertise.

Also, I believe that for some projects (e.g., a GPT-3 replication effort) people would want to join the network regardless of the incentive mechanism, as demonstrated by Leela Chess Zero [1].

[1] http://lczero.org/

How do you deal with adversarial/byzantine updates that attempt to degrade performance or even install a backdoor? Do you use plain averaging, or some other aggregation algorithm like Multi-Krum?
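For readers unfamiliar with the difference, here is a rough sketch of Multi-Krum as an alternative to plain averaging. It assumes each update is a flat list of floats and that at most `f` of the `n` updates are malicious. The parameter names (`updates`, `f`, `m`) are chosen for illustration, and this is not tied to any particular library.

```python
def multi_krum(updates, f, m):
    """Average the m updates with the lowest Krum scores.

    updates: list of n gradient vectors (lists of floats)
    f:       assumed upper bound on the number of Byzantine updates
    m:       number of updates to keep and average
    """
    n = len(updates)
    if n < 2 * f + 3:
        raise ValueError("Krum needs n >= 2f + 3 updates")
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and n")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Krum score: sum of squared distances to the n - f - 2 closest updates.
    scores = []
    for i in range(n):
        dists = sorted(sq_dist(updates[i], updates[j]) for j in range(n) if j != i)
        scores.append(sum(dists[: n - f - 2]))

    # Keep the m updates that sit closest to their neighbours.
    chosen = sorted(range(n), key=scores.__getitem__)[:m]
    dim = len(updates[0])
    return [sum(updates[i][d] for i in chosen) / m for d in range(dim)]
```

With four identical honest updates and one extreme outlier, the outlier gets a large score and is excluded, whereas a plain mean would be dragged far off.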
For now, the only separation we have is that each worker is responsible for its own weights, since network security has not been our top priority. Still, we've been thinking about adding some security measures like proof-of-work for each node and detection of anomalous inputs/gradients (or simply NaN values). Right now we're running experiments on internal hardware, but before a public launch we'll make sure that malicious participants won't put everybody else's work to waste :)
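A minimal sketch of the kind of NaN/anomaly check mentioned above, assuming a gradient arrives as a flat list of floats. The function name and the `max_norm` threshold are hypothetical, and a real threshold would have to be tuned per model.

```python
import math

def is_acceptable_gradient(grad, max_norm=10.0):
    """Reject gradients with NaN/inf entries or an implausibly large norm."""
    # Any non-finite value (NaN, +inf, -inf) disqualifies the update.
    if not all(math.isfinite(g) for g in grad):
        return False
    # A very large L2 norm is treated as anomalous.
    norm = math.sqrt(sum(g * g for g in grad))
    return norm <= max_norm
```

This only catches crude failures; a carefully crafted malicious gradient with a normal-looking norm would still pass.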
This is also what I was thinking about. Since making up bad data requires no GPU work, unlike the computation done by honest nodes, the model can degrade quickly unless some measures are taken against adversarial nodes.

A draft solution would be for the central server to measure the quality of each update and drop the ones that don't perform well. This could work, since inference is much cheaper than computing gradients.
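Roughly, that check could look like the sketch below. It assumes the server holds a small held-out validation set and uses a toy linear model with MSE loss to keep the example self-contained. All names (`mse_loss`, `filter_updates`, `tolerance`) are made up for illustration.

```python
def mse_loss(weights, data):
    """Mean squared error of a linear model y = w . x over (x, y) pairs."""
    total = 0.0
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(data)

def filter_updates(weights, updates, val_data, lr=0.1, tolerance=0.0):
    """Keep only gradient updates that do not worsen validation loss.

    Each candidate costs one forward pass over val_data, no backprop.
    """
    base = mse_loss(weights, val_data)
    accepted = []
    for grad in updates:
        # Tentatively apply the update as a plain SGD step.
        candidate = [w - lr * g for w, g in zip(weights, grad)]
        if mse_loss(candidate, val_data) <= base + tolerance:
            accepted.append(grad)
    return accepted
```

For example, with `weights=[0.0]` and validation data `[([1.0], 1.0)]`, the honest gradient `[-2.0]` lowers the loss and is kept, while the flipped gradient `[2.0]` raises it and is dropped.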

Do you have a personal Twitter account I can follow? Your career is one I'd like to follow.
Sure! It's @m_ryabinin
Thanks! :D