The biggest hurdle to making quick progress in AI is the lack of compute to train our own original models, yet there are millions of gamers with GPUs sitting around barely getting used, potentially an order of magnitude more compute than Google and Amazon combined. I've figured out a way though we can connect hundreds of computers together to train AI models by using gradient accumulation.
How it works is by doing several training steps and accumulating the loss of each step, then dividing by the amount of accumulation steps taken before the optimizer step. If you have a batch size of 4 and do 256 training steps before an optimizer step, it's like training with a batch size of 1024. The larger the batch size and gradient accumulation steps are, the faster the model converges and the higher final accuracy it achieves. It's the most effective way to use a limited computing budget:
https://www.youtube.com/watch?v=YX8LLYdQ-cA These training steps don't need to be calculated by a single computer but can be distributed across a network.
A decent amount of bandwidth will be required to send the gradients each optimizer step and the training data. Deep gradient compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, but it's still going to be using about 0.5 MB download and upload to train something like GPT2-medium each optimizer step, or about 4-6 mbps on a Tesla T4. However, we can reduce this bandwidth by doing several training steps before contributing gradients to the server. Taking 25 would reduce it to about 0.2 mbps. Both slow and fast computers can contribute so long as they have the memory to hold the model. A slower computer might only send one training step whereas a fast one might contribute ten to the accumulated gradient. Some research needs to be done if a variable accumulation step size impacts training, but it could be adjusted as people join and leave the network.
All that's needed to do this is a VPS. Contributors wanting anonymity can use proxies or TOR, but project owners will need to use VPNs with sufficient bandwidth and dedicated IPs if they wish that much anonymity. The VPS doesn't need an expensive GPU rental either. The fastest computer in the group could be chosen to calculate the optimizer steps. The server would just need to collect the gradients, decompress them, add them together, compress again and send the accumulated gradient to the computer calculating the optimizer step. Or if the optimizing computer has sufficient bandwidth, it could download all the compressed gradients from the server and calculate the accumulated gradient itself. My internet has 200 mbps download so it could potentially handle up to 1000 computers by keeping the bandwidth to 0.2 mbps. Attacks on the network could be mitigated by analyzing the gradients, discarding nonsensical ones and banning clients that send junk, or possibly by using PGP keys to create a pseudo-anonymous web of trust.
Libraries for distributed training implementing DGC already exist, although not as advanced as I'm envisioning yet:
https://github.com/synxlin/deep-gradient-compression
I think this will also be a good way to get more people involved. Most people don't know enough about AI or robotics enough to help but if they can contribute their GPU to someone's robowaifu AI they like and watch her improve each day they will feel good about it and get more involved. At scale though some care will need to be taken that people don't agree to run dangerous code on their computers, either through a library that constructs the models from instructions or something else. And where the gradients are calculated does not matter. They could come from all kinds of hardware, platforms and software like PyTorch, Tensorflow or mlpack.