Currently, the cost barrier to training state-of-the-art language models is extremely high. GPT-4 is estimated to have cost more than $100 million to train, while Anthropic's CEO predicts that training SOTA models will cost $1 billion this year and $10 billion the year after. In practice, this means that only a small oligarchy of well-funded tech giants has the ability to train these models.
As these models grow more intelligent and gain more influence over our daily lives, so does the power of their owners, who end up deciding how the models are censored and whose values they incorporate. In effect, we are governed by AI trained on a constitution we never voted for.
Blockchain, the decentralisation movement, and more specifically Bittensor have proved that they can offer alternatives to this centralised approach by incentivising the masses to pool their resources to carry out useful work. As Const often mentions, the collective compute that goes into mining Bitcoin far exceeds the compute of any Google, Microsoft, OpenAI or Anthropic data centre.
Granted, machine learning requires a different type of compute, but if a decentralised mechanism can incentivise that specific type of compute in a similar way, whilst accurately validating it, then in theory it can marshal a pool of compute of a similar size, if not larger, to train an extremely large single model.
Our proposed solution is a subnetwork that incentivises Compute, Bandwidth and Latency. The compute powers the training of each miner's local copy of the model, while the bandwidth and latency power the averaging of each miner's locally accumulated gradients through an operation called butterfly all-reduce. Once this process completes successfully, every miner holds the same globally averaged gradient, which it uses to update its local model weights.
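To make the averaging step concrete, here is a minimal sketch of a butterfly all-reduce in plain NumPy. It simulates the miners inside a single process rather than over a real network, and the function name, shard sizes and miner count are illustrative assumptions, not part of the subnet's actual codebase. The key idea it demonstrates: each miner averages only one shard of the gradient vector, yet every miner ends up with the full global average.

```python
import numpy as np


def butterfly_all_reduce(local_grads):
    """Simulated butterfly all-reduce among N miners.

    Each miner's gradient vector is split into N shards; miner i collects
    shard i from every peer and averages it (reduce-scatter), then the
    averaged shards are reassembled by everyone (all-gather). Each miner
    only averages 1/N of the full vector itself.
    """
    n = len(local_grads)
    # Reduce-scatter: split each local gradient into n shards.
    shards = [np.array_split(g, n) for g in local_grads]
    # Miner i averages the i-th shard collected from all peers.
    averaged_shards = [
        np.mean([shards[peer][i] for peer in range(n)], axis=0)
        for i in range(n)
    ]
    # All-gather: every miner reassembles the full averaged gradient.
    global_grad = np.concatenate(averaged_shards)
    return [global_grad.copy() for _ in range(n)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    miners = [rng.normal(size=16) for _ in range(4)]  # 4 miners, toy gradients
    results = butterfly_all_reduce(miners)
    # Every miner should now hold the same globally averaged gradient.
    assert all(np.allclose(r, np.mean(miners, axis=0)) for r in results)
    print("all miners hold the global average")
```

Because each miner is only responsible for one shard, the communication and averaging load is spread roughly evenly across participants, which is why bandwidth and latency, rather than raw compute, dominate the cost of this step.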