Open Problems in Decentralized AI Infrastructure — Part 2: Communication, Resource Allocation and Incentives

Sarah Azouvi
10 min read · Jun 17, 2024


This is the second part of this series on open problems in decentralized AI infrastructure. The first part was about verifiable computations and the third part (to be published) will be about data, ownership and privacy.

We now discuss issues related to decentralized system design such as communication, resource allocation and incentives.

Although machine learning computations are highly parallelizable, and hence a good candidate for decentralization, decentralized training comes with many problems.

First, as with any distributed system, communication overhead is a major issue that usually leads to higher latency compared to centralized setups.

Second, in realistic scenarios, the compute nodes can have varying compute and network capabilities. Efficiently leveraging and load balancing across such heterogeneous workers is challenging and impacts the overall training time.

Lastly, there is the issue of how to design incentives that drive the correct type of behavior from all the actors in the system.

Communication

Latency is a significant challenge in decentralized computing, and even more so for ML workloads. The decentralized nature of the system introduces communication overhead, leading to slower processing times compared to centralized setups. For example, Gensyn found that (on CPU) their decentralized training protocol adds a 46% time overhead compared to native training (for MNIST image classification). However, this overhead comes in large part from their verification mechanism, which would not be necessary in a trusted setting.

Aside from the verification overhead, there are techniques that can significantly reduce latency compared to centralized setups; however, they come with their own set of challenges. Edge computing, for instance, could be used for inferences that require lighter computation but lower latency. This decentralized approach brings computation closer to the data source and end user, enabling faster and more responsive processing. By leveraging geographically distributed edge nodes, edge computing can reduce data transmission round-trip time (by up to 16.7% in some measurements), which is particularly relevant for inference workloads that require low latency for a good user experience.

However, realizing the full potential of decentralized edge computing comes with technical challenges. Ensuring network reliability, addressing bandwidth limitations, and handling intermittent connectivity are crucial for seamless edge computing operations. On the other hand, proximity to the data and users could also allow nodes to operate independently, even during sporadic internet outages, providing a reliable and uninterrupted experience.

Managing and storing large volumes of data generated at the edge is challenging due to the limited storage and computational capacity of edge devices. Edge devices come in various forms (e.g., smartphones, IoT) each with different hardware and software configurations. Managing this heterogeneity and ensuring interoperability is complex, especially due to the lack of common standards. Interoperability between devices is also a significant challenge for other types of decentralized computing. To address these challenges and open problems, ongoing research and development efforts are required.

Projects like PETALS and StableHorde are already exploring novel approaches to decentralized inference and model serving at the edge. PETALS uses model parallelization, meaning that the model is spread across a few nodes, with each node holding different layers of the model. PETALS uses quantization (a technique we discussed in the first part of this series as a solution to non-determinism) both to reduce the size of the model and to lower communication overhead by compressing the hidden states exchanged between nodes (using dynamic blockwise quantization). For example, it reduces the number of servers needed for BLOOM-176B from 44 to 22, cuts latency in half and halves the bandwidth requirements without noticeably affecting generation quality. This primarily helps with memory, which is usually a bigger bottleneck than computation for inference workloads. As in edge computing, each client prioritizes connecting to nearby servers to minimize network latency in each communication round.
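To make the compression idea concrete, here is a minimal sketch of blockwise quantization of a hidden-state tensor, written with NumPy. It is a simplified stand-in for the dynamic blockwise quantization PETALS actually uses; the block size and the int8 scaling are illustrative choices, not PETALS' exact parameters.

```python
import numpy as np

def blockwise_quantize(x, block_size=64):
    """Quantize a float tensor to int8 in fixed-size blocks.

    Each block stores its own absolute-max scale, so an outlier in one
    block does not degrade the precision of the others.
    """
    flat = x.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-8
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, x.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    flat = (q.astype(np.float32) / 127) * scales
    flat = flat.ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

# A hidden-state tensor (batch x hidden_dim) is quantized before being sent
# to the next server, cutting the bytes on the wire roughly 4x compared to
# float32 (minus the small per-block scales).
hidden = np.random.randn(4, 1024).astype(np.float32)
q, scales, shape, pad = blockwise_quantize(hidden)
recovered = blockwise_dequantize(q, scales, shape, pad)
print("max reconstruction error:", np.abs(hidden - recovered).max())
```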

For training and fine-tuning, data parallelism and model parallelism are the general approaches. Pipeline parallelism was introduced to reduce the communication complexity (and especially to circumvent an all-to-all form of communication). Instead of allocating to each device a subset of every layer to compute, each device is assigned one or several layers of the model, so that a node only needs to communicate with the nodes holding adjacent layers.
One issue with pipeline parallelism is that the node with the weakest compute or network capabilities slows down the whole pipeline, stalling nodes with higher capabilities. This problem is specific to heterogeneous networks where nodes may also crash. To solve this issue, the SWARM protocol introduces temporary dynamic pipelines in which each node can potentially interact with multiple predecessors or successors to maximize utilization (i.e., nodes are not left waiting on slower peers). This also allows the system to handle crash faults. The figure below illustrates the SWARM protocol. Nodes within a stage then run an All-Reduce protocol to average the results of their Stochastic Gradient Descent (SGD) steps and perform the optimization.

Figure from the SWARM paper: “An overview of SWARM parallelism, illustrating both normal operation, device failures and adaptive rebalancing. One of the workers at stage 2 leaves; another peer from stage 3 takes its place by downloading the latest stage 2 parameters and statistics from peers.”
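As a rough illustration of the dynamic-wiring idea (not SWARM's actual implementation), the sketch below routes each microbatch through whichever peers of the next stage are currently alive, choosing faster peers more often. The stage layout, peer names and throughput numbers are all made up.

```python
import random

class Stage:
    """A pipeline stage served by several interchangeable peers."""

    def __init__(self, peers):
        # peers: {peer_id: measured throughput in samples/s}
        self.peers = dict(peers)

    def pick_peer(self):
        # Temporary, randomized wiring: faster peers are chosen more often,
        # so a slow or flaky worker never stalls the whole pipeline.
        ids = list(self.peers)
        weights = [self.peers[p] for p in ids]
        return random.choices(ids, weights=weights, k=1)[0]

    def drop_peer(self, peer_id):
        # Crash handling: remove a failed peer and keep routing to the rest.
        self.peers.pop(peer_id, None)

# A hypothetical 3-stage pipeline with heterogeneous workers.
pipeline = [
    Stage({"a1": 10.0, "a2": 4.0}),
    Stage({"b1": 8.0, "b2": 8.0, "b3": 2.0}),
    Stage({"c1": 6.0}),
]

def route_microbatch(batch_id):
    # Each microbatch gets its own temporary path through the swarm.
    return [stage.pick_peer() for stage in pipeline]

print(route_microbatch(0))
pipeline[1].drop_peer("b3")  # a worker crashed; routing simply continues
print(route_microbatch(1))
```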

We note that decentralized optimization algorithms that operate asynchronously with unreliable workers introduce new theoretical questions in terms of their convergence behavior compared to centralized SGD. Providing tight theoretical convergence guarantees is an open problem.

Other techniques such as ZeRO-Offload can be used to optimize GPU usage by offloading data and compute to the CPU.
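To give an idea of what offloading looks like in practice, here is a minimal PyTorch sketch of the general idea (not DeepSpeed's implementation): the forward and backward passes run on the accelerator, while the optimizer state and the update step stay in host (CPU) memory.

```python
import torch

# Minimal sketch of the offloading idea: GPU does forward/backward,
# CPU holds the Adam moments and performs the parameter update.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)

# The optimizer works on CPU copies of the parameters, so its state
# (which for Adam is twice the model size) never touches GPU memory.
cpu_params = [p.detach().cpu().clone().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_params, lr=1e-3)

def training_step(batch):
    out = model(batch.to(device))
    loss = out.pow(2).mean()   # dummy loss for illustration
    loss.backward()
    # Move gradients to the CPU copies and run the optimizer there.
    for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
        cpu_p.grad = gpu_p.grad.detach().cpu()
    optimizer.step()
    optimizer.zero_grad()
    # Copy the updated parameters back to the accelerator.
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
            gpu_p.copy_(cpu_p.to(device))
            gpu_p.grad = None
    return loss.item()

print(training_step(torch.randn(8, 1024)))
```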

Allocation

One of the main challenges in decentralized machine learning is the allocation of compute and storage. We want to efficiently match inference or training requests with available nodes while ensuring fair allocation, preventing bottlenecks, and optimizing for low latency and compute cost.

In PETALS, the allocation of work happens through a load balancing mechanism and a distributed hash table (DHT). The system tries to distribute the transformer blocks evenly across the available servers. The goal is to maximize the total throughput of the model by identifying any bottlenecks and assigning servers to the Transformer blocks with the worst performance.

Servers periodically announce which Transformer blocks they’re currently handling through a DHT. When a new server joins, it checks the DHT to find which blocks are most needed and claims that range for itself. After a server has picked its blocks, it broadcasts its own network and compute throughput info to the DHT.

Since servers can drop out or crash at any time, all the nodes in the system regularly check whether reshuffling the block assignments would significantly improve the overall throughput. If it would, they reassign blocks until the throughput is as high as possible. If there’s a sudden exodus of servers handling certain blocks, this reallocation process quickly redistributes the remaining servers to plug the holes. So, in essence, the system is always working to balance the load and keep the throughput as high as possible.
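A toy version of this block-assignment choice might look like the sketch below: given coverage counts gathered from DHT announcements, a joining (or rebalancing) server claims the contiguous window of blocks that is currently least covered. The real PETALS heuristic also accounts for measured throughput; this simplification only counts servers.

```python
def pick_block_range(coverage, capacity):
    """Greedy choice a joining server could make from DHT announcements.

    coverage: list where coverage[i] is the number of servers currently
    holding transformer block i.  capacity: how many consecutive blocks
    this server can fit in memory.  Returns the (start, end) window with
    the thinnest coverage, i.e. the current throughput bottleneck.
    """
    n = len(coverage)
    capacity = min(capacity, n)
    best_start, best_load = 0, float("inf")
    for start in range(n - capacity + 1):
        load = sum(coverage[start:start + capacity])
        if load < best_load:
            best_start, best_load = start, load
    return best_start, best_start + capacity

# 12-block model; blocks 6-9 are barely covered, so a new server with room
# for 4 blocks claims that range and announces it to the DHT.
coverage = [3, 3, 2, 2, 2, 2, 1, 0, 0, 1, 2, 3]
print(pick_block_range(coverage, capacity=4))  # -> (6, 10)
```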

The DHT acts as the central bulletin board that lets servers coordinate and divvy up the work. Clients then need to efficiently find the fastest sequence of servers for model execution. For inference, clients ping nearby servers to measure latency and use beam search to find the optimal path. If a server fails, clients remove it and rerun the routing protocol to find a replacement; during inference, they send the previous inputs to the new server so it can rebuild the attention keys and values.
This approach works well for a system that serves a limited number of models but does not scale very well as the number of nodes and models increases.
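The sketch below shows a toy version of this routing step: each candidate server advertises a contiguous block range and an estimated per-request cost (which, for a real client, would include measured latency), and a small beam search looks for a cheap chain of servers that covers every block. The servers and costs are invented for illustration.

```python
import heapq

# Each candidate server covers a contiguous range of blocks and has an
# estimated cost per request (latency + compute time).  Made-up numbers.
servers = [
    {"id": "s1", "start": 0, "end": 6,  "cost": 0.09},
    {"id": "s2", "start": 0, "end": 4,  "cost": 0.05},
    {"id": "s3", "start": 4, "end": 8,  "cost": 0.04},
    {"id": "s4", "start": 6, "end": 12, "cost": 0.07},
    {"id": "s5", "start": 8, "end": 12, "cost": 0.06},
]
NUM_BLOCKS = 12

def find_chain(beam_width=3):
    """Beam search for a low-cost chain of servers covering all blocks."""
    # Each beam entry: (total_cost, blocks_covered_so_far, path_of_server_ids)
    beam = [(0.0, 0, [])]
    while True:
        done = [b for b in beam if b[1] == NUM_BLOCKS]
        if done:
            return min(done)  # lowest-cost complete chain found so far
        candidates = []
        for cost, covered, path in beam:
            for s in servers:
                # The next server must start at or before the frontier
                # and extend the covered prefix.
                if s["start"] <= covered < s["end"]:
                    candidates.append((cost + s["cost"], s["end"], path + [s["id"]]))
        if not candidates:
            raise RuntimeError("no chain covers every block")
        beam = heapq.nsmallest(beam_width, candidates)

print(find_chain())  # e.g. (0.15, 12, ['s2', 's3', 's5'])
```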

For fine-tuning and training, SWARM uses an Interleaved Weighted Round-Robin scheduler for assigning data, together with a DHT that is updated every few minutes to account for leaving and joining nodes. In Interleaved Weighted Round-Robin, each node is assigned a weight proportional to its compute and memory capacity and, as the name suggests, the scheduler assigns work to each node in a round-robin fashion, proportionally to its weight. However, to avoid bursty service, instead of assigning all tasks to a single worker before moving to the next, the scheduler interleaves the assignments. It assigns a small batch of tasks to each worker in the current iteration, then moves to the next worker in the circular order, and so on, until all tasks or data are assigned.
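A minimal sketch of interleaved weighted round-robin assignment (not SWARM's scheduler code) could look as follows; the worker names and weights are hypothetical.

```python
def interleaved_wrr(workers, tasks, chunk=1):
    """Interleaved weighted round-robin assignment.

    workers: {name: weight}, weight proportional to compute/memory capacity.
    Instead of giving each worker its whole share at once (which is bursty),
    we cycle through workers, handing out small chunks proportional to weight.
    """
    min_w = min(workers.values())
    # Number of chunks each worker receives per full cycle.
    per_cycle = {w: max(1, round(weight / min_w)) for w, weight in workers.items()}
    assignment = {w: [] for w in workers}
    it = iter(tasks)
    exhausted = False
    while not exhausted:
        for w, n in per_cycle.items():
            for _ in range(n * chunk):
                task = next(it, None)
                if task is None:
                    exhausted = True
                    break
                assignment[w].append(task)
            if exhausted:
                break
    return assignment

# Hypothetical heterogeneous workers: "fast" has twice the capacity of "slow".
workers = {"fast": 2.0, "medium": 1.5, "slow": 1.0}
print(interleaved_wrr(workers, list(range(12))))
```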

In Horde AI, each request is queued and prioritized according to the requester’s amount of kudos, a token specific to the Horde AI network. Once ordered, requests are sent to the first available node.
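As a toy illustration of kudos-based ordering (not Horde's actual code), a priority queue keyed on kudos might look like this:

```python
import heapq
import itertools

class KudosQueue:
    """Toy priority queue in the spirit of Horde AI's kudos ordering.

    Requests from users with more kudos are served first; ties are broken
    by arrival order.  Purely illustrative.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, user, kudos, request):
        # heapq is a min-heap, so negate kudos to pop the highest first.
        heapq.heappush(self._heap, (-kudos, next(self._counter), user, request))

    def next_request(self):
        if not self._heap:
            return None
        _, _, user, request = heapq.heappop(self._heap)
        return user, request

q = KudosQueue()
q.submit("alice", kudos=120, request="generate image #1")
q.submit("bob", kudos=500, request="generate image #2")
q.submit("carol", kudos=120, request="generate image #3")
print(q.next_request())  # bob's request is handled first
```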

Another approach to assigning workloads, suggested by SAKSHI, is to use a set of scheduler nodes (which are called routers in their paper) that perform load balancing.

Schedulers maintain the following attributes about each server: its location, hardware capacity, request load (the number of requests the provider is currently serving), model capacity (i.e., the set of AI models that the server can provide inference for), and potentially content moderation rules (e.g., NSFW). However, it is important to note that these attributes are not always provably verifiable. Location can be loosely verified using some form of proof-of-location, e.g., random pings or a decentralized coordinate system.

For attributes that cannot be directly verified, a reputation system can be implemented to reflect whether a compute node performs in accordance with its claimed attributes. Schedulers would then also maintain local reputation scores for each server. Developing proof systems for these other attributes remains an open problem.

The scheduler is trusted with two main tasks. First, it should maintain a correct list of nodes and their attributes, although, as already noted, some attributes are not reliably measurable. Hence, the scheduler should maintain correct attributes to the best of its knowledge. Second, it should route the requests to the best possible nodes according to criteria such as location (to minimize latency) and cost.
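As an illustration of the routing task, here is a toy scoring function a scheduler could use to pick a compute node. The attribute names, weights and example servers are all hypothetical rather than taken from SAKSHI.

```python
from dataclasses import dataclass

@dataclass
class Server:
    # Attributes a scheduler might track per compute node (illustrative names).
    id: str
    region: str
    models: set
    load: int          # requests currently being served
    capacity: int      # max concurrent requests
    price: float       # cost per request
    reputation: float  # local, scheduler-maintained score in [0, 1]

def route(servers, model, user_region, w_price=1.0, w_distance=0.5):
    """Pick the 'best' eligible server for a request (lower score = better)."""
    def score(s):
        distance_penalty = 0.0 if s.region == user_region else 1.0
        load_ratio = s.load / s.capacity
        # Down-weight nodes whose claimed attributes have not held up.
        return (w_price * s.price + w_distance * distance_penalty + load_ratio) / max(s.reputation, 0.05)

    eligible = [s for s in servers if model in s.models and s.load < s.capacity]
    return min(eligible, key=score) if eligible else None

servers = [
    Server("n1", "eu-west", {"llama-70b"}, load=3, capacity=8, price=0.4, reputation=0.9),
    Server("n2", "us-east", {"llama-70b", "sdxl"}, load=1, capacity=8, price=0.3, reputation=0.95),
    Server("n3", "eu-west", {"sdxl"}, load=0, capacity=4, price=0.2, reputation=0.8),
]
print(route(servers, model="llama-70b", user_region="eu-west").id)
```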

Scheduler nodes could, however, collude with some compute nodes and route more jobs toward them, even if they are not the best option in terms of location and/or price. This impacts the fairness of the protocol and its performance, as it could increase latency and costs for users. For the well-being of a decentralized system, especially one with financial rewards, fairness is an important property that schedulers should maintain. Schedulers shouldn’t unfairly prioritize certain nodes over others, i.e., it shouldn’t always be the same nodes that get the jobs. We could incorporate verifiable decentralized randomness to help ensure enough rotation between servers with similar attributes (similarly to how leader election works in proof-of-stake systems); however, this is not trivial to design and remains an open problem.
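One simplified way such rotation could be made auditable is to derive the choice among comparable servers from a public random value, so anyone can recompute which node should have been picked. The sketch below is only a gesture at the idea; real designs (e.g., proof-of-stake leader election) are considerably more involved.

```python
import hashlib

def select_server(candidates, beacon, request_id):
    """Auditable selection among servers with comparable attributes.

    candidates: ids of servers whose price/latency are close enough to be
    interchangeable.  beacon: a public random value (e.g., from a randomness
    beacon or a recent block hash).  Because the choice is deterministic
    given public inputs, a scheduler that always favors one node is detectable.
    """
    def ticket(server_id):
        seed = f"{beacon}:{request_id}:{server_id}".encode()
        return hashlib.sha256(seed).hexdigest()
    return min(candidates, key=ticket)

candidates = ["node-a", "node-b", "node-c"]
for request_id in range(5):
    print(request_id, select_server(candidates, beacon="0xabc123", request_id=request_id))
```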

An easier approach is a reputation system, whether centralized or peer-to-peer (e.g., using some kind of web-of-trust architecture), where users and compute nodes can rate schedulers.
Another open problem is to design a fraud-proof system where malicious schedulers can be slashed directly on the blockchain. The main difficulty is creating proofs that attest to misbehavior and can be distinguished from, for example, bad latency. Furthermore, as noted in SAKSHI, compute nodes that observe malicious behavior from a scheduler could opt out of working with that scheduler, enforcing a fairer and more dynamic market for routing.

The figure below illustrates the (simplified) flow between users, who connect to a scheduler, which load balances their requests across the compute nodes, whose work is verified by challenger nodes (in the case of optimistic machine learning), with the blockchain used as a settlement layer.

Simplified diagram of flow between different actors

Another interesting approach to load balancing and job allocation, although in a centralized setting, is that of ZeroGPU, introduced by HuggingFace. Instead of dedicating a single GPU to each HuggingFace Space, ZeroGPU efficiently shares GPUs across multiple Spaces. It dynamically allocates and releases GPUs as needed, based on the workload demands of each Space. A similar approach could be employed in a decentralized network to ensure that GPUs are not “stuck” on a single application.

Incentives

In many decentralized systems, participants are motivated to contribute and behave honestly through various incentive designs. While some systems rely on altruistic behavior (e.g., PETALS or Tor) or non-financial incentives (e.g., Horde AI), incorporating financial incentives is a popular approach to attract more participants and encourage honest participation in the network.

Financial incentives seem quite natural in the case of a decentralized marketplace for ML compute. Rewards are designed to incentivize various actors within the system to fulfill their respective roles effectively. Compute nodes are incentivized to perform computations correctly and are compensated for their expenses. In the case of optimistic machine learning (that we discussed in the first part of this series), challenger nodes are incentivized to verify computations and, if necessary, initiate an on-chain challenge. Scheduler nodes are incentivized to perform the routing and allocation of tasks efficiently.

Slashing mechanisms are implemented to punish certain undesirable behaviors within the network such as compute nodes performing incorrect computations; challenger nodes spamming the network with baseless challenges; schedulers denying service to clients or favoring a specific set of compute nodes, which would result in other nodes not receiving any work or rewards. As we have already noted, for slashing to be implementable it is important to be able to create concrete proofs of misbehavior, which is not always trivial.

Nodes in the network can receive rewards and compensation from two primary sources, possibly used in combination: payments from the users requesting the compute work, and newly minted tokens, as in proof-of-work or proof-of-stake minting mechanisms.
SAKSHI proposed the use of unidirectional payment channels for decentralized inferences, as they consist of direct, repeated interactions between users and the compute nodes that execute their tasks. In the optimistic case, where everyone behaves honestly, only a single settlement transaction should be recorded on the blockchain.
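A toy model of such a channel is sketched below: the user locks a deposit, sends monotonically increasing off-chain vouchers as requests are served, and the compute node settles on-chain once with the latest voucher. This is not SAKSHI's construction; signatures are simulated with HMAC purely for brevity, where a real channel would rely on on-chain signature verification.

```python
import hashlib
import hmac

USER_KEY = b"user-secret-key"   # stands in for the user's signing key

def sign_voucher(channel_id, cumulative_amount):
    msg = f"{channel_id}:{cumulative_amount}".encode()
    return hmac.new(USER_KEY, msg, hashlib.sha256).hexdigest()

def verify_voucher(channel_id, cumulative_amount, sig):
    return hmac.compare_digest(sig, sign_voucher(channel_id, cumulative_amount))

deposit = 100          # locked on-chain when the channel is opened
price_per_request = 3
paid = 0
latest_voucher = None

for request in range(5):               # five inference requests served
    paid += price_per_request
    assert paid <= deposit, "channel exhausted, open a new one"
    # Off-chain: each voucher commits to the total paid so far.
    latest_voucher = (paid, sign_voucher("chan-42", paid))

# The compute node checks each voucher as it arrives (only the latest matters).
amount, sig = latest_voucher
assert verify_voucher("chan-42", amount, sig)

# On-chain: a single settlement transaction pays `amount` to the compute node
# and refunds `deposit - amount` to the user.
print(f"settle on-chain: pay {amount}, refund {deposit - amount}")
```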

Furthermore, a fee on the final transaction could go to the scheduler, incentivizing an open and fair market for schedulers. Users may choose the scheduler they connect to based on location, reputation, UI or fees.

Pricing of requests could work similarly to the Ethereum gas mechanism: each basic operation corresponds to a specific amount of gas. Multidimensional approaches could also be applicable here. However, designing such a fee mechanism is still an open problem.
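As a toy example of what a gas-style fee schedule for inference could look like (all operation costs below are invented, not taken from any existing system):

```python
# A toy gas-style fee schedule for inference requests.  A real mechanism
# would have to price compute, memory and bandwidth separately
# (a multidimensional fee market).
GAS_COSTS = {
    "input_token": 1,      # prefill work per prompt token
    "output_token": 4,     # each generated token costs a forward pass
    "model_load": 5_000,   # one-off cost if the model must be loaded
}

def request_gas(n_input_tokens, n_output_tokens, cold_start=False):
    gas = n_input_tokens * GAS_COSTS["input_token"]
    gas += n_output_tokens * GAS_COSTS["output_token"]
    if cold_start:
        gas += GAS_COSTS["model_load"]
    return gas

def request_fee(gas_used, base_fee_per_gas, tip_per_gas=0):
    # EIP-1559-style split: a base fee plus an optional priority tip.
    return gas_used * (base_fee_per_gas + tip_per_gas)

gas = request_gas(n_input_tokens=512, n_output_tokens=256, cold_start=True)
print(gas, request_fee(gas, base_fee_per_gas=0.001, tip_per_gas=0.0002))
```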

Designing an incentive scheme that appropriately rewards and slashes compute nodes, schedulers and challengers (i.e., verifier nodes in the “optimistic approach”) while keeping fees competitive for users is currently an open problem.

Another issue is that the high volatility of crypto prices may be an undesirable property for many users; stablecoins could come into play here to solve this issue. Minted coins, on the other hand, can help bootstrap a new network, attracting more compute nodes from the start of the platform.

I would like to thank Alfonso de la Rocha and Mahmoud Shehata for useful discussions on this topic and Jorge Soares, Alexander Hicks and Sergey Fedorov for feedback on the first draft of this post.
