AllReduce is a collective communication operation in distributed machine learning that synchronizes data, such as model parameters or gradients, across multiple devices. It combines two steps: reduction, which aggregates the data from all participating devices, and broadcasting, which distributes the aggregated result back so that every device holds the same updated values. This process is crucial for keeping parameters and gradients consistent across machines during training.
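The reduce-then-broadcast semantics can be sketched in a few lines; this is a minimal illustration of what AllReduce computes, not an actual distributed implementation (the device buffers and sum reduction are assumptions for the example).

```python
# Minimal sketch of AllReduce semantics: each "device" holds one
# gradient vector, represented here as a list of floats.
def allreduce(device_buffers):
    # Reduction: aggregate (sum) corresponding entries from all devices.
    reduced = [sum(vals) for vals in zip(*device_buffers)]
    # Broadcast: every device receives a copy of the same aggregated result.
    return [list(reduced) for _ in device_buffers]

# Example: three devices, each with a local gradient vector.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = allreduce(grads)
# After AllReduce, every device holds the same summed gradients [9.0, 12.0].
```

In practice this logical operation is realized by communication schedules such as ring or tree AllReduce, which spread the reduction and broadcast across the network links.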

Posts

Accelerating Distributed Machine Learning with AllReduce Reconfiguration Based on Optical Circuit Switching

We propose to apply optical circuit switching to enable dynamic AllReduce reconfiguration for accelerating distributed machine learning. With simulated annealing-based optimization, the proposed AllReduce reconfiguration approach achieves 31% less average training time than existing solutions.
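For context, a generic simulated-annealing loop looks like the sketch below; the cost function, neighbor move, and cooling schedule here are hypothetical stand-ins, not the paper's actual formulation for optimizing AllReduce configurations.

```python
import math
import random

# Generic simulated annealing: explore candidate states, always accept
# improvements, and accept worse states with probability exp(-delta/t)
# so the search can escape local minima while the temperature t cools.
def simulated_annealing(initial, cost, neighbor, t0=1.0, cooling=0.95, steps=1000):
    state, best = initial, initial
    t = t0
    for _ in range(steps):
        candidate = neighbor(state)
        delta = cost(candidate) - cost(state)
        if delta <= 0 or random.random() < math.exp(-delta / t):
            state = candidate
        if cost(state) < cost(best):
            best = state
        t *= cooling  # geometric cooling schedule
    return best

# Toy usage: minimize a 1-D quadratic as a stand-in for a configuration cost.
random.seed(0)
result = simulated_annealing(
    initial=10.0,
    cost=lambda x: (x - 3.0) ** 2,
    neighbor=lambda x: x + random.uniform(-1.0, 1.0),
)
```

In the reconfiguration setting, the state would instead encode a candidate optical circuit topology and the cost would reflect the resulting AllReduce completion time.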

Accelerating Distributed Machine Learning with an Efficient AllReduce Routing Strategy

We propose an efficient routing strategy for AllReduce transfers, which constitute the dominant traffic in machine learning-centric datacenters, to achieve fast parameter synchronization in distributed machine learning, improving the average training time by 9%.