AllReduce is a communication operation in distributed machine learning that synchronizes data, such as model parameters or gradients, across multiple devices. It combines two steps: reduction, which aggregates the contributions from all participating devices (typically by summing), and broadcast, which distributes the aggregated result back so that every device holds the same updated data. In data-parallel training, this is how workers keep their model replicas consistent after each step.
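As a rough illustration of these semantics, here is a minimal Python sketch that mimics a summing AllReduce over in-memory lists; it is not an actual collective-communication implementation, and the function name `all_reduce_sum` and the per-device gradient lists are illustrative assumptions:

```python
def all_reduce_sum(device_grads):
    """Simulate AllReduce: reduce (sum) then broadcast to every device."""
    # Reduction step: element-wise sum of every device's local gradient.
    reduced = [sum(vals) for vals in zip(*device_grads)]
    # Broadcast step: every device receives the same aggregated result.
    return [list(reduced) for _ in device_grads]

# Three devices, each holding a local gradient vector.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(all_reduce_sum(grads))  # every device now holds [9.0, 12.0]
```

In a real system the reduction and broadcast are overlapped and pipelined across the network (e.g., ring or tree schedules) rather than computed in one place, which is why the routing of these transfers matters for performance.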

Posts

Accelerating Distributed Machine Learning with an Efficient AllReduce Routing Strategy

We propose an efficient routing strategy for AllReduce transfers, which comprise the dominant traffic in machine-learning-centric datacenters, to achieve fast parameter synchronization in distributed machine learning, reducing average training time by 9%.