Dynamic Reconfiguration is the process of adaptively adjusting communication structures in distributed machine learning at runtime. Instead of relying on a fixed AllReduce topology, the system uses optical circuit switching to reconfigure network connections as training progresses. Guided by simulated annealing optimization, these changes respond to workload and traffic patterns to minimize communication bottlenecks. This flexible approach enables more efficient synchronization across nodes, reducing overall training time.
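The optimization step described above can be illustrated with a small sketch. The snippet below is a toy example of simulated annealing, not the paper's actual algorithm: it searches over ring orderings of nodes to minimize a hypothetical pairwise link-latency cost, standing in for the communication cost of a ring AllReduce. The function names (`ring_cost`, `anneal_ring`), the cost model, and all parameters are illustrative assumptions.

```python
import math
import random

def ring_cost(order, latency):
    """Total link latency of a ring AllReduce visiting nodes in `order`.

    `latency[a][b]` is an assumed pairwise link cost; the real system
    would derive costs from measured traffic and topology.
    """
    n = len(order)
    return sum(latency[order[i]][order[(i + 1) % n]] for i in range(n))

def anneal_ring(latency, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Simulated annealing over ring orderings (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(latency)
    order = list(range(n))
    cost = ring_cost(order, latency)
    best, best_cost = order[:], cost
    t = t0
    for _ in range(steps):
        # Neighbor move: swap the positions of two nodes in the ring.
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = ring_cost(order, latency)
        # Metropolis criterion: always accept improvements; accept
        # worse configurations with probability exp(-delta / t).
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
        t *= cooling  # geometric cooling schedule
    return best, best_cost

# Example: 4 nodes with an assumed symmetric latency matrix.
latency = [[0, 2, 9, 4],
           [2, 0, 3, 8],
           [9, 3, 0, 1],
           [4, 8, 1, 0]]
best, best_cost = anneal_ring(latency)
```

Accepting occasional uphill moves lets the search escape local minima; in a runtime setting, the resulting ordering would then be realized by reprogramming the optical circuit switch.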

Posts

Accelerating Distributed Machine Learning with AllReduce Reconfiguration Based on Optical Circuit Switching

We propose to apply optical circuit switching to enable dynamic AllReduce reconfiguration for accelerating distributed machine learning. With simulated annealing-based optimization, the proposed AllReduce reconfiguration approach achieves 31% less average training time than existing solutions.