Accelerating Distributed Machine Learning with AllReduce Reconfiguration Based on Optical Circuit Switching
Publication Date: 7/1/2025
Event: OECC/PSC 2025
Reference: TuG3-5: 1-3, 2025
Authors: Zilong Ye, NEC Laboratories America, Inc., California State University; Philip N. Ji, NEC Laboratories America, Inc.; Ting Wang, NEC Laboratories America, Inc.
Abstract: We propose to apply optical circuit switching to enable dynamic AllReduce reconfiguration for accelerating distributed machine learning. With simulated annealing-based optimization, the proposed AllReduce reconfiguration approach reduces average training time by 31% compared with existing solutions.
Publication Link: https://ieeexplore.ieee.org/document/11110615