distributed machine learning Archives

Accelerating Distributed Machine Learning with an Efficient AllReduce Routing Strategy

September 23, 2024/in Publications/by NEC Labs America

We propose an efficient routing strategy for AllReduce transfers, which comprise the dominant traffic in machine learning-centric datacenters, to achieve fast parameter synchronization in distributed machine learning, improving the average training time by 9%.
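For readers unfamiliar with the collective itself, the following is a minimal sketch of what an AllReduce computes during parameter synchronization: every worker ends up with the element-wise sum of all workers' gradients. It simulates the well-known ring variant in a single process and does not represent the routing strategy proposed in the paper; the function name `ring_allreduce` and the toy gradient values are illustrative assumptions.

```python
def ring_allreduce(grads):
    """Simulate a ring AllReduce (sum) over per-worker gradient lists.

    grads[w] is worker w's local gradient vector. All vectors share one
    length, assumed here to be divisible by the number of workers.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must be divisible by worker count"
    chunk = size // n
    bufs = [list(g) for g in grads]  # each worker's private buffer

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker w sends chunk (w - s) mod n
    # to its ring neighbour w + 1, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(w, (w - s) % n) for w in range(n)]
        payloads = [[bufs[w][i] for i in span(c)] for w, c in sends]
        for (w, c), data in zip(sends, payloads):
            dst = (w + 1) % n
            for i, v in zip(span(c), data):
                bufs[dst][i] += v

    # Worker w now holds the fully reduced chunk (w + 1) mod n.
    # Phase 2, all-gather: in step s, worker w forwards chunk (w + 1 - s) mod n
    # to neighbour w + 1, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [(w, (w + 1 - s) % n) for w in range(n)]
        payloads = [[bufs[w][i] for i in span(c)] for w, c in sends]
        for (w, c), data in zip(sends, payloads):
            dst = (w + 1) % n
            for i, v in zip(span(c), data):
                bufs[dst][i] = v

    return bufs


# Toy gradients for three hypothetical workers.
local_grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
synced = ring_allreduce(local_grads)
```

Each worker sends 2(n - 1) chunks in total, so the traffic pattern is dominated by neighbour-to-neighbour transfers; how such transfers are routed through the datacenter network is the question the paper addresses.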

SplitBrain: Hybrid Data and Model Parallel Deep Learning

January 3, 2022/in Publications/by NEC Labs America

The recent success of deep learning applications has coincided with the wide availability of powerful computational resources for training sophisticated machine learning models on huge datasets. Nonetheless, training large models such as convolutional neural networks using model parallelism (as opposed to data parallelism) is challenging because the complex communication between model shards makes it difficult to partition the computation efficiently across multiple machines with an acceptable trade-off. This paper presents SplitBrain, a high-performance distributed deep learning framework supporting hybrid data and model parallelism. Specifically, SplitBrain provides layer-specific partitioning that co-locates compute-intensive convolutional layers while sharding memory-demanding layers. A novel, scalable group communication scheme is proposed to further improve training throughput with reduced communication overhead. The results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data- and model-parallel VGG on CIFAR-10.
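The memory effect of layer-specific partitioning can be illustrated with a rough sketch: convolutional layers are replicated on every worker, while memory-demanding layers (taken here to be fully connected ones) are split evenly across a group of workers. This is not SplitBrain's implementation; the `Layer` class, the `per_worker_params` helper, the layer sizes, and the group size are all hypothetical values chosen for illustration, and the resulting saving is unrelated to the figure reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str    # "conv" (replicated) or "fc" (sharded across the group)
    params: int  # parameter count; the values below are made up


def per_worker_params(layers, group_size):
    """Parameters stored on one worker under the hybrid scheme sketched here."""
    total = 0
    for layer in layers:
        if layer.kind == "fc":
            # Shard evenly across the group, rounding up for the last shard.
            total += -(-layer.params // group_size)
        else:
            # Compute-intensive layers stay whole on every worker.
            total += layer.params
    return total


# Hypothetical model: one convolutional block and two fully connected layers.
model = [
    Layer("conv_block", "conv", 2_000_000),
    Layer("fc1", "fc", 6_000_000),
    Layer("fc2", "fc", 2_000_000),
]

replicated = per_worker_params(model, group_size=1)  # pure data parallelism
hybrid = per_worker_params(model, group_size=2)      # fc layers split in two
saving = 1 - hybrid / replicated
```

With these made-up sizes, sharding the fully connected layers across two workers lowers the per-worker parameter count from 10,000,000 to 6,000,000. Larger groups save more memory but require more communication between shards, which is the trade-off the abstract refers to.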
