Description
In this talk we will cover lessons learned along our almost year-and-a-half journey scaling up the WatsonX.AI stack for foundation model pretraining. Starting from 100M parameters on bare metal, we scaled PyTorch training to 20B parameters on cloud-based multi-node systems. We'll discuss the challenges encountered along the way, as well as the solutions we employed. This includes working with the PyTorch team to field-test the Fully Sharded and Hybrid Sharded Data Parallel strategies (FSDP/HSDP), as well as handling the associated communication vs. computation bottlenecks, which are not always straightforward. We'll also review our collaboration on cloud-native distributed checkpointing and our development of a stateful, scalable distributed dataloader, which lets us restart unstable jobs mid-epoch without revisiting stale data. Finally, we'll cover ongoing and upcoming challenges, such as maintaining job stability and integrating tensor parallelism.
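
As a taste of the sharding discussion, here is a minimal sketch, assuming only PyTorch's public FSDP API and not our actual training code, of wrapping a model with either full or hybrid sharding; `build_model` is a hypothetical placeholder for the real architecture.

```python
# Minimal sketch: choosing between FSDP (full shard) and HSDP (hybrid shard).
# `build_model` is a hypothetical stand-in, not the talk's actual model code.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy


def wrap_for_sharded_training(build_model, hybrid: bool = True):
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # FULL_SHARD shards parameters, gradients, and optimizer state across all
    # ranks; HYBRID_SHARD shards within each node and replicates across nodes,
    # reducing cross-node communication at the cost of extra memory.
    strategy = ShardingStrategy.HYBRID_SHARD if hybrid else ShardingStrategy.FULL_SHARD
    return FSDP(
        build_model().cuda(),
        sharding_strategy=strategy,
        use_orig_params=True,
    )
```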