Training Program: 'Scaling Deep Learning: Strategies for Distributed Training and Inference'

Over the past decade, deep learning has revolutionized numerous applications, driven by a significant increase in model size and complexity. This growth has led to notable improvements in performance but also heightened computational demands. Modern deep learning models are highly data-intensive, and their iterative training processes require substantial computational resources. High-performance computing (HPC) clusters and cloud platforms, equipped with specialized hardware such as GPUs, have become essential for training and deploying these large-scale models. This lecture will explore the key methodologies of distributed deep learning, which enables the parallelization of training and unlocks the full potential of large models. We will discuss the primary techniques employed in distributed training, followed by a hands-on session where participants will gain practical experience with relevant code examples.

The lecture will begin by addressing the fundamental motivations behind the increasing importance of distributed deep learning, particularly in the context of scaling modern models. It will explore the primary strategies that enable distributed training, with a focus on data parallelism and model parallelism. These approaches are essential for enhancing both the efficiency and scalability of training large models.
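To make the distinction concrete, the snippet below is a minimal PyTorch sketch (not part of the course material) of model parallelism, in which different parts of a network live on different GPUs and activations are moved between them. Data parallelism, by contrast, replicates the whole model on every GPU and splits the batch, which is what the DDP hands-on example automates. The class and layer sizes here are purely illustrative and assume a node with at least two GPUs.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model-parallel module: the first half lives on cuda:0,
    the second half on cuda:1; activations are moved between devices."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))      # first half runs on GPU 0
        return self.part2(x.to("cuda:1"))   # activations hop to GPU 1

# Requires at least two visible GPUs; the output tensor resides on cuda:1.
model = TwoGPUModel()
out = model(torch.randn(32, 1024))
```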

Subsequently, the session will examine the ecosystem of libraries and tools available to support distributed training, highlighting key solutions for scaling deep learning workflows. The discussion will also extend beyond training to include inference optimization techniques—such as the use of adapters—for the efficient deployment of large models.
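As a rough illustration of the adapter idea mentioned above (a simplified, hypothetical sketch rather than the specific technique covered in the session): a small bottleneck module is trained and shipped per task, while the large backbone weights stay frozen and shared across deployments, which keeps the per-task footprint small at inference time. All names and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck adapter: only these few parameters are
    trained and stored per task, while the backbone remains frozen."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection

# Example: attach an adapter after a frozen backbone layer.
backbone_layer = nn.Linear(768, 768)
for p in backbone_layer.parameters():
    p.requires_grad = False  # the large backbone stays frozen

adapter = BottleneckAdapter(768)
x = torch.randn(4, 768)
y = adapter(backbone_layer(x))
```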

During the hands-on session, participants will gain practical experience using PyTorch’s Distributed Data Parallel (DDP) library. The demonstration will show how to distribute model training across multiple GPUs on an HPC cluster, offering a step-by-step guide to implementing distributed deep learning in a real-world environment.
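For orientation, the following is a minimal sketch of what a DDP training script typically looks like; it is illustrative only, and the actual hands-on material may differ. It assumes the script is launched with torchrun (e.g. `torchrun --nproc_per_node=4 train_ddp.py`), which sets the `LOCAL_RANK` environment variable; the model and data are toy placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Process group setup; rank and world size come from the launcher (torchrun).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data, standing in for a real training setup.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()   # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```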

Attendance is limited to 20 participants and requires prior registration.

Register here.

About the speaker

Rocco Sedona is a computational engineer at the Forschungszentrum Jülich (Jülich Research Centre) in Germany, where he is part of the supercomputing team dedicated to research in artificial intelligence applied to Earth observation. His work focuses on optimizing deep learning (DL) in high-performance computing (HPC) environments, aiming to improve the efficiency of processing large volumes of geospatial data.

In addition to his research work, Rocco Sedona is actively involved in the international scientific community. He has served as an organizer for leading conferences such as NeurIPS 2023 and ICML 2024, and has contributed to the scientific evaluation of AI-oriented funding programs at the Jülich Supercomputing Centre. His multidisciplinary approach and ability to connect different areas of knowledge position him as an emerging figure at the intersection of AI, high-performance computing, and remote sensing.