
OpenAI's MRC Protocol: Solving the Networking Bottleneck in AI Supercomputer Training

Last updated: 2026-05-07 16:46:28

Training cutting-edge artificial intelligence models isn't just about raw computational power—it's increasingly about how efficiently data moves between thousands of GPUs. OpenAI recognized this challenge and developed MRC (Multipath Reliable Connection), a new open networking protocol designed specifically for large-scale AI supercomputer clusters. Built in collaboration with industry giants like AMD, Broadcom, Intel, Microsoft, and NVIDIA, MRC was released through the Open Compute Project (OCP) to benefit the broader community. This Q&A explores what MRC does, why networking matters in AI training, and how its three core mechanisms—especially adaptive packet spraying—keep clusters running at peak performance.

What exactly is MRC and why did OpenAI create it?

MRC stands for Multipath Reliable Connection, a networking protocol that optimizes data transfers within AI supercomputers. OpenAI created it because training large AI models involves millions of data exchanges per step, and even a single delayed transfer can cascade into widespread GPU idle time. With over 900 million weekly ChatGPT users, every second of idle GPU capacity represents real cost and capability loss. MRC tackles this by ensuring the network delivers predictable performance, even when failures or congestion occur. The goal is not just speed, but reliability—keeping training jobs moving smoothly. Developed over two years with partners like AMD, Broadcom, Intel, Microsoft, and NVIDIA, MRC is now open for anyone to use through the Open Compute Project.


Why is networking the hidden bottleneck in AI training?

AI model training involves breaking tasks into tiny chunks distributed across thousands of GPUs. These GPUs constantly need to share intermediate results—like gradients—through enormous data transfers inside the supercomputer. A single training step can require millions of such transfers. When a packet arrives late due to network congestion, a link failure, or device trouble, it creates a ripple effect: GPUs sit idle waiting for the missing data, slowing the entire job. As clusters grow larger, these issues become more frequent and harder to manage. That's why OpenAI focused on reducing jitter and delays, turning networking from a weak link into a reliable backbone for massive AI workloads.
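To make the ripple effect concrete, here is a toy simulation in Python (all numbers are illustrative assumptions, not measurements from OpenAI's clusters): a synchronous step finishes only when its slowest transfer does, so as the number of transfers grows, even a rare delay becomes a near-certain straggler.

```python
import random

# Toy model of a synchronous training step: the step finishes only when
# ALL of its gradient transfers finish, so step time is set by the
# slowest transfer (the tail). All numbers here are illustrative
# assumptions, not measurements from any real cluster.

def step_time_ms(num_transfers: int, base_ms: float = 2.0,
                 delay_prob: float = 1e-5, delay_ms: float = 50.0) -> float:
    """Time for one step: the max over all transfer completion times."""
    slowest = base_ms
    for _ in range(num_transfers):
        t = base_ms + (delay_ms if random.random() < delay_prob else 0.0)
        slowest = max(slowest, t)
    return slowest

random.seed(0)
# At a million transfers per step, even a 1-in-100,000 chance of a
# delayed packet makes a straggler near-certain, and every GPU waits.
print(f"1,000 transfers:     {step_time_ms(1_000):.1f} ms")
print(f"1,000,000 transfers: {step_time_ms(1_000_000):.1f} ms")
```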

How does MRC extend existing protocols like RoCE?

MRC builds on RDMA over Converged Ethernet (RoCE), an industry standard that lets one GPU access another's memory directly over Ethernet, bypassing the CPU for maximum throughput. But RoCEv2 traditionally sends all packets of a transfer along a single path, which leads to congestion. MRC enhances RoCE with techniques from the Ultra Ethernet Consortium and introduces SRv6-based source routing (Segment Routing over IPv6). With SRv6, the sending machine encodes the exact route inside each packet header, so switches don't need complex routing calculations. This reduces processing load and saves power at data center scale. In essence, MRC takes RoCE's hardware-accelerated memory access and makes it reliable and efficient for the largest AI clusters.
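The sketch below illustrates the source-routing idea in miniature. It is a simplified model, not the SRv6 wire format: segments here are plain switch names rather than the 128-bit IPv6 segment identifiers real SRv6 carries, and the Packet class and switch names are hypothetical.

```python
from dataclasses import dataclass

# Minimal sketch of SRv6-style source routing, under the simplifying
# assumption that a segment is just a switch name. Real SRv6 carries
# 128-bit IPv6 segment identifiers in a Segment Routing Header.

@dataclass
class Packet:
    payload: bytes
    segments: list  # remaining hops, chosen entirely by the sender

def send(payload: bytes, path: list) -> Packet:
    # The sending machine encodes the full route in the packet header.
    return Packet(payload=payload, segments=list(path))

def forward(packet: Packet, switch: str) -> str:
    # A switch does no route computation: it consumes the next segment
    # and forwards toward whatever comes after it.
    assert packet.segments and packet.segments[0] == switch
    packet.segments.pop(0)
    return packet.segments[0] if packet.segments else "delivered"

pkt = send(b"gradients", ["leaf1", "spine3", "leaf7"])
hop = "leaf1"
while hop != "delivered":
    hop = forward(pkt, hop)  # each switch just pops and forwards
print("delivered with no per-switch routing lookups")
```

Because the route is fixed at the sender, switches stay simple and cheap to operate, which is where the processing and power savings at data-center scale come from.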

What is Adaptive Packet Spraying and how does it reduce congestion?

Instead of confining a data transfer to a single network path, MRC uses adaptive packet spraying (also called intelligent packet-spray load balancing): it spreads the packets of a transfer across hundreds of different paths simultaneously, so no single link becomes a bottleneck. Traditional RoCEv2 pinned all packets of a flow to one route from point A to point B, which often created hotspots. MRC continuously monitors path availability; if a particular route becomes unusable or congested, packets automatically shift to alternative paths. The result is congestion distributed evenly across the network fabric, keeping latency low and throughput consistent. This is critical for AI training, where even small delays compound across millions of transfers into GPU idle time and higher costs.
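Here is a minimal sketch of the spraying idea, assuming a hypothetical per-path congestion signal (in practice such a signal might come from ECN marks or measured round-trip times; the article does not detail MRC's actual mechanism). The Sprayer class, path names, and threshold are all illustrative.

```python
import itertools

# Minimal sketch of adaptive packet spraying: rotate traffic across
# many paths and steer around any path that reports congestion.

class Sprayer:
    def __init__(self, paths):
        self.paths = list(paths)
        self.congestion = {p: 0.0 for p in self.paths}  # 0 = idle, 1 = saturated
        self._rr = itertools.cycle(self.paths)

    def mark_congested(self, path, level):
        """Feed back a congestion estimate for a path (e.g. from ECN)."""
        self.congestion[path] = level

    def pick_path(self):
        """Round-robin over paths, skipping any that look congested."""
        for _ in range(len(self.paths)):
            p = next(self._rr)
            if self.congestion[p] < 0.8:  # illustrative threshold
                return p
        # All paths congested: fall back to the least-loaded one.
        return min(self.paths, key=self.congestion.get)

sprayer = Sprayer([f"path{i}" for i in range(8)])
sprayer.mark_congested("path2", 0.95)   # hotspot reported
sent = [sprayer.pick_path() for _ in range(16)]
assert "path2" not in sent              # traffic flows around the hotspot
print(sent)
```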

How does MRC handle failures to maintain training performance?

Network failures—like a broken link or a malfunctioning switch—are inevitable in clusters with thousands of devices. MRC is designed with failures in mind. Its packet-spraying mechanism inherently provides redundancy: if one path fails, packets simply take another. Moreover, MRC's SRv6 source routing allows rapid rerouting without recalculating tables at every switch. This means the network can maintain predictable performance even when components go down. OpenAI states that the goal is not just a fast network, but one that delivers consistent behavior despite faults. For AI training, this stability prevents the ripple effect that would otherwise stall GPU computations, saving time and money in large-scale operations.
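The sketch below models the recovery behavior described above, using a hypothetical ReliableMultipathSender class: the sender remembers which path carried each unacknowledged packet and, when a path is reported down, resends the stranded packets over surviving paths. Real implementations would do this in NIC hardware and far faster; this is only a sketch of the idea.

```python
# Minimal sketch of failure recovery with multipath sending. The class,
# its API, and the path names are hypothetical illustrations.

class ReliableMultipathSender:
    def __init__(self, paths):
        self._paths = list(paths)
        self.live_paths = set(paths)
        self.unacked = {}          # seq -> path it was sent on

    def send(self, seq):
        # Spray: rotate across the currently live paths.
        candidates = [p for p in self._paths if p in self.live_paths]
        path = candidates[seq % len(candidates)]
        self.unacked[seq] = path
        return path

    def ack(self, seq):
        self.unacked.pop(seq, None)

    def path_failed(self, path):
        """Reroute every unacked packet that was in flight on the dead path."""
        self.live_paths.discard(path)
        stranded = [s for s, p in self.unacked.items() if p == path]
        for seq in stranded:
            self.send(seq)  # retransmit over a surviving path
        return stranded

tx = ReliableMultipathSender(["path0", "path1", "path2"])
for seq in range(6):
    tx.send(seq)
tx.ack(0); tx.ack(1)
resent = tx.path_failed("path2")
print(f"resent {resent} on live paths; no switch table recomputation needed")
```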

Who collaborated on MRC and is it truly open?

MRC was developed through close collaboration between OpenAI and five major technology companies: AMD, Broadcom, Intel, Microsoft, and NVIDIA. Each contributed expertise in hardware, networking, and data-center operations. The specification was published through the Open Compute Project (OCP), an open-source hardware foundation that fosters community-driven innovation. By releasing MRC via OCP, OpenAI invites any organization to use, study, and improve the protocol. This open approach accelerates adoption and ensures that the benefits of MRC—reduced GPU idle time, lower power consumption, and more reliable AI training—can be realized across the industry, not just within OpenAI's own clusters.