GitHub's Reliability Journey: Key Questions and Answers

Following two recent incidents that impacted users, GitHub shared an update on their availability and ongoing efforts to improve reliability. This Q&A covers the key details from that announcement, including the driving forces behind increased demand, the technical challenges faced, and the short- and long-term steps being taken to ensure a more resilient platform.

Why did GitHub release an update about its availability?

GitHub wanted to provide transparency after two incidents that were unacceptable to both the company and its users. The update serves as an apology for the disruption and a detailed explanation of what went wrong, what has already been fixed, and what improvements are in progress. It also signals a shift in priorities: availability now comes first, followed by capacity, and then new features. By sharing specifics—such as the decision to scale from a 10X plan to a 30X plan—GitHub aims to build trust and keep the community informed about the complex work behind the scenes to make the platform more resilient.

Source: github.blog

What is GitHub doing to increase its capacity and reliability?

GitHub originally set a plan in October 2025 to increase capacity by 10X, with a focus on improving reliability and failover. By February 2026, it became clear that 10X was insufficient, and the goal shifted to designing for a future requiring 30X the current scale. This involves reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive code to more capable systems. The work is fundamentally about distributed systems engineering: reducing hidden coupling, limiting blast radius, and ensuring graceful degradation when one subsystem is under pressure.
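To make "graceful degradation" concrete, consider a cache that keeps serving slightly stale data when its backing store is in trouble. The Go sketch below is illustrative only (Go being the language GitHub says it is moving hot paths into); the types, names, and fetch function are made up for this example, not taken from GitHub's codebase:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// entry holds a cached value and when it was written.
type entry struct {
	value   string
	written time.Time
}

// degradingCache serves fresh values while the backing store is healthy
// and falls back to stale ones when it is not, so a struggling
// dependency degrades the experience instead of failing it outright.
type degradingCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]entry
	fetch func(key string) (string, error) // backing store, e.g. a database
}

func (c *degradingCache) Get(key string) (string, error) {
	c.mu.Lock()
	e, ok := c.items[key]
	c.mu.Unlock()

	if ok && time.Since(e.written) < c.ttl {
		return e.value, nil // fresh hit: no load on the backing store
	}

	v, err := c.fetch(key)
	if err != nil {
		if ok {
			return e.value, nil // degrade gracefully: serve stale data
		}
		return "", err // nothing cached; the failure is visible
	}

	c.mu.Lock()
	c.items[key] = entry{value: v, written: time.Now()}
	c.mu.Unlock()
	return v, nil
}

func main() {
	healthy := true
	c := &degradingCache{
		ttl:   50 * time.Millisecond,
		items: map[string]entry{},
		fetch: func(key string) (string, error) {
			if !healthy {
				return "", errors.New("database overloaded")
			}
			return "profile-for-" + key, nil
		},
	}

	fmt.Println(c.Get("octocat")) // fresh fetch from the backing store
	healthy = false
	time.Sleep(60 * time.Millisecond) // let the entry expire
	fmt.Println(c.Get("octocat"))     // stale, but still served
}
```

The design choice is deliberate: a slightly stale page is usually a better outcome than an error page, and every stale hit is one less query against an already overloaded database.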

What is driving the rapid increase in demand on GitHub's infrastructure?

The primary driver is a dramatic change in how software is built. Since the second half of December 2025, agentic development workflows have accelerated sharply. Nearly every metric points in the same direction: repository creation, pull request activity, API usage, automation, and large-repository workloads are all growing quickly. This exponential growth doesn't stress just one system; it affects multiple interconnected services simultaneously, compounding small inefficiencies into larger problems.

How does a single pull request stress multiple GitHub systems?

A pull request touches many subsystems: Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, even small inefficiencies compound. Queues deepen, cache misses turn into database load, indexes fall behind, retries amplify traffic, and one slow dependency can affect several product experiences at once. This interconnectedness means that bottlenecks in any single area can cascade, making reliability a whole-system challenge.
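One of those failure modes, retries amplifying traffic, has a well-known mitigation: capped exponential backoff with jitter, so that failing clients spread their retries out instead of hammering a slow dependency in synchronized waves. Below is a minimal, generic Go sketch of that technique, not GitHub's actual retry policy:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callWithBackoff retries a flaky dependency with capped, jittered
// exponential backoff so retries spread out over time instead of
// arriving in waves that amplify an outage.
func callWithBackoff(attempts int, call func() error) error {
	base := 100 * time.Millisecond
	cap := 2 * time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil {
			return nil
		}
		backoff := base << i // exponential growth per attempt
		if backoff > cap {
			backoff = cap
		}
		// Full jitter: sleep a random duration in [0, backoff).
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return err
}

func main() {
	calls := 0
	err := callWithBackoff(5, func() error {
		calls++
		if calls < 3 {
			return errors.New("dependency timed out")
		}
		return nil
	})
	fmt.Printf("succeeded after %d calls, err=%v\n", calls, err)
}
```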

What are GitHub's current priorities and technical approach to improving reliability?

GitHub's clear priorities are: availability first, then capacity, then new features. Technically, they are reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive paths into systems designed for these workloads. This is classic distributed systems work: reducing hidden coupling, limiting blast radius, and making GitHub degrade gracefully when one subsystem is under pressure. They are making quick progress but acknowledge these incidents show where there is still work to do.
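A standard building block for limiting blast radius is the circuit breaker: after enough consecutive failures, callers fail fast instead of piling more load onto an unhealthy subsystem. The following Go sketch is a textbook illustration of the pattern, not GitHub's implementation:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// breaker is a minimal circuit breaker: after too many consecutive
// failures it rejects calls outright for a cooldown period, shielding
// both callers and the struggling dependency from further load.
type breaker struct {
	mu       sync.Mutex
	failures int
	limit    int
	cooldown time.Duration
	openedAt time.Time
}

var errOpen = errors.New("circuit open: failing fast")

func (b *breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.limit && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return errOpen // reject without touching the dependency
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.limit {
			b.openedAt = time.Now() // (re)open the circuit
		}
		return err
	}
	b.failures = 0 // a success closes the circuit again
	return nil
}

func main() {
	b := &breaker{limit: 3, cooldown: time.Second}
	flaky := func() error { return errors.New("subsystem under pressure") }

	// After three real failures, the remaining calls fail fast with
	// errOpen instead of adding load to the unhealthy subsystem.
	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(flaky))
	}
}
```

Failing fast buys the struggling subsystem time to recover and keeps its latency from leaking into every product experience that depends on it.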

What short-term improvements has GitHub made to address bottlenecks?

In the short term, GitHub resolved a variety of bottlenecks that appeared faster than expected. They moved webhooks to a different backend (out of MySQL), redesigned the user session cache, and redid authentication and authorization flows to substantially reduce database load. They also leveraged the migration to Azure to stand up much more compute capacity. These fixes were necessary to keep pace with surging demand while longer-term architectural changes were being planned.
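The announcement doesn't say which backend webhooks moved to, only that they moved out of MySQL. The general shape of such a change is a dedicated delivery queue drained by workers, so webhook fan-out no longer competes with the primary database. In the hypothetical Go sketch below, a buffered channel stands in for whatever real queueing system was chosen:

```go
package main

import (
	"fmt"
	"sync"
)

// event is a webhook delivery waiting to go out.
type event struct {
	repo, kind string
}

// deliveryQueue decouples webhook fan-out from the request path and
// from the primary database: producers enqueue and move on, while a
// pool of workers drains the queue at its own pace.
type deliveryQueue struct {
	ch chan event
	wg sync.WaitGroup
}

func newDeliveryQueue(buffer, workers int, deliver func(event)) *deliveryQueue {
	q := &deliveryQueue{ch: make(chan event, buffer)}
	q.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer q.wg.Done()
			for ev := range q.ch {
				deliver(ev)
			}
		}()
	}
	return q
}

// Enqueue hands a delivery to the workers without blocking the caller,
// as long as the buffer has room.
func (q *deliveryQueue) Enqueue(ev event) { q.ch <- ev }

// Close stops accepting work and waits for in-flight deliveries.
func (q *deliveryQueue) Close() {
	close(q.ch)
	q.wg.Wait()
}

func main() {
	q := newDeliveryQueue(128, 4, func(ev event) {
		fmt.Printf("delivering %s webhook for %s\n", ev.kind, ev.repo)
	})
	for i := 0; i < 8; i++ {
		q.Enqueue(event{repo: fmt.Sprintf("org/repo-%d", i), kind: "push"})
	}
	q.Close()
}
```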

What longer-term architectural changes is GitHub implementing?

GitHub focused on isolating critical services like Git and GitHub Actions from other workloads to minimize blast radius and reduce single points of failure. This started with careful dependency analysis and traffic tiering to understand which pieces needed to be pulled apart and how to shield legitimate traffic from the impact of attacks; risks were then addressed in priority order. They also accelerated the migration of performance- and scale-sensitive code from the Ruby monolith into Go. Finally, while already migrating from smaller custom data centers to the public cloud, they began work on a path to multi-cloud for even greater resilience.
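Traffic tiering can be pictured as a load shedder that admits requests by criticality and drops best-effort work first, so core flows like Git and authentication stay up under pressure. The tiers and load cutoffs in this Go sketch are invented for illustration; GitHub hasn't published its actual policy:

```go
package main

import "fmt"

// tier classifies traffic so that, under pressure, the platform sheds
// the least critical work first and protects core flows.
type tier int

const (
	tierCritical   tier = iota // e.g. git push/pull, authentication
	tierStandard               // e.g. web UI, API reads
	tierBestEffort             // e.g. crawlers, bulk automation
)

// shedder admits a request only if its tier is at least as critical as
// the current admission threshold, which tightens as load rises.
type shedder struct {
	threshold tier
}

func (s *shedder) Admit(t tier) bool { return t <= s.threshold }

// setLoad maps a load fraction to an admission threshold; the cutoffs
// here are illustrative, not GitHub's actual numbers.
func (s *shedder) setLoad(load float64) {
	switch {
	case load > 0.9:
		s.threshold = tierCritical // shed everything but critical traffic
	case load > 0.7:
		s.threshold = tierStandard // drop best-effort work
	default:
		s.threshold = tierBestEffort // admit everything
	}
}

func main() {
	s := &shedder{}
	for _, load := range []float64{0.5, 0.8, 0.95} {
		s.setLoad(load)
		fmt.Printf("load %.2f: critical=%v standard=%v best-effort=%v\n",
			load, s.Admit(tierCritical), s.Admit(tierStandard), s.Admit(tierBestEffort))
	}
}
```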