good
This incident has been resolved.
Nov 19, 12:03 PM UTC
major
We have resolved the issue but are waiting for queues to catch up.
Nov 19, 11:56 AM UTC
major
We're investigating an issue with two-factor authentication in the GitHub mobile app
Nov 19, 11:38 AM UTC
major
We are currently investigating this issue.
Nov 19, 11:36 AM UTC
good
On October 30, 2024, between 5:45 and 9:42 UTC, the Actions service was degraded, causing run delays. On average, Actions workflow run, job, and step updates were delayed by as much as one hour. The delays were caused by updates in a dependent service that led to failures in Redis connectivity. Delays recovered once the Redis cluster connectivity was restored at 8:16 UTC, and the incident was fully mitigated once the job queue had been processed at 9:24 UTC. This incident followed an earlier short period of impact on hosted runners due to a similar issue, which was mitigated by failing over to a healthy cluster.

Following this incident, we are working to improve our observability across Redis clusters to reduce our time to detection and mitigation of issues like this one, where multiple clusters and services are impacted. We are also working to reduce time to mitigation and to improve our general resilience to this dependency.
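As a rough illustration of the cross-cluster observability improvement described above, the sketch below probes several Redis clusters and reports connectivity failures. The cluster names, addresses, and timeout are illustrative assumptions, not our actual topology or tooling.

```python
# Minimal sketch of a cross-cluster Redis connectivity probe.
# Cluster names and addresses are hypothetical examples.
import time
import redis  # pip install redis

CLUSTERS = {
    "actions-results": ("redis-results.internal.example", 6379),
    "actions-queues": ("redis-queues.internal.example", 6379),
}

def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the cluster answers PING within the timeout."""
    try:
        client = redis.Redis(host=host, port=port, socket_timeout=timeout)
        return client.ping()
    except redis.RedisError:
        return False

for name, (host, port) in CLUSTERS.items():
    start = time.monotonic()
    healthy = probe(host, port)
    elapsed_ms = (time.monotonic() - start) * 1000
    # In a real system this would feed alerting rather than print.
    print(f"{name}: {'ok' if healthy else 'FAILING'} ({elapsed_ms:.0f} ms)")
```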
Oct 30, 9:42 AM UTC
minor
We are continuing to investigate delays to status updates to Actions Workflow Runs, Workflow Job Runs, and Check Steps. Customers may see that their Actions workflows have completed, but the run appears to be waiting for its status to update. We will continue providing updates on the progress towards mitigation.
Oct 30, 8:48 AM UTC
minor
We have identified connectivity issues with an internal service causing delays in Actions Workflow Runs, Workflow Job Runs, and Check Steps. We are continuing to investigate.
Oct 30, 8:05 AM UTC
minor
We are investigating reports of degraded performance for Actions
Oct 30, 7:25 AM UTC
good
On October 24, 2024 at 06:55 UTC, a syntactically correct but invalid discussion template YAML config file was committed to the community/community repository. As a result, any user of that repository who tried to access a discussion template or create a discussion received a 500 error response.

We mitigated the incident by manually reverting the invalid template changes.

We are adding support to detect and prevent invalid discussion template YAML from causing user-facing errors in the future.
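As an illustration of that class of check, the sketch below parses a discussion template and flags configurations that are valid YAML but structurally invalid. The required fields it checks are assumptions for illustration only, not GitHub's authoritative template schema.

```python
# Minimal sketch: catch discussion template YAML that parses cleanly but
# is structurally invalid. The required fields are illustrative
# assumptions, not GitHub's actual schema.
import yaml  # pip install pyyaml

def validate_template(text: str) -> list[str]:
    """Return validation errors; an empty list means the template looks sane."""
    try:
        data = yaml.safe_load(text)
    except yaml.YAMLError as exc:
        return [f"not valid YAML: {exc}"]
    if not isinstance(data, dict):
        return ["top level must be a mapping"]
    errors = []
    body = data.get("body")
    if not isinstance(body, list) or not body:
        errors.append("'body' must be a non-empty list of form elements")
    else:
        for i, element in enumerate(body):
            if not isinstance(element, dict) or "type" not in element:
                errors.append(f"body[{i}] must be a mapping with a 'type' key")
    return errors

# Syntactically correct YAML, but 'body' is a string instead of a list:
print(validate_template("body: just some text\n"))
```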
Oct 24, 6:55 AM UTC
minor
We are aware of an issue that is preventing users from creating new posts in Community Discussions (community.github.com). Users may see a 500 error when they attempt to post a new discussion. We are currently working to resolve the issue.
Oct 24, 6:13 AM UTC
minor
We are currently investigating this issue.
Oct 24, 6:12 AM UTC
good
On October 11, 2024, starting at 05:59 UTC, DNS infrastructure in one of our sites started to fail to resolve lookups following a database migration. Attempts to recover the database led to cascading failures that impacted the DNS systems for that site. The team worked to restore the infrastructure and there was no customer impact until 17:31 UTC.

During the incident, impact to the following services could be observed:

- Copilot: Degradation in IDE code completions for 4% of active users during the incident from 17:31 UTC to 21:45 UTC.
- Actions: Workflow run delays (25% of runs delayed by over 5 minutes) and errors (1%) between 20:28 UTC and 21:30 UTC, as well as errors while creating Artifact Attestations.
- Customer migrations: From 18:16 UTC to 23:12 UTC, running migrations stopped and new ones could not start.
- Support: support.github.com was unavailable from 19:28 UTC to 22:14 UTC.
- Code search: 100% of queries failed between 2024-10-11 20:16 UTC and 2024-10-12 00:46 UTC.

Starting at 18:05 UTC, engineering attempted to repoint the degraded site DNS to a different site to restore DNS functionality. At 18:26 UTC the test system had validated this approach and a progressive rollout to the affected hosts proceeded over the next hour. While this mitigation was effective at restoring connectivity within the site, it caused issues with connectivity from healthy sites back to the degraded site, and the team proceeded to plan out a different remediation effort.

At 20:52 UTC, the team finalized a remediation plan and began the next phase of mitigation by deploying temporary DNS resolution capabilities to the degraded site. At 21:46 UTC, DNS resolution in the degraded site began to recover and was fully healthy at 22:16 UTC. Lingering issues with code search were resolved at 01:11 UTC on October 12.

The team continued to restore the original functionality within the site after public service functionality was restored. GitHub is working to harden our resiliency and automation processes around this infrastructure to make diagnosing and resolving issues like this faster in the future.
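For illustration only, the sketch below shows the shape of a per-site DNS health check with a fallback resolver, roughly matching the recovery steps described above. The resolver addresses and probe hostname are hypothetical placeholders, not GitHub's infrastructure.

```python
# Minimal sketch of a per-site DNS health check with a fallback resolver.
# Resolver addresses and the probe name are hypothetical placeholders.
import dns.exception
import dns.resolver  # pip install dnspython

PRIMARY_RESOLVERS = ["10.0.0.53"]    # degraded site's resolvers (example)
FALLBACK_RESOLVERS = ["10.1.0.53"]   # healthy site's resolvers (example)
PROBE_NAME = "internal-service.example.com"

def can_resolve(nameservers: list[str], name: str, timeout: float = 2.0) -> bool:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = nameservers
    try:
        resolver.resolve(name, "A", lifetime=timeout)
        return True
    except dns.exception.DNSException:
        return False

if can_resolve(PRIMARY_RESOLVERS, PROBE_NAME):
    print("primary site DNS healthy")
elif can_resolve(FALLBACK_RESOLVERS, PROBE_NAME):
    # Repointing to another site restores resolution, but (as in this
    # incident) it can break connectivity back into the degraded site.
    print("primary failing; fallback resolvers answer")
else:
    print("DNS resolution failing in both sites")
```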
Oct 12, 1:11 AM UTC
major
We’re continuing to work towards recovery of code search service.
Oct 12, 12:46 AM UTC
major
We’ve identified the issue with code search and are working towards recovery of service.
Oct 12, 12:14 AM UTC
major
We’re continuing to investigate issues with code search.
Oct 11, 11:31 PM UTC
major
We’re continuing to investigate issues with code search. Copilot and Actions services are recovered and operating normally.
Oct 11, 10:57 PM UTC
major
Copilot is operating normally.
Oct 11, 10:16 PM UTC
major
We are rolling out a fix to address the network connectivity issues. Copilot is seeing recovery. support.github.com is recovered.
Oct 11, 10:14 PM UTC
major
Actions is operating normally.
Oct 11, 9:46 PM UTC
major
We continue to work on mitigations. Actions is starting to see recovery.
Oct 11, 9:28 PM UTC
major
The mitigation attempt did not resolve the issue and we are working on a different resolution path. In addition to the previously listed impacts, some Actions runs will see delays in starting.
Oct 11, 8:52 PM UTC
major
Actions is experiencing degraded performance. We are continuing to investigate.
Oct 11, 8:48 PM UTC
major
We continue to work on mitigations. In addition to previously listed impact, code search is also unavailable.
Oct 11, 8:15 PM UTC
major
A mitigation for the network connectivity issues is being tested.
Oct 11, 8:05 PM UTC
major
We continue to work on mitigations to restore network connectivity. In addition to the previously listed impact, access to support.github.com is also impacted.
Oct 11, 7:28 PM UTC
major
We have identified the problem and are working on mitigations. In addition to previously listed impact, new Artifact Attestations cannot be created.
Oct 11, 7:05 PM UTC
major
We have identified the problem is related to maintenance performed in our networking infrastructure. We are working to bring back the connectivity.

Copilot users in organizations or enterprises that have opted into the Content Exclusions feature will experience disabled completions in their editors.

Customer migrations remain paused as well.
Oct 11, 6:41 PM UTC
major
We are investigating network connectivity issues. Some Copilot customers will see errors on API calls and experiences. We have also paused the remaining customer migration queue while we investigate due to an increase in errors.
Oct 11, 6:25 PM UTC
major
We are investigating reports of issues with service(s): Copilot. We will continue to keep users updated on progress towards mitigation.
Oct 11, 5:58 PM UTC
major
Copilot is experiencing degraded availability. We are continuing to investigate.
Oct 11, 5:56 PM UTC
major
We are currently investigating this issue.
Oct 11, 5:53 PM UTC
good
This incident has been resolved.
Oct 08, 11:32 PM UTC
minor
Codespaces is operating normally.
Oct 08, 11:32 PM UTC
minor
Codespace creation has been remediated in this region.
Oct 08, 11:32 PM UTC
minor
We are once again seeing signs of increased latency for codespace creation in this region, but are at the same time recovering previously unavailable resources.
Oct 08, 10:54 PM UTC
minor
Recovery continues slowly, and we are investigating strategies to speed up the recovery process.
Oct 08, 10:10 PM UTC
minor
We are continuing to see gradual recovery in the region and continue to validate the persistent fix.
Oct 08, 9:39 PM UTC
minor
The persistent fix has been applied, and we are beginning to see improvements in the region. We are still working on follow-on effects, however, and expect recovery to be gradual.
Oct 08, 9:06 PM UTC
minor
We are nearing full application of the persistent fix and will provide more updates soon.
Oct 08, 8:26 PM UTC
minor
Mitigations we have put in place are yielding improvements in Codespace creation success rates in the affected region. We expect full recovery once the persistent fix fully rolls out.
Oct 08, 7:51 PM UTC
minor
We are continuing to work on mitigations while the more persistent fix rolls out.
Oct 08, 7:17 PM UTC
minor
We are continuing to apply mitigations while we deploy the more persistent fix. Full recovery is expected in 2 hours or less, but more updates will be coming soon.
Oct 08, 6:44 PM UTC
minor
We have applied some mitigations that are improving creation success rates while we work on the more comprehensive fix.
Oct 08, 6:08 PM UTC
minor
We have identified a possible root cause and are working on the fix.
Oct 08, 5:43 PM UTC
minor
Some Codespaces are failing to create successfully in the Western EU region. Investigation is ongoing.
Oct 08, 5:11 PM UTC
minor
Codespaces is experiencing degraded performance. We are continuing to investigate.
Oct 08, 5:08 PM UTC
minor
We are currently investigating this issue.
Oct 08, 5:02 PM UTC
good
On September 30th, 2024 from 10:43 UTC to 11:26 UTC Codespaces customers in the Central India region were unable to create new Codespaces. Resumes were not impacted. Additionally, there was no impact to customers in other regions.

The cause was traced to storage capacity constraints in the region and was mitigated by temporarily redirecting create requests to other regions. Afterwards, additional storage capacity was added to the region and traffic was routed back.

A bug was also identified that caused some available capacity to not be utilized, artificially constraining capacity and halting creations in the region prematurely. We have since fixed this bug as well, so that available capacity scales as expected according to our capacity planning projections.
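The sketch below illustrates the capacity-aware redirection described above: create requests stay in the home region while it reports headroom and are temporarily routed elsewhere when it does not. Region names and capacity figures are illustrative assumptions.

```python
# Minimal sketch of capacity-aware routing for create requests.
# Region names and capacity figures are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    used: int
    capacity: int

    @property
    def has_headroom(self) -> bool:
        # The bug described above effectively under-reported capacity,
        # making this check fail earlier than it should have.
        return self.used < self.capacity

def pick_region(home: Region, fallbacks: list[Region]) -> Region:
    """Prefer the home region; redirect creates only when it is full."""
    for region in [home, *fallbacks]:
        if region.has_headroom:
            return region
    raise RuntimeError("no region has available capacity")

central_india = Region("central-india", used=1000, capacity=1000)   # exhausted
southeast_asia = Region("southeast-asia", used=400, capacity=1000)
print(pick_region(central_india, [southeast_asia]).name)  # -> southeast-asia
```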
Sep 30, 11:26 AM UTC
major
Codespaces is operating normally.
Sep 30, 11:26 AM UTC
major
We are seeing signs of recovery in Codespaces creations and starts. We are continuing to monitor for full recovery.
Sep 30, 11:25 AM UTC
major
Codespaces is experiencing degraded performance. We are continuing to investigate.
Sep 30, 11:24 AM UTC
major
We are investigating a high number of errors in Codespaces creation and start.
Sep 30, 11:09 AM UTC
major
We are investigating reports of degraded availability for Codespaces
Sep 30, 11:08 AM UTC
good
Between September 27, 2024, 15:26 UTC and September 27, 2024, 15:34 UTC the Repositories Releases service was degraded. During this time 9% of requests to list releases via the API or the webpage received a `500 Internal Server Error` response. This was due to a bug in our software rollout strategy. The rollout was reverted starting at 15:30 UTC, which began to restore functionality, and the rollback was completed at 15:34 UTC. We are continuing to improve our testing infrastructure to ensure that bugs such as this one can be detected before they make their way into production.
Oct 03, 5:37 PM UTC
good
Between September 25, 2024, 22:20 UTC and September 26, 2024, 5:00 UTC the Copilot service was degraded. During this time Copilot chat requests failed at an average rate of 15%.

This was due to a faulty deployment in a service provider that caused server errors from multiple regions. Traffic was routed away from those regions at 22:28 UTC and 23:39 UTC, which partially restored functionality, while the upstream service provider rolled back their change. The rollback was completed at 04:41 UTC.

We are continuing to improve our ability to respond more quickly to similar issues through faster regional redirection and working with our upstream provider on improved monitoring.
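As a rough sketch of the regional redirection described above, the snippet below routes traffic only to regions whose recent server-error rate is under a threshold. The region names and the 5% threshold are assumptions for illustration, not actual routing rules.

```python
# Minimal sketch of error-rate-based regional redirection.
# Region names and the 5% threshold are illustrative assumptions.
ERROR_RATE_THRESHOLD = 0.05

def healthy_regions(error_rates: dict[str, float]) -> list[str]:
    """Regions whose recent server-error rate is below the threshold."""
    return [region for region, rate in error_rates.items()
            if rate < ERROR_RATE_THRESHOLD]

observed = {"region-a": 0.31, "region-b": 0.02, "region-c": 0.27}
routable = healthy_regions(observed)
if routable:
    print("routing chat requests to:", ", ".join(sorted(routable)))
else:
    print("all regions degraded; waiting on upstream rollback")
```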

Sep 26, 5:08 AM UTC
minor
Monitors continue to see improvements. We are declaring full recovery.
Sep 26, 5:08 AM UTC
minor
Copilot is operating normally.
Sep 26, 5:03 AM UTC
minor
We've applied a mitigation to fix the issues and are seeing improvements in telemetry. We are monitoring for full recovery.
Sep 26, 3:51 AM UTC
minor
We believe we have identified the root cause of the issue and are monitoring to ensure the problem does not recur.
Sep 26, 2:34 AM UTC
minor
We are continuing to investigate the root cause of the latency previously observed, to ensure it does not recur and to improve stability going forward.

Sep 26, 1:46 AM UTC
minor
We are continuing to investigate the root cause of the latency previously observed, to ensure it does not recur and to improve stability going forward.
Sep 26, 1:03 AM UTC
minor
Copilot users should no longer see request failures. We are still investigating the root cause of the issue to ensure that the experience will remain uninterrupted.
Sep 26, 12:29 AM UTC
minor
We are seeing recovery for requests to Copilot API in affected regions, and are continuing to investigate to ensure the experience remains stable.
Sep 25, 11:55 PM UTC
minor
We have noticed a degradation in performance of Copilot API in some regions. This may result in latency or failed responses to requests to Copilot. We are investigating mitigation options.

Sep 25, 11:40 PM UTC
minor
We are investigating reports of degraded performance for Copilot
Sep 25, 11:39 PM UTC
good
On September 25th, 2024 from 18:32 UTC to 19:13 UTC, the Actions service experienced a degradation during a production deployment, leading to actions failing to download at the start of a job. On average, 21% of Actions workflow runs failed to start during the incident. The issue was traced back to a bug in an internal service responsible for generating the URLs used by the Actions runner to download actions.

To mitigate the impact, we rolled back the offending deployment. We are implementing new monitors to improve our detection and response time for this class of issues in the future.
Sep 25, 7:19 PM UTC
minor
We're seeing issues related to Actions runs failing to download actions at the start of a job. We're investigating the cause and working on mitigations for customers impacted by this issue.
Sep 25, 7:14 PM UTC
minor
We are investigating reports of degraded performance for Actions and Pages
Sep 25, 7:11 PM UTC
good
On September 25, 2024 from 14:31 UTC to 15:06 UTC the Git Operations service experienced a degradation, leading to 1,381,993 failed git operations. The overall error rate during this period was 4.2%, with a peak error rate of 12.5%.

The root cause was traced to a bug in a build script for a component that runs on the file servers that host git repository data. The build script incurred an error that did not cause the overall build process to fail, resulting in a faulty set of artifacts being deployed to production.

To mitigate the impact, we rolled back the offending deployment.

To prevent further occurrences of this cause in the future, we will be addressing the underlying cause of the ignored build failure and improving metrics and alerting for the resulting production failure scenarios.
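The ignored-failure mode described above is easy to reproduce in any build driver that does not check step exit codes. The sketch below uses hypothetical build steps, not our actual build system, to show the difference between ignoring and enforcing step failures.

```python
# Minimal sketch of the ignored-build-failure mode: unchecked exit codes
# let a failed step still produce "successful" (but faulty) artifacts.
# Step commands are hypothetical placeholders.
import subprocess
import sys

STEPS = [
    ["python", "-c", "print('compiling component')"],
    ["python", "-c", "import sys; sys.exit(1)"],  # this step fails
    ["python", "-c", "print('packaging artifacts')"],
]

def run_build(fail_fast: bool) -> None:
    for step in STEPS:
        result = subprocess.run(step)
        if fail_fast and result.returncode != 0:
            sys.exit(f"build step failed: {step!r}")
    print("build finished")

# With fail_fast=False the failing step is silently ignored and faulty
# artifacts would ship; with fail_fast=True the build aborts instead.
run_build(fail_fast=False)
```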
Sep 25, 4:03 PM UTC
minor
We are investigating reports of issues with both Actions and Packages, related to a brief period of time where specific Git Operations were failing. We will continue to keep users updated on progress towards mitigation.
Sep 25, 3:34 PM UTC
minor
We are investigating reports of degraded performance for Git Operations
Sep 25, 3:25 PM UTC
good
On September 24th, 2024 from 08:20 UTC to 09:04 UTC the Codespaces service experienced an interruption in network connectivity, leading to 175 codespaces being unable to be created or resumed. The overall error rate during this period was 25%.

The cause was traced to SNAT port exhaustion following a deployment, which caused individual Codespaces to lose their network connection to the service.

To mitigate the impact, we increased port allocations to provide enough buffer for the increased outbound connections seen shortly after deployments. We will be scaling up our outbound connectivity in the near future and adding improved monitoring of network capacity to prevent future regressions.
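As a sketch of the kind of network-capacity monitoring mentioned above, the snippet below counts established outbound TCP connections on a host and warns as usage approaches an assumed SNAT port budget. Both the budget and the warning threshold are illustrative assumptions, not actual service limits.

```python
# Minimal sketch of outbound-connection monitoring to catch SNAT port
# pressure. The port budget and warning threshold are assumptions.
import psutil  # pip install psutil

SNAT_PORT_BUDGET = 1024   # assumed outbound port allocation per host
WARN_FRACTION = 0.8

def outbound_connection_count() -> int:
    """Count established TCP connections that have a remote address."""
    conns = psutil.net_connections(kind="tcp")
    return sum(1 for c in conns
               if c.status == psutil.CONN_ESTABLISHED and c.raddr)

used = outbound_connection_count()
fraction = used / SNAT_PORT_BUDGET
message = f"outbound connections: {used} ({fraction:.0%} of assumed budget)"
print("WARNING: " + message if fraction >= WARN_FRACTION else message)
```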
Sep 24, 9:04 PM UTC
major
Codespaces is operating normally.
Sep 24, 9:04 PM UTC
major
We have successfully mitigated the issue affecting create and resume requests for Codespaces. Early signs of recovery are being observed in the impacted region.
Sep 24, 9:01 PM UTC
major
Codespaces is experiencing degraded performance. We are continuing to investigate.
Sep 24, 9:00 PM UTC
major
We are investigating issues with Codespaces in the US East geographic area. Some users may not be able to create or start their Codespaces at this time. We will update you on mitigation progress.
Sep 24, 8:56 PM UTC
major
We are investigating reports of degraded availability for Codespaces
Sep 24, 8:54 PM UTC