Cloudflare System Status

Network Performance Issues in multiple locations
Incident Report for Cloudflare
Postmortem

Transit Provider Backbone Incident
2017-05-02

Incident Description

One of our transit providers suffered close to 100% connectivity loss between most EU and NA locations. We also saw some loss over this provider between the West and East coasts of the USA. Because packet loss increased so quickly, much of the data our network automation system needs to function properly was not reported back to our collectors. As only some of the data was received, this system attempted to disable origin pulls via the affected provider and use other providers instead. Because many destinations were unreachable, the desired actions could not be taken in time to avert an impact.
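
The failure mode described above, where missing telemetry delays a decision, can be illustrated with a minimal sketch. This is not Cloudflare's automation system: the function name `should_disable_provider`, the thresholds, and the option to count unreported probes as lost are all assumptions made purely for illustration.

```python
from typing import Optional, Sequence


def should_disable_provider(
    loss_reports: Sequence[Optional[float]],
    loss_threshold: float = 0.5,      # hypothetical: a probe at or above this loss is "bad"
    min_bad_fraction: float = 0.5,    # hypothetical: share of bad probes needed to act
    count_missing_as_loss: bool = True,
) -> bool:
    """Decide whether to stop origin pulls via a transit provider.

    loss_reports holds one packet-loss ratio (0.0 to 1.0) per probe target,
    or None when the probe's report never reached the collector.
    """
    bad = 0
    counted = 0
    for report in loss_reports:
        if report is None:
            # A report that never arrived may itself be a symptom of the outage.
            if count_missing_as_loss:
                bad += 1
                counted += 1
            continue
        counted += 1
        if report >= loss_threshold:
            bad += 1
    if counted == 0:
        # No usable evidence either way, so take no action.
        return False
    return bad / counted >= min_bad_fraction


# Three reports lost in transit, one showing heavy loss, two looking healthy.
reports = [None, None, None, 0.9, 0.0, 0.0]
print(should_disable_provider(reports, count_missing_as_loss=True))   # True
print(should_disable_provider(reports, count_missing_as_loss=False))  # False
```

With the same partial data, ignoring missing reports leaves the provider enabled, while treating silence as a loss signal triggers the disable. Whether that trade-off suits a real network depends on how often reports go missing for unrelated reasons.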

Customer Impact

This incident affected all traffic routed over this provider. It manifested as 522 response codes, generated when we were unable to reach customer origin servers. In addition, there were general reachability issues, as visitors routed over this network were unable to reach our edge network.

Timeline of events

Time | Points of Presence (Colo) | Services | Description
2017-05-02 14:41 UTC | All EU and NA | All | Poor connectivity via affected transit provider. 522s served if this transit provider is in the path between the Cloudflare colo and the customer's origin.

IMPACT START
2017-05-02 14:46 UTC | LHR, VIE, HAM, MXP, MRS, OTP, OSL, DXB, ATL, EWR | | Our network automation system disabled origin pulls via the impacted transit provider

IMPACT DOWNGRADE
2017-05-02 14:51 UTC | DME | | Manually disabled as this colo only has transit over the affected provider.

IMPACT DOWNGRADE
2017-05-02 14:51 UTC | YUL | | Network automation system disabled origin pulls via the impacted transit provider

IMPACT DOWNGRADE
2017-05-02 14:55 UTC | BCN | | Manually dropped colo

IMPACT DOWNGRADE
2017-05-02 14:57 UTC | DUB, BOS, MSP, STL, YYZ, DEN, ORD, DFW, SJC, SEA, LAX, FRA | | Network automation system disabled origin pulls via the impacted transit provider

IMPACT DOWNGRADE
2017-05-02 14:57 UTC | ALL | | Transit issue is essentially resolved and our network automation system was able to take a large number of actions at this time.

IMPACT DOWNGRADE
2017-05-02 14:59 UTC | KBP | | Manually dropped colo
2017-05-02 15:24 UTC | BCN, DME | | Anycast enabled manually

IMPACT END
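
To make the intervals in the timeline explicit, the elapsed time from the first entry can be computed directly from the timestamps. The sketch below uses only timestamps that appear in the timeline; the short labels and the helper name `minutes_since_start` are shorthand introduced here, not part of the original report.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; labels are paraphrased shorthand.
TIMELINE = [
    ("2017-05-02 14:41", "Poor connectivity begins"),
    ("2017-05-02 14:46", "First automated disable of origin pulls"),
    ("2017-05-02 14:57", "Transit issue essentially resolved"),
    ("2017-05-02 15:24", "Anycast re-enabled for BCN and DME"),
]


def minutes_since_start(timeline):
    """Return (minutes elapsed since the first entry, label) for each entry."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    result = []
    for stamp, label in timeline:
        elapsed = datetime.strptime(stamp, fmt) - start
        result.append((int(elapsed.total_seconds() // 60), label))
    return result


for minutes, label in minutes_since_start(TIMELINE):
    print(f"+{minutes:>2} min  {label}")
```

Run as written, this reports offsets of 0, 5, 16 and 43 minutes: the first automated action came 5 minutes after connectivity degraded, and the last manual step completed 43 minutes after it.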

Resolution

The root cause was resolved by our upstream transit provider. During the incident we disabled origin pulls that were routed over the affected provider, and re-enabled them once the incident had been resolved.

Recommendations

We are investigating the possibility of localising our network automation so that, if a provider outage causes a colo to be unreachable, the colo can re-route itself to avoid further impact.
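
As a rough illustration of what a localised decision might look like, the sketch below has a colo choose a transit provider using only loss figures it measured itself, with no dependency on a central collector. It is an assumption-laden example, not a description of any planned Cloudflare design: `pick_local_provider`, the provider names, and the `max_loss` threshold are all hypothetical.

```python
from typing import Dict, Optional


def pick_local_provider(
    local_loss: Dict[str, float],
    current: str,
    max_loss: float = 0.2,  # hypothetical: highest loss ratio still considered usable
) -> Optional[str]:
    """Return the provider this colo should use for origin pulls.

    local_loss maps a provider name to the packet-loss ratio (0.0 to 1.0)
    measured from inside the colo. Returns None when no provider is usable.
    """
    # Stay on the current provider while it is healthy, to avoid flapping.
    # A provider with no measurement is treated as fully lossy.
    if local_loss.get(current, 1.0) <= max_loss:
        return current
    healthy = {name: loss for name, loss in local_loss.items() if loss <= max_loss}
    if not healthy:
        # Nothing usable, e.g. a colo with a single transit provider.
        # The caller might then withdraw the colo from service.
        return None
    return min(healthy, key=healthy.get)


print(pick_local_provider({"transit_a": 0.98, "transit_b": 0.01}, "transit_a"))  # transit_b
print(pick_local_provider({"transit_a": 0.98}, "transit_a"))                     # None
```

The `None` case corresponds to a colo that only has transit over the failing provider, where switching providers is not an option and dropping the colo is the remaining choice.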

Posted May 03, 2017 - 23:10 UTC

Resolved
This issue has been resolved and service has returned to normal.
Posted May 02, 2017 - 15:53 UTC
Monitoring
We have implemented a fix for this issue and are currently monitoring the results. We will update once we have confirmed it is resolved.
Posted May 02, 2017 - 14:59 UTC
Identified
The issue is related to a specific transit provider and we are working on temporarily disabling this provider to route around the issue.
Posted May 02, 2017 - 14:55 UTC
Investigating
Cloudflare is observing network performance issues in multiple locations. We are actively working to reduce or eliminate any impact to Internet users in these locations.
Posted May 02, 2017 - 14:46 UTC
This incident affected:
Europe (Amsterdam, Netherlands - (AMS), Barcelona, Spain - (BCN), Belgrade, Serbia - (BEG), Brussels, Belgium - (BRU), Frankfurt, Germany - (FRA), Hamburg, Germany - (HAM), London, United Kingdom - (LHR), Madrid, Spain - (MAD), Marseille, France - (MRS), Milan, Italy - (MXP), Moscow, Russia - (DME), Oslo, Norway - (OSL), Paris, France - (CDG), Stockholm, Sweden - (ARN), Vienna, Austria - (VIE), Warsaw, Poland - (WAW), Zürich, Switzerland - (ZRH)),
North America (Ashburn, VA, United States - (IAD), Atlanta, GA, United States - (ATL), Boston, MA, United States - (BOS), Chicago, IL, United States - (ORD), Dallas, TX, United States - (DFW), Denver, CO, United States - (DEN), Kansas City, MO, United States - (MCI), Los Angeles, CA, United States - (LAX), Miami, FL, United States - (MIA), Minneapolis, MN, United States - (MSP), Montréal, QC, Canada - (YUL), San Jose, CA, United States - (SJC), Seattle, WA, United States - (SEA), Toronto, ON, Canada - (YYZ), Vancouver, BC, Canada - (YVR)),
Asia (Bangkok, Thailand - (BKK), Chennai, India - (MAA), Manila, Philippines - (MNL), Taipei, Taiwan - (TPE), Tokyo, Japan - (NRT)),
Africa (Cape Town, South Africa - (CPT), Johannesburg, South Africa - (JNB)),
Oceania (Melbourne, VIC, Australia - (MEL), Perth, WA, Australia - (PER), Sydney, NSW, Australia - (SYD)),
Middle East (Doha, Qatar - (DOH), Dubai, United Arab Emirates - (DXB)),
and Latin America & the Caribbean (Medellín, Colombia - (MDE)).