This postmortem details a Cloudflare outage caused by an internal software bug leading to the unintentional withdrawal of customer Bring Your Own IP (BYOIP) prefixes via BGP. It highlights system design flaws in automated processes, configuration management, and recovery mechanisms, offering critical lessons in building resilient distributed systems.
Read original on Cloudflare Blog

Cloudflare experienced a significant outage where a software bug in an automated cleanup sub-task led to the withdrawal of approximately 1,100 customer Bring Your Own IP (BYOIP) prefixes via Border Gateway Protocol (BGP). This incident underscores the critical importance of robust change management, comprehensive testing, and resilient recovery strategies in large-scale distributed systems, especially when dealing with core network functionalities.
The outage was triggered by a new automated sub-task, part of Cloudflare's 'Code Orange: Fail Small' initiative, intended to remove unneeded BYOIP prefixes. A bug in the API query within this sub-task caused it to misinterpret a request for 'pending deletion' prefixes as a request for *all* BYOIP prefixes. Consequently, the system marked all returned prefixes for deletion, systematically withdrawing them from Cloudflare's network.
```go
// Client side: the cleanup sub-task passes pending_delete with no value.
resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)

// Server side: Get returns "" for a bare flag, so this branch is skipped.
if v := req.URL.Query().Get("pending_delete"); v != "" {
	// ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
	prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
	// ...
}
```

API Design and Implementation Pitfalls
The bug highlights a critical API design flaw: the handler could not distinguish a query parameter that was present without a value (e.g., `?pending_delete`) from one that was absent. `Query().Get("pending_delete")` returned an empty string for the bare flag, so the `v != ""` check failed, the pending-deletion branch was skipped, and the request was handled as a request for *all* prefixes rather than only those explicitly marked `pending_delete=true` or similar. The cleanup sub-task then treated every returned prefix as queued for deletion, which turned a read query into an unintended destructive operation. Strict parameter validation and clear semantic interpretation are crucial for robust APIs.
Cloudflare's Addressing API acts as the authoritative source of truth for IP addresses, with changes immediately propagating to the global network. The process involves customer signals, API instructions to change prefix advertisements, BGP updates on routers, and service bindings to assign products to these ranges. The incident exposed weaknesses in this propagation and configuration management pipeline, particularly where manual processes still existed or where automated processes lacked sufficient validation.