This article delves into the foundational architecture and scaling strategies behind AWS S3, a massive multi-tenant object storage service. It explores key components like the storage fleet, redundancy through erasure coding, and the critical role of parallelism and workload decorrelation in managing I/O at extreme scale, culminating in the adoption of strong read-after-write consistency.
AWS S3 is a cornerstone of cloud storage, designed to handle trillions of objects and millions of requests per second. Its longevity and continuous evolution (e.g., Glacier, Intelligent-Tiering, Object Lambda, S3 Express) highlight its adaptable architecture. At its core, S3 operates as a multi-tenant object storage service with an HTTP REST API, built from over 300 microservices. This organizational structure, often described as adhering to Conway's Law, segments S3 into four high-level services: a front-end fleet, a namespace service, a storage fleet, and a storage management fleet. Each operates as an independent business unit, interacting via strict API contracts.
The foundation of S3's storage is its vast fleet of hard disk drives (HDDs). Despite HDDs' inherent limitations in IOPS and latency, S3 achieves tolerable performance by heavily leveraging parallelism. The core storage nodes are simple key-value stores built on AWS's custom 'ShardStore' backend, which uses a log-structured merge tree (LSM tree) optimized for HDD I/O. To ensure durability and availability, S3 employs Erasure Coding (EC), a redundancy scheme that splits an object into K data shards and computes M additional parity shards; the object can be fully reconstructed from any K of the K+M shards, so up to M shards can be lost without data loss. This approach balances capacity efficiency with I/O flexibility better than simple data replication.
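To make the shard-and-parity idea concrete, here is a deliberately simplified sketch with a single XOR parity shard (M = 1). S3's real codes (Reed-Solomon-style) support multiple parity shards, and the helper names below are illustrative, not S3's actual interfaces:

```python
# Simplified erasure-coding sketch: split data into K equal shards plus
# one XOR parity shard (M = 1). Real schemes such as Reed-Solomon allow
# M > 1; this only illustrates the recovery principle.
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k padded shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(k * shard_len, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(_xor, shards)]

def recover(shards: list) -> list:
    """Rebuild at most one missing shard by XOR-ing all survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    assert len(missing) <= 1, "single-parity code tolerates one lost shard"
    if missing:
        shards[missing[0]] = reduce(_xor, (s for s in shards if s is not None))
    return shards

shards = encode(b"hello, durable world!", k=4)
shards[2] = None                # simulate a failed drive
restored = recover(shards)      # data shard 2 is rebuilt from the rest
```

Joining the first K restored shards (and dropping padding) yields the original bytes, even though one shard was destroyed.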
Erasure Coding vs. Replication
While replication (e.g., 3x copies) is simpler for durability, Erasure Coding provides similar fault tolerance with significantly less storage overhead. For example, a (10,6) EC scheme (10 data shards, 6 parity shards) allows for the loss of any 6 shards with only 60% overhead, compared to 200% overhead for 3x replication.
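The overhead figures follow from a simple ratio of redundant bytes to data bytes, which a few lines can verify (function names here are just for illustration):

```python
def storage_overhead_ec(k: int, m: int) -> float:
    """Parity bytes stored per data byte under a (k, m) erasure code."""
    return m / k

def storage_overhead_replication(copies: int) -> float:
    """Extra full copies stored per data byte under n-way replication."""
    return copies - 1

print(f"(10,6) EC overhead:      {storage_overhead_ec(10, 6):.0%}")
print(f"3x replication overhead: {storage_overhead_replication(3):.0%}")
```

The (10,6) scheme stores 16 shards for every 10 shards of data (60% overhead), while three full copies store 30 for every 10 (200% overhead), yet both tolerate multiple simultaneous failures.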
A major challenge at S3's scale is managing I/O demand to prevent hotspots. S3 addresses this by spreading data shards broadly across millions of physical drives, which yields hot-spot avoidance, greater burst I/O capacity, and enhanced durability. Critically, S3 benefits from 'workload decorrelation' due to its multi-tenancy: by aggregating millions of distinct, often idle workloads, the system observes a remarkably smooth and predictable aggregate demand, simplifying load balancing across disks. Parallelism is key, both across client connections (spreading requests over different S3 endpoints) and within a single operation (e.g., multipart uploads and parallel ranged downloads) to maximize throughput to a single object.
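Intra-operation parallelism can be sketched as fetching one large object as many byte ranges concurrently, the way multipart-aware clients do. In this sketch `fetch_range` is a local stand-in for an HTTP GET with a `Range` header, not a real S3 call:

```python
# Sketch of intra-operation parallelism: read one large object as many
# concurrent byte ranges, as an S3 client does for parallel downloads.
from concurrent.futures import ThreadPoolExecutor

OBJECT = bytes(range(256)) * 1024   # 256 KiB stand-in for a stored object
PART_SIZE = 64 * 1024               # fetch in 64 KiB parts

def fetch_range(start: int, end: int) -> bytes:
    # A real client would issue: GET /key with "Range: bytes=start-end"
    return OBJECT[start:end]

def parallel_get(size: int, part_size: int, workers: int = 8) -> bytes:
    ranges = [(off, min(off + part_size, size))
              for off in range(0, size, part_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)

data = parallel_get(len(OBJECT), PART_SIZE)
```

Each range lands on different spindles because shards are spread across the fleet, so the parallel reads aggregate the IOPS of many slow disks into high single-object throughput.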
A significant architectural evolution for S3 was the introduction of strong read-after-write consistency in 2020. This guarantees that any read request issued after a successful write will always return the latest version of the object. The change was delivered without impacting performance, availability, or cost, reflecting a careful re-engineering of S3's metadata subsystem, which is maintained separately from the object data storage.
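The contract itself can be stated as a toy model: once a PUT is acknowledged, every subsequent GET must observe that write or a newer one. This single-node sketch is only an illustration of the guarantee's semantics; S3 enforces the same contract across a distributed metadata subsystem:

```python
# Toy model of read-after-write consistency: a PUT is acknowledged only
# once the write is visible, so any later GET returns the latest version.
import threading

class StrongStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # key -> (version, data)

    def put(self, key: str, data: bytes) -> int:
        with self._lock:  # acknowledge only after the write is visible
            version = self._objects.get(key, (0, b""))[0] + 1
            self._objects[key] = (version, data)
            return version

    def get(self, key: str) -> tuple:
        with self._lock:
            return self._objects[key]

store = StrongStore()
first = store.put("photos/cat.jpg", b"v1")
store.put("photos/cat.jpg", b"v2")
version, data = store.get("photos/cat.jpg")   # always the latest write
```

Under the earlier eventual-consistency model, the final GET could have legally returned `b"v1"` for a while; the 2020 change rules that out.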