This article delves into the foundational architecture and scaling strategies behind AWS S3, a massive multi-tenant object storage service. It explores key components like the storage fleet, redundancy through erasure coding, and the critical role of parallelism and workload decorrelation in managing I/O at extreme scale, culminating in the adoption of strong read-after-write consistency.
AWS S3 is a cornerstone of cloud storage, designed to handle trillions of objects and millions of requests per second. Its longevity and continuous evolution (e.g., Glacier, Intelligent-Tiering, Object Lambda, S3 Express) highlight its adaptable architecture. At its core, S3 operates as a multi-tenant object storage service with an HTTP REST API, built from over 300 microservices. This organizational structure, often described as adhering to Conway's Law, segments S3 into four high-level services: a front-end fleet, a namespace service, a storage fleet, and a storage management fleet. Each operates as an independent business unit, interacting via strict API contracts.
The foundation of S3's storage is its vast fleet of hard disk drives (HDDs). Despite HDDs' inherent limitations in IOPS and latency, S3 achieves tolerable performance by heavily leveraging parallelism. The core storage nodes are simple key-value stores built on AWS's custom 'ShardStore' backend, which uses a log-structured merge tree (LSM tree) optimized for HDD I/O. To ensure durability and availability, S3 employs Erasure Coding (EC), a redundancy scheme that splits an object into K data shards and computes M additional parity shards; the object can be fully reconstructed from any K of the K+M shards, so up to M shards can be lost without data loss. This approach balances capacity efficiency with I/O flexibility better than simple data replication.
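To make the shard-and-parity idea concrete, here is a deliberately simplified sketch with a single XOR parity shard (M = 1). S3's real codes (Reed-Solomon-style) support multiple parity shards, and the helper names below are illustrative, not S3's actual interfaces:

```python
# Simplified erasure-coding sketch: split data into K equal shards plus
# one XOR parity shard (M = 1). Real schemes such as Reed-Solomon allow
# M > 1; this only illustrates the recovery principle.
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k padded shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(k * shard_len, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(_xor, shards)]

def recover(shards: list) -> list:
    """Rebuild at most one missing shard by XOR-ing all survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    assert len(missing) <= 1, "single-parity code tolerates one lost shard"
    if missing:
        shards[missing[0]] = reduce(_xor, (s for s in shards if s is not None))
    return shards

shards = encode(b"hello, durable world!", k=4)
shards[2] = None                # simulate a failed drive
restored = recover(shards)      # data shard 2 is rebuilt from the rest
```

Joining the first K restored shards (and dropping padding) yields the original bytes, even though one shard was destroyed.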
Erasure Coding vs. Replication
While replication (e.g., 3x copies) is simpler for durability, Erasure Coding provides similar fault tolerance with significantly less storage overhead. For example, a (10,6) EC scheme (10 data shards, 6 parity shards) allows for the loss of any 6 shards with only 60% overhead, compared to 200% overhead for 3x replication.
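The overhead figures follow from a simple ratio of redundant bytes to data bytes, which a few lines can verify (function names here are just for illustration):

```python
def storage_overhead_ec(k: int, m: int) -> float:
    """Parity bytes stored per data byte under a (k, m) erasure code."""
    return m / k

def storage_overhead_replication(copies: int) -> float:
    """Extra full copies stored per data byte under n-way replication."""
    return copies - 1

print(f"(10,6) EC overhead:      {storage_overhead_ec(10, 6):.0%}")
print(f"3x replication overhead: {storage_overhead_replication(3):.0%}")
```

The (10,6) scheme stores 16 shards for every 10 shards of data (60% overhead), while three full copies store 30 for every 10 (200% overhead), yet both tolerate multiple simultaneous failures.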
A major challenge at S3's scale is managing I/O demand to prevent hotspots. S3 addresses this by spreading data shards broadly across millions of physical drives, which yields hot-spot avoidance, greater burst I/O capacity, and enhanced durability. Critically, S3 benefits from 'workload decorrelation' due to its multi-tenancy: by aggregating millions of distinct, often idle workloads, the system observes a remarkably smooth and predictable aggregate demand, simplifying load balancing across disks. Parallelism is key, both across client connections (spreading requests over different S3 endpoints) and within a single operation (e.g., multipart uploads and parallel ranged downloads) to maximize throughput to a single object.
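Intra-operation parallelism can be sketched as fetching one large object as many byte ranges concurrently, the way multipart-aware clients do. In this sketch `fetch_range` is a local stand-in for an HTTP GET with a `Range` header, not a real S3 call:

```python
# Sketch of intra-operation parallelism: read one large object as many
# concurrent byte ranges, as an S3 client does for parallel downloads.
from concurrent.futures import ThreadPoolExecutor

OBJECT = bytes(range(256)) * 1024   # 256 KiB stand-in for a stored object
PART_SIZE = 64 * 1024               # fetch in 64 KiB parts

def fetch_range(start: int, end: int) -> bytes:
    # A real client would issue: GET /key with "Range: bytes=start-end"
    return OBJECT[start:end]

def parallel_get(size: int, part_size: int, workers: int = 8) -> bytes:
    ranges = [(off, min(off + part_size, size))
              for off in range(0, size, part_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)

data = parallel_get(len(OBJECT), PART_SIZE)
```

Each range lands on different spindles because shards are spread across the fleet, so the parallel reads aggregate the IOPS of many slow disks into high single-object throughput.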
A significant architectural evolution for S3 was the introduction of strong read-after-write consistency in 2020. This guarantees that any read request issued after a successful write will always return the latest version of the object. The change was delivered without impacting performance, availability, or cost, reflecting a careful re-engineering of S3's metadata subsystem, which is maintained separately from the object data storage.
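The contract itself can be stated as a toy model: once a PUT is acknowledged, every subsequent GET must observe that write or a newer one. This single-node sketch is only an illustration of the guarantee's semantics; S3 enforces the same contract across a distributed metadata subsystem:

```python
# Toy model of read-after-write consistency: a PUT is acknowledged only
# once the write is visible, so any later GET returns the latest version.
import threading

class StrongStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # key -> (version, data)

    def put(self, key: str, data: bytes) -> int:
        with self._lock:  # acknowledge only after the write is visible
            version = self._objects.get(key, (0, b""))[0] + 1
            self._objects[key] = (version, data)
            return version

    def get(self, key: str) -> tuple:
        with self._lock:
            return self._objects[key]

store = StrongStore()
first = store.put("photos/cat.jpg", b"v1")
store.put("photos/cat.jpg", b"v2")
version, data = store.get("photos/cat.jpg")   # always the latest write
```

Under the earlier eventual-consistency model, the final GET could have legally returned `b"v1"` for a while; the 2020 change rules that out.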