Monday, May 13, 2019

Cloud Storage Architecture


A brief summary of cloud storage architecture:
There are two types of APIs that distributed storage vendors provide:
1. A file-system-compatible API, as in GFS, Hadoop HDFS, and Ali Pangu. Providing a file system API requires metadata servers, labeled Master or M in the following pictures.
2. An object store API or NoSQL API, as in Amazon S3, DynamoDB, and Couchbase. Such a system usually doesn't need metadata servers; instead it keeps a table recording the sharding information, and this table is replicated at every node and cached on the client side.
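
To make the difference between the two lookup paths concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's actual design: the `Master` and `ShardTable` classes, the file path, and the node names are all made up for this example, and hashing the key with MD5 modulo the shard count is an assumption.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical metadata server: file path -> ordered list of (chunk id, chunk server)."""
    files: dict = field(default_factory=dict)

    def locate(self, path, chunk_index):
        # A file system API has to ask the master where a given chunk of a file lives.
        return self.files[path][chunk_index]


@dataclass
class ShardTable:
    """Hypothetical sharding table: shard id -> node. Small enough to copy to every node and client."""
    version: int
    shard_to_node: list

    def node_for(self, key):
        # Assumption: the shard is chosen by hashing the key modulo the shard count.
        digest = hashlib.md5(key.encode()).digest()
        shard = int.from_bytes(digest[:4], "big") % len(self.shard_to_node)
        return self.shard_to_node[shard]


# File system style: the client consults the master for chunk locations.
master = Master(files={"/logs/a.log": [("chunk-17", "cs-2"), ("chunk-18", "cs-5")]})
print(master.locate("/logs/a.log", 1))

# Object store / NoSQL style: the client resolves the node from its cached copy of the table.
client_cache = ShardTable(version=1, shard_to_node=["n0", "n1", "n2", "n0", "n1", "n2"])
print(client_cache.node_for("user:42"))
```

The `version` field hints at how a client could detect a stale cached table and refetch it; that refresh protocol is left out of the sketch.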

At the bottom of the stack, these systems mostly use the Ext2/3 file system on the chunk servers or data nodes. Amazon has never published the S3 architecture; it could be built on top of volumes.

The architecture details are shown in the following graphs:

2003 Google File System: first published at SOSP 2003. It is the archetype of most modern cloud storage and file systems.



2011 Microsoft Azure: a more comprehensive paper about its cloud. Its stream layer is quite similar to GFS, with extent nodes taking the place of chunk servers:


Both DynamoDB and CouchDB use consistent hashing: a physical node has a set of virtual nodes, and each virtual node is assigned a range on the ring. All the virtual nodes in the same color are located on the same physical node. For simplicity, one can replace the consistent hashing algorithm with a simple modulo operation. Consistent hashing is a probabilistic scheme that balances load with high probability and moves the least data when scaling out. The modulo operation is simpler to manage, but it scales out cleanly only when the system doubles the number of nodes each time; otherwise most keys map to a different node after the change.
[Figure: DynamoDB consistent hashing ring]
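
The trade-off between the two placement schemes can be checked with a small experiment. The following Python sketch is a toy under stated assumptions, not the DynamoDB or CouchDB implementation: the `HashRing` class, the MD5-based hash, the 100 virtual nodes per physical node, and the node and key names are all chosen for illustration. It counts how many keys change owner when the cluster grows.

```python
import bisect
import hashlib


def h(key: str) -> int:
    """Stable 64-bit hash (MD5-based, so results do not vary between runs)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring: each physical node owns several virtual nodes."""

    def __init__(self, nodes, vnodes_per_node=100):
        self.vnodes_per_node = vnodes_per_node
        self._points = []  # sorted virtual-node positions on the ring
        self._owner = {}   # position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes_per_node):
            point = h(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key):
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, h(key)) % len(self._points)
        return self._owner[self._points[idx]]


def modulo_lookup(key, nodes):
    return nodes[h(key) % len(nodes)]


def moved_fraction(before, after):
    return sum(1 for k in before if before[k] != after[k]) / len(before)


keys = [f"object-{i}" for i in range(10_000)]
nodes = [f"node-{i}" for i in range(4)]

# Consistent hashing: grow from 4 to 5 nodes.
ring = HashRing(nodes)
ring_before = {k: ring.lookup(k) for k in keys}
ring.add_node("node-4")
ring_after = {k: ring.lookup(k) for k in keys}
ring_moved = moved_fraction(ring_before, ring_after)

# Modulo: grow from 4 to 5 nodes, then from 4 to 8 nodes (doubling).
mod_before = {k: modulo_lookup(k, nodes) for k in keys}
mod_plus_one = {k: modulo_lookup(k, nodes + ["node-4"]) for k in keys}
doubled = nodes + [f"node-{i}" for i in range(4, 8)]
mod_doubled = {k: modulo_lookup(k, doubled) for k in keys}
mod_moved = moved_fraction(mod_before, mod_plus_one)
mod_doubled_moved = moved_fraction(mod_before, mod_doubled)

print(f"ring, 4 -> 5 nodes:   {ring_moved:.1%} of keys moved")
print(f"modulo, 4 -> 5 nodes: {mod_moved:.1%} of keys moved")
print(f"modulo, 4 -> 8 nodes: {mod_doubled_moved:.1%} of keys moved")
```

With the ring, only keys that now belong to the new node move, which is roughly one fifth of them here. With modulo, going from 4 to 5 nodes remaps most keys, while doubling to 8 moves about half of them, which is the minimum needed to fill the new nodes.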