Monday, May 13, 2019

Cloud Storage Architecture


A brief summary of cloud storage architecture:
There are two types of APIs that distributed storage vendors provide:
1. A file-system-compatible API, such as GFS, Hadoop HDFS, and Ali Pangu. To provide a file system API, metadata servers are required, shown in the following pictures as Master or M.
2. An object store API or NoSQL API, such as Amazon S3, DynamoDB, or Couchbase. It usually doesn't need metadata servers; instead, a table recording the sharding information is replicated at every node and cached on the client side.

In the end, most of them use the Ext2/3 file system on the chunk servers or data nodes. Amazon has never published its S3 architecture; it could be built on top of volumes.

The architecture details are shown in the following graphs:

2003: the Google File System paper was first published at SOSP 2003. It is the archetype of most modern cloud storage and file systems.



2011: Microsoft Azure published a more comprehensive paper about its cloud storage. Its stream layer is quite similar to GFS, with extent nodes playing the role of chunk servers:


Both DynamoDB and CouchDB use consistent hashing: a physical node owns a set of virtual nodes, and each virtual node is assigned a range on the ring. All the virtual nodes of the same color live on the same physical node. One can replace the consistent hashing algorithm with a simple modulo operation for simplicity. Consistent hashing is a probabilistic model that gives load balancing with high probability and the least data movement while scaling out. The modulo operation gives good load balancing only if the system doubles the number of nodes each time, but it is simpler to manage.
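The virtual-node scheme can be sketched in a few lines. This is a minimal illustration, not any vendor's actual implementation; the node names, the MD5 ring function, and the 64-vnodes-per-node figure are arbitrary choices for the sketch:

```python
import bisect
import hashlib

def ring_point(key: str) -> int:
    """Map a string to a point on the ring using MD5 (an arbitrary choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """A ring of virtual nodes; each physical node owns several points."""

    def __init__(self, nodes, vnodes_per_node=64):
        self._ring = []  # sorted list of (point, physical_node)
        for node in nodes:
            for i in range(vnodes_per_node):
                self._ring.append((ring_point(f"{node}#vnode{i}"), node))
        self._ring.sort()

    def lookup(self, key: str) -> str:
        """Return the physical node owning the first vnode at or after the key's point."""
        point = ring_point(key)
        # "" sorts before any node name, so this finds the successor vnode.
        idx = bisect.bisect(self._ring, (point, ""))
        return self._ring[idx % len(self._ring)][1]
```

Adding a fourth node to a three-node ring reassigns only the key ranges that the new node's vnodes take over (roughly a quarter of the keys), which is the "least data movement" property mentioned above.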

Monday, May 29, 2017

Visiting Pigeon Point Light House

Piaw missed the ocean view Jacuzzi at the hostel, so we planned a trip to visit it. There are a few routes to get there. The one I like most starts from the intersection of 35 (Skyline Blvd) and 84 (La Honda Rd): follow La Honda Rd to the south, then change to Pescadero Creek Rd.

A lot of motorcyclists and cyclists start from that intersection. It is an unbeatable drive through the redwood forest. A bike tour there is on my to-do list!



There is a little farm called Harley Farms Goat Dairy along Pescadero Creek Rd. It is a sweet stop to rest and let the kids play around. Another nice stop is Arcangeli Grocery Co - Norm’s Market, which has the best artichoke bread, and there is a wonderful picnic area behind the shop too. The beach is not far from there. You need some luck to get a sunny, clear sky though.



Don't forget to make a reservation for the Jacuzzi if you stay overnight at the hostel at Pigeon Point Light House.

I have a previous post about visiting Half Moon Bay. If you stay overnight at the hostel, you can go up to visit Half Moon Bay on the second day. The related post is here:
http://xiaoqinsiliconvalley.blogspot.com/2015/08/


Friday, May 26, 2017

Hash or Sort

Last week, I read a blog post showing that, to find the unique elements in an array, sorting the array is faster than building a hash table. At first glance I thought, yeah, the random memory access cost is high. But when I sat down and wrote out the simple n log n formula for 1,000,000 elements, I could not believe that random memory access could be 20 times slower than sequential access.

I remeasured random memory access to exercise the cache miss penalty: it is a 2 or 3 times penalty compared with sequential memory access. So why is building a hash table so slow?

I redid the experiment with the following setup, where each element is 8 bytes:
1. Sorting the 1,000,000 elements takes K time.
2. Building a hash table from the same dataset with std::unordered_map takes 2K time.
3. After building the hash table, a complete lookup takes 3/4 K time.
4. My own hash table, similar to Google dense hash with % as the hash function, also takes 3/4 K time to build.
5. After optimizing the hash function by replacing % with the bit operation &, it takes 1/2 K time.
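The % to & replacement works because, when the table capacity is a power of two, h % capacity equals h & (capacity - 1), turning an integer division into a single AND. A minimal sketch of the equivalence (the function names and the capacity value are just for illustration):

```python
def bucket_mod(h: int, capacity: int) -> int:
    # General reduction: costs one integer division per lookup.
    return h % capacity

def bucket_and(h: int, capacity: int) -> int:
    # Power-of-two capacity only: a single AND, no division.
    assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
    return h & (capacity - 1)
```

This is why open-addressing tables like Google dense hash keep their capacity at a power of two: the bucket index comes from one cheap bit operation instead of a division.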

So the claimed poor performance of the hash table is a combination of a poor implementation, a poor hash function selection, and the random memory access effect.

The std::unordered_set really has poor performance, at least for 8-byte objects.

Next step: I will test the unordered_map performance with real (key, val) pairs where the val is large.

Wednesday, January 13, 2016

Fun Program to Count 1s

int count_1_bit(unsigned char b) {
    int tot = 0;
    if ((b & 1) != 0) tot++;
    if ((b & 2) != 0) tot++;
    if ((b & 4) != 0) tot++;
    if ((b & 8) != 0) tot++;
    if ((b & 16) != 0) tot++;
    if ((b & 32) != 0) tot++;
    if ((b & 64) != 0) tot++;
    if ((b & 128) != 0) tot++;
    return tot;
}

#define __const_hweight8(w)     \
    ((unsigned int)         \
     ((!!((w) & (1ULL << 0))) + \
      (!!((w) & (1ULL << 1))) + \
      (!!((w) & (1ULL << 2))) + \
      (!!((w) & (1ULL << 3))) + \
      (!!((w) & (1ULL << 4))) + \
      (!!((w) & (1ULL << 5))) + \
      (!!((w) & (1ULL << 6))) + \
      (!!((w) & (1ULL << 7)))))


static inline unsigned int hweight8(unsigned int w)
{
        unsigned short res = (w & 0x55) + ((w >> 1) & 0x55);
        res = (res & 0x33) + ((res >> 2) & 0x33);
        return (res & 0x0F) + ((res >> 4) & 0x0F);
}
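As a sanity check, here is a small Python port of the parallel-add (SWAR) version; it should agree with simply counting the 1 characters in the binary string for every 8-bit value:

```python
def hweight8(w: int) -> int:
    """Parallel bit count for an 8-bit value, mirroring the C hweight8 above:
    first sum adjacent 1-bit fields, then 2-bit fields, then 4-bit fields."""
    res = (w & 0x55) + ((w >> 1) & 0x55)
    res = (res & 0x33) + ((res >> 2) & 0x33)
    return (res & 0x0F) + ((res >> 4) & 0x0F)
```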

Sunday, November 8, 2015

Photo Processing Toy

These days when we travel, we get photos from my phone, my camera, friends' phones, and friends' cameras. Each device has its own naming scheme, so it is hard to sort the pictures from all devices by name. Luckily, picture files have metadata that captures the time when the picture was taken. I dug into the metadata and extracted both the time and the place each picture was taken. By putting the date and place information into the pictures' file names, they can not only be sorted easily, but the places I have traveled also show up clearly. While organizing my pictures, I was surprised to find that one camping place I thought was in Capitola near Santa Cruz is in fact in Aptos. So if you would like to know the exact place a picture was taken, this is really helpful.

While I was searching for a good app to mark pictures on a map, Instagram claimed it had the functionality. However, when I tried it, it turned out to be a lame product. Instagram doesn't use the pictures' location information at all; instead it uses the location where the user shares the pictures. If you take a bunch of pictures on a trip and share them later when you get home, the home location is used to mark the pictures on the map :(

In this toy tool, I also mark the places extracted from the pictures on a map by scanning the pictures' GPS information. At the end, a nice trip journal map is generated. Clicking on a marker pops up the picture taken there.

To use the software, first install Python and the PIL library, then run the program with the absolute path of the album.
The program will rename all pictures in the album with date and place information, if that information is available in the metadata.
At the end, a tripmap.html file is generated that can be viewed in a browser. All the places where pictures were taken are marked on the map, and clicking an icon pops up the picture taken at that place.
Example (B:\test is the absolute path of the picture directory; -rTrue renames the files, while -rFalse skips the renaming but still generates the trip map with picture markers):
python photo.py -iB:\test -rTrue
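The renaming idea boils down to turning the EXIF DateTimeOriginal string ("YYYY:MM:DD HH:MM:SS") into a sortable prefix for the file name. A minimal sketch of that step; the function names and the exact prefix format here are my own illustration, not the actual photo.py code:

```python
def date_prefix(exif_datetime: str) -> str:
    """Turn an EXIF 'YYYY:MM:DD HH:MM:SS' string into a sortable prefix."""
    date_part, time_part = exif_datetime.split(" ")
    return date_part.replace(":", "-") + "_" + time_part.replace(":", "")

def renamed(original: str, exif_datetime: str) -> str:
    """Prepend the date prefix so files from all devices sort chronologically."""
    return f"{date_prefix(exif_datetime)}_{original}"
```

Because the prefix is zero-padded year-month-day followed by the time, a plain lexicographic sort of the renamed files is also a chronological sort, regardless of which device produced the original name.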

The code is located at:
https://github.com/xiaoqin2012/release_photo

This is one output example from a trip in Europe:
This is for the sailing trip from Victoria to Desolation Sound in Canada:

Sunday, October 11, 2015

Visiting British Columbia Day 4: Sailing to Bowen Island

Yes, there is a Bowen Island. Bowen says, "It is a giant island!" Believe it or not, go check it out yourself. It is a pretty island. If Bowen wants to visit it again, I will.














Friday, October 9, 2015

Be Cautious about the Storage used by Backup

The files on my PC are massive and unorganized, and it takes some effort to manage them. To do some cleanup, I wrote a simple program to find all the duplicate files and tested it with the camera roll folder at OneDrive/pictures. I thought it would be a trivial test. What surprised me is the following result:
total number of files: 33347, space occupied: 15.56GB
total number of duplicates: 613, space occupied: 1.62GB

About 10% of the space is occupied by duplicates, even in the camera roll backup! My camera could not have taken pictures that are exactly the same. I inspected both the album on my phone and the camera roll folder on OneDrive. The duplicates don't exist in the album on my phone at all, but the camera roll backup on OneDrive does show a lot of duplicates. It must be something wrong with OneDrive's backup protocol, or a bug in the code. These duplicates are all picture files. I don't have an enterprise account, but people need to be careful about backups for enterprise storage too.

Out of curiosity, I ran the program on the gdrive photo folder too. The gdrive photo folder doesn't have duplicates. But I could not open/read gdocs and gsheets at all from the PC. If one day you don't have internet access, the local gdrive copy is not useful at all!

The duplicate picture files in Camera Roll have similar names: one is DSC06220 1, the other is DSC06220_1. I guess in my case it is caused by an intermittent network connection, as the affected files look random; it doesn't happen to all the pictures taken at the same time or on the same day.

There is a reason to keep files that accidentally share a name: if one changes the phone's SD card and starts taking new pictures, the names will be reused. However, things can get worse: if someone just takes the SD card out and plugs it back in again, one ends up with all the duplicate files on OneDrive. The backup should do something smarter, such as inspecting the contents or computing a checksum, to avoid storing exactly the same files/pictures.
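A content-checksum duplicate finder can be sketched in a few lines. This is a minimal illustration using SHA-256, not the exact program I used (the function names are for the sketch only):

```python
import hashlib
import os

def file_digest(path: str) -> str:
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: str):
    """Group files under root by content digest; return groups with 2+ files."""
    by_digest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest.setdefault(file_digest(path), []).append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Hashing the contents catches pairs like DSC06220 1 and DSC06220_1 no matter how the names differ; a backup service could run the same check before uploading a second copy.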

If you are paying for backup storage, watch out for the space occupied by duplicates. You can download my code for checking duplicates on Windows from GitHub.