Thursday, May 3, 2012

What real online storage should be

Back in 2004, I was looking for an online storage product, but I didn't find one. After a while, Google launched Google Docs. I was happy with it at the beginning, as it allowed me to access my files anywhere as long as I had internet access. As I put more files online, I grew frustrated because the user interface of Google Docs was not friendly at all. Also, I could not edit my files without internet access, and the space was not big enough to store all my pictures. After a few frustrations caused by intermittent internet loss, I finally gave up and switched to SkyDrive.

I have a free Dropbox account, but I don't use it much. I got to know it because my friends tried to share files with me. My hesitation is due to its strange synchronization protocol and its space usage. Piaw and I shared a file on it and edited the file at the same time. We ended up with a lot of versions of the file, and I could not tell which one to pick or how to merge them. It was really frustrating. I would prefer a simple locking algorithm: when one user opens the file for writing, lock the file and don't give other users write permission until the lock is released. Furthermore, all versions of the file occupy both the online storage space and the local storage space. So I mark Dropbox as a file sharing tool, not a real online storage solution.

Google just launched gdrive, and the experience of sharing files there is even more hilarious. First one receives an email notification about the shared file, and one needs to click the link in the email to accept the sharing. However, the story doesn't end here. The shared files are not in My Drive yet; I still need to drag them to the My Drive folder in the web browser before they show up in the local copy of the My Drive folder. It doesn't do anything particular for concurrent writes either; it just reports an unresolvable error. The local SkyDrive folder, meanwhile, doesn't contain the shared folder at all, so one can only open the shared files in the web browser. It also means that when there is no internet access, the shared files are not accessible at all. Oh my, what is the local SkyDrive folder for then!
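The simple locking algorithm I would prefer can be sketched in a few lines. This is only a minimal illustration, not how Dropbox or any other product works: the class name `FileLockManager` and its methods are hypothetical, and the lock table lives in memory on a single machine, whereas a real service would keep it on the server and handle clients that disappear without releasing the lock.

```python
class FileLockManager:
    """Hypothetical in-memory single-writer lock table, keyed by file path."""

    def __init__(self):
        # path -> user who currently holds the write lock
        self._writers = {}

    def open_for_write(self, path, user):
        """Return True if `user` now holds the write lock on `path`."""
        holder = self._writers.get(path)
        if holder is None or holder == user:
            self._writers[path] = user
            return True
        # Someone else is writing; the caller should fall back to read-only.
        return False

    def release(self, path, user):
        """Release the lock, but only if `user` is the one holding it."""
        if self._writers.get(path) == user:
            del self._writers[path]
            return True
        return False


locks = FileLockManager()
print(locks.open_for_write("shared.txt", "piaw"))  # first writer gets the lock
print(locks.open_for_write("shared.txt", "me"))    # second writer is refused
```

With this in place there is only ever one writable copy, so there are no competing versions to merge afterwards. The price is that the second user has to wait.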

For real online storage products, I expect:
1. All my files are safely stored. It is not a backup of my local files; it holds the complete collection of all my files. Technically, the online storage is a layer behind the local storage, and the local storage acts as a cache layer for the online storage.
2. If there is internet access, I can access my files, so I don't need to carry a copy.
3. If I change a file while I am offline, the changes are synced transparently when internet access comes back, and I don't need to do it manually.
4. It is easy to share. Others can easily access my files on the web. Files can be written semi-concurrently at least. The synchronization work should be completely transparent to the user.
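To make expectation 3 a bit more concrete, here is a rough sketch of what "transparent" offline syncing could look like. Everything in it is an assumption for illustration: `OfflineSyncQueue`, `save`, `go_online`, and the `upload` callable are hypothetical names, not the API of any real product. Changes made offline are queued locally, only the latest version of each file is kept, and the queue is flushed as soon as the connection comes back.

```python
from collections import OrderedDict


class OfflineSyncQueue:
    """Hypothetical queue that holds offline edits until the network returns."""

    def __init__(self, upload):
        self._upload = upload          # assumed callable: upload(path, data)
        self._pending = OrderedDict()  # path -> latest unsynced data
        self.online = False

    def save(self, path, data):
        """Save a file; upload right away if online, otherwise queue it."""
        if self.online:
            self._upload(path, data)
        else:
            # Keep only the newest version and move it to the end of the queue.
            self._pending[path] = data
            self._pending.move_to_end(path)

    def go_offline(self):
        self.online = False

    def go_online(self):
        """Flush queued changes, oldest first, without user involvement."""
        self.online = True
        while self._pending:
            path, data = self._pending.popitem(last=False)
            self._upload(path, data)


uploaded = []
queue = OfflineSyncQueue(lambda path, data: uploaded.append((path, data)))
queue.save("notes.txt", "draft 1")
queue.save("notes.txt", "draft 2")  # replaces the queued draft 1
queue.go_online()
print(uploaded)
```

The sketch ignores upload failures and conflicts with changes made by other people in the meantime, which is exactly where the real difficulty lies.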

For now, none of Dropbox, gdrive, or SkyDrive does all of this. SkyDrive keeps a complete copy of all files locally. If one buys 1TB of space, one will have a complete 1TB copy locally. This idea is just so wrong. For both Dropbox and gdrive, one can specify which folders are synced locally. There is a dilemma here. If one doesn't specify any folders, the default is a complete local copy, which may consume a lot of space. If one does specify which folders to sync, then every time a new folder is created or shared, one has to decide whether it should be synced locally and do some work to set that up. If one file or folder occupies 1TB of space, one doesn't want a complete 1TB copy of it locally. So the user needs to think hard to decide which files or folders to cache. The sharing implementation is quite awkward in both SkyDrive and gdrive. Overall, Dropbox is better than gdrive and SkyDrive, but it is just a sharing tool, not really an online storage solution.

Theoretically, I would imagine an online storage solution as adding a layer to the system. The local storage acts as the cache layer for the online storage, and the system now has five layers: CPU, cache (L1, L2, L3 caches), RAM, local disk, online storage. The relationship of local storage to online storage is like that of memory to local disk. If internet access is not available, one cannot access the files that are not cached locally. The challenge is to design an intelligent caching algorithm that gives users a more pleasant experience when the internet is not accessible. The caching algorithm can combine MRU (most recently used) and MA (most accessed), and prefetch files according to the user's access pattern. Managing shared files and dealing with concurrent writes is a classical problem, and there are many solutions out there, like a single-writer locking protocol or a snoopy protocol.
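As a rough illustration of combining "most recently used" and "most accessed", here is a small sketch of a local cache that gives each file a score and evicts the lowest-scoring file when the local disk budget is exceeded. The class `LocalCache`, the linear scoring formula, and the weights are all my own assumptions for the example, not a tested design, and prefetching is left out entirely.

```python
import itertools


class LocalCache:
    """Hypothetical local cache that keeps recent and frequently used files."""

    def __init__(self, capacity_bytes, recency_weight=1.0, frequency_weight=1.0):
        self.capacity = capacity_bytes
        self.recency_weight = recency_weight
        self.frequency_weight = frequency_weight
        self._clock = itertools.count(1)  # logical time, one tick per access
        self._files = {}                  # path -> [size, last_tick, count]

    def _score(self, entry, now):
        _size, last_tick, count = entry
        age = now - last_tick
        # Assumed formula: frequent files score high, stale files score low.
        return self.frequency_weight * count - self.recency_weight * age

    def access(self, path, size):
        """Record an access, caching the file and evicting others if needed."""
        now = next(self._clock)
        entry = self._files.get(path)
        if entry is None:
            self._files[path] = [size, now, 1]
        else:
            entry[1] = now
            entry[2] += 1
        self._evict(now, keep=path)

    def _evict(self, now, keep):
        # Drop the lowest-scoring files until the cache fits the budget again.
        while self.used() > self.capacity:
            candidates = [p for p in self._files if p != keep]
            if not candidates:
                break
            victim = min(candidates, key=lambda p: self._score(self._files[p], now))
            del self._files[victim]

    def used(self):
        return sum(entry[0] for entry in self._files.values())

    def cached(self):
        return set(self._files)


cache = LocalCache(capacity_bytes=2)
for path in ["a", "a", "a", "b", "c"]:
    cache.access(path, size=1)
print(cache.cached())
```

In this run "a" stays cached even though "b" was touched more recently, because "a" was accessed three times. A pure recency policy would have thrown "a" away instead.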