Friday, October 9, 2015

Be Cautious about the Storage used by Backup

The files on my PC are numerous and unorganized, and it takes some effort to manage them. To do some cleanup, I wrote a simple program to find all the duplicate files and tested it on the Camera Roll folder under OneDrive/Pictures. I expected it to be a trivial test. What surprised me was the following result:
total number of files: 33347, space occupied: 15.56GB
total number of duplicates: 613, space occupied: 1.62GB
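
For readers who want to see the idea, here is a minimal Python sketch of how such a duplicate finder could work. It is not the actual program from my GitHub; the function names (`file_digest`, `find_duplicates`, `wasted_space`) and the choice of SHA-256 are assumptions made for illustration. It groups files by size first, then hashes only the files that share a size.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


def wasted_space(groups):
    """Return (extra copies, bytes they occupy), keeping one copy per group."""
    extra = sum(len(g) - 1 for g in groups)
    size = sum(os.path.getsize(g[0]) * (len(g) - 1) for g in groups)
    return extra, size
```

Calling `wasted_space(find_duplicates(folder))` on a folder gives the kind of "number of duplicates, space occupied" summary shown above, counting only the extra copies.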

About 2% of the files, taking up about 10% of the space, are duplicates, even in a backup of the camera roll! My camera cannot take two pictures that are exactly the same. I inspected both the album on my phone and the Camera Roll folder on OneDrive. The duplicates don't exist in the album on my phone at all, but the Camera Roll backup on OneDrive does show a lot of them. There must be something wrong with OneDrive's backup protocol, or a bug in its code. The duplicates are all picture files. I don't have an enterprise account, but people should be careful about backups on enterprise storage too.

Out of curiosity, I ran the program on the gdrive photo folder too. It doesn't have any duplicates. However, I could not open or read gdocs and gsheets from the PC at all. If one day you don't have internet access, the local gdrive folder is of no use for those files!

The duplicate picture files in Camera Roll have similar names: one is DSC06220 1, the other is DSC06220_1. I guess that in my case this is caused by an intermittent network connection, as the affected files look random. It doesn't happen to all the pictures taken at the same time or on the same day.
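
A naming pattern like this can be used to spot likely duplicate pairs quickly. The sketch below is only a heuristic I am assuming for illustration: `base_name` and `group_by_base_name` are made-up helpers, and the rule "a space or underscore followed by one or two digits at the end of the name" is a guess based on the two names above. Matching names only flag candidates; comparing the contents is still needed to confirm.

```python
import os
import re
from collections import defaultdict


def base_name(filename):
    """Drop the extension and a trailing ' 1' / '_1' style counter."""
    stem, _ext = os.path.splitext(filename)
    # Only strip a short counter, so date-like names such as
    # IMG_20150101 are left untouched.
    return re.sub(r"[ _]\d{1,2}$", "", stem)


def group_by_base_name(filenames):
    """Return {base name: [file names]} for base names seen more than once."""
    groups = defaultdict(list)
    for name in filenames:
        groups[base_name(name)].append(name)
    return {base: names for base, names in groups.items() if len(names) > 1}
```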

There is a reason to keep files that happen to have the same name. For example, if one changes the phone's SD card and starts taking new pictures, the names will be reused. However, things can get worse: if someone just takes out the SD card and plugs it back in, one ends up with all the files duplicated on OneDrive. OneDrive should do something smarter, such as inspecting the contents or computing a checksum, to avoid storing exactly the same files/pictures.
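
To make the checksum idea concrete, here is a toy Python sketch of a store that keeps one copy per distinct content. It says nothing about how OneDrive actually works; the `DedupStore` class and its methods are hypothetical, and SHA-256 is just one reasonable choice of checksum. Reused names with different contents are both kept, while identical contents are stored only once no matter how many names point at them.

```python
import hashlib


class DedupStore:
    """Toy in-memory backup store that deduplicates by content (hypothetical)."""

    def __init__(self):
        self.blobs = {}       # checksum -> file contents, stored once
        self.entries = set()  # (file name, checksum) pairs seen so far

    def upload(self, name, data):
        """Record the file; return True only if new bytes had to be stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.entries.add((name, digest))
        return is_new

    def stored_bytes(self):
        """Total space used by the distinct contents."""
        return sum(len(data) for data in self.blobs.values())
```

With this design, re-uploading the same picture as DSC06220 1 and DSC06220_1 costs no extra space, while a new picture that reuses an old name after an SD card change is still stored.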

If you are paying for backup storage, watch out for the space occupied by duplicates. You can download my code to check for duplicates on Windows from GitHub.

