Cold Storage in the context of Facebook

So one of the great things about returning to university, particularly a university in North America, is the number of opportunities to connect with the leading tech companies. Giants like Facebook, Microsoft, Google, and EA all recruit here at UBC, which gives you something to aim for once you graduate. In part that’s down to proximity: all of the above companies have their main engineering offices in North America, so inevitably that’s where they recruit from. Last Tuesday, I attended a talk by one of Facebook’s engineers on cold storage. It’s a topic I knew very little about before, but it’s hugely topical given the rate of data generation and the cost of storage, so I thought it was worth a share.

Cold Storage talk by Facebook – I “happened” to be near all the food

Disclaimer: This is based on my perception of (what seemed to me) a fairly complex talk so some of the details may not be correct and will almost certainly be a gross simplification of how they do things at Facebook but hopefully the general ideas still hold.

Without further ado…

The Problem: What is Cold Storage?
Cold storage is, as the name suggests, about storing cold data. We distinguish between hot and cold data by how quickly it needs to be accessible, which in turn comes down to how frequently it is used. In the context of Facebook, hot data will be things like those photos you uploaded earlier this week or that last status update you made. Cold data, on the other hand, might include those random photos from that night out (you know the one) that you uploaded three years ago and which no one commented on or ever viewed (how sad). Facebook has algorithms to determine whether data is cold or hot.

Since Facebook’s business revolves around their users’ data, and none of it ever gets deleted unless you ask them to (i.e. you choose to delete some of your data yourself), how cold data gets stored is a critical question. It needs to be stored so that it is reasonably accessible (cold data doesn’t need to be as quickly accessible as hot data since it’s used infrequently) and very redundant (i.e. backed up), and all of that has to be balanced against cost, especially when you hear that Facebook’s users create many millions of pieces of data each day and that all of it has to be kept for as long as those users want it (i.e. indefinitely).

And add to that yet another consideration: speed. Or more specifically, the speed at which a failed drive can be rebuilt. When your operation is as big as Facebook’s, you end up having hard drives fail frequently, and of course these ideally need to be rebuilt faster than they fail. I think one of the facts we heard during the talk was that Facebook expects a hard drive to fail once every 18 minutes. And given that one of Facebook’s cold storage data halls can hold on the order of an exabyte of data, that’s a lot of data that may need to be rebuilt, and pretty quickly too. So efficiency is another consideration when dealing with the problem of cold storage.

A row in one of Facebook’s data centers

So how does a company like Facebook solve the problem of cold storage?

It gets a bit technical but try to stay with me. I’ll do my best to keep it as conversational as possible. But if you’re reading this, you probably want all the gory technical details!

The fundamental principles of data integrity and redundancy
We’re all familiar with the need to back up our data. Every day, software like Time Machine on OS X or Backup and Restore on Windows handles the nuts and bolts of that for you.

Once you look closer, it turns out that there are a number of ways to ensure data is sufficiently redundant. One easy but expensive way is full-scale replication: you replicate all of the data bit for bit and store it on a different hard drive/computer/location. But doing that for any given file means the cost scales as nx, where n is the number of copies of the file and x is the cost to store and maintain one copy.

Besides, storing a single complete duplicate of a file on another drive isn’t particularly redundant. If that drive and the original one both go down, that’s it. You’re done. No way of recovering your data.
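
To put some (entirely made-up) numbers on that, here’s a quick sketch of how the cost and fault tolerance of full replication scale:

```python
# Toy illustration of the cost of full replication (all numbers are hypothetical).
file_size_mb = 3       # size of one file
copies = 3             # n: total copies kept (the original plus two replicas)
cost_per_mb = 0.01     # x: hypothetical cost to store one MB

total_storage_mb = file_size_mb * copies        # 9 MB on disk for 3 MB of data
total_cost = total_storage_mb * cost_per_mb     # the n * x scaling mentioned above

# With n full copies you can lose at most n - 1 of them before the data is gone.
tolerable_losses = copies - 1

print(f"Storage used: {total_storage_mb} MB ({copies}x overhead)")
print(f"Cost: ${total_cost:.2f}, survives losing {tolerable_losses} copies")
```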

It turns out to be much better to split files into chunks and store each chunk on a different drive. This also has the added advantage of improved read and write times, since lots of machines can perform a task quicker than just a few: the power of distributed computing.
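
As a rough sketch of that idea (the fixed chunk size and round-robin placement here are my own simplifications, not necessarily how Facebook does it):

```python
# Minimal sketch: split a file into chunks and spread them across drives
# so that reads and writes can be served by many drives in parallel.

def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut the raw bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def distribute(chunks: list[bytes], num_drives: int) -> dict[int, list[bytes]]:
    """Place chunks on drives round-robin (a deliberately naive placement policy)."""
    drives: dict[int, list[bytes]] = {d: [] for d in range(num_drives)}
    for i, chunk in enumerate(chunks):
        drives[i % num_drives].append(chunk)
    return drives

data = b"some file contents " * 100
layout = distribute(split_into_chunks(data, chunk_size=256), num_drives=4)
for drive, stored in layout.items():
    print(f"drive {drive}: {len(stored)} chunks")
```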

But how do chunks solve the issue of the nx cost associated with duplicate copies? That’s where Reed-Solomon encoding comes in. This breaks up data into chunks where r chunks hold the actual data and k chunks hold parity (extra information that lets us detect and reconstruct missing pieces). The idea is that you can lose up to k chunks and still be able to reconstruct the file. For example, with a (10, 4) Reed-Solomon encoding you can lose any 4 of the 14 chunks, whether parity or data, and still rebuild the file. That means full-scale replication, with the big costs that come with it, is no longer necessary. Reed-Solomon encoding provides significantly improved redundancy at a fraction of the price (the exact saving depends on the values chosen for r and k).
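
Real Reed-Solomon encoding involves finite-field arithmetic, so here’s a deliberately simplified single-parity toy instead: XOR the data chunks together to get one parity chunk, and any one lost chunk can be rebuilt from the survivors. A (10, 4) code generalizes the same idea to tolerate four lost chunks.

```python
# Toy erasure-coding demo: r = 4 data chunks, k = 1 parity chunk (XOR of the data).
# This is NOT real Reed-Solomon, but it shows how parity lets you rebuild a lost chunk.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def xor_all(chunks: list[bytes]) -> bytes:
    """XOR a list of equal-length chunks together."""
    result = chunks[0]
    for chunk in chunks[1:]:
        result = xor_bytes(result, chunk)
    return result

data_chunks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # the actual data
parity = xor_all(data_chunks)                        # the parity chunk

# Pretend chunk 2 was on the drive that failed: rebuild it from the survivors + parity.
survivors = data_chunks[:2] + data_chunks[3:]
rebuilt = xor_all(survivors + [parity])
assert rebuilt == data_chunks[2]

# Storage overhead: (r + k) / r for erasure coding versus n for full replication.
print("(10, 4) Reed-Solomon overhead:", (10 + 4) / 10)   # 1.4x
print("3-way replication overhead:", 3)                  # 3x
```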

Another stumbling block
That’s all very well and good, but if different chunks of files are stored on different drives to ensure redundancy, how do we know where the chunks of a file live when we need to rebuild them after a drive failure? Well, we’ll need databases to keep track of that. But if a 3 MB file is split into 14 chunks, and a single failed drive holds chunks belonging to millions of files, that’s a lot of database records to look up and change once the data on that drive is rebuilt from each file’s remaining chunks on other drives.

The solution to that is volumes, which come in two flavours, logical and physical, both of which are basically just groupings of chunks. This means that instead of updating a record in a database to say that chunk103939 is now in physical volume 1, you have one record saying that chunk103939 is in logical volume 1 (in one table) and another saying that logical volume 1 is in physical volume 2 (in a second table, and that’s the one you update when a physical drive fails).

So you have orders of magnitude fewer records to update should a disk fail: you just change the records specifying which physical volume a logical volume is located on (remember that a logical volume represents many chunks from many files).
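
A minimal sketch of that two-level lookup (the chunk and volume names are made up for illustration):

```python
# Table 1: which logical volume each chunk lives in. Huge, but it rarely changes.
chunk_to_logical = {
    "chunk103939": "logical_vol_1",
    "chunk103940": "logical_vol_1",
    "chunk205551": "logical_vol_7",
}

# Table 2: which physical volume each logical volume lives on. Comparatively tiny.
logical_to_physical = {
    "logical_vol_1": "physical_vol_2",
    "logical_vol_7": "physical_vol_5",
}

def locate(chunk_id: str) -> str:
    """Resolve a chunk to its physical volume via its logical volume."""
    return logical_to_physical[chunk_to_logical[chunk_id]]

# When a physical drive fails and logical_vol_1 gets rebuilt elsewhere,
# only the small second table needs updating; no per-chunk records change.
logical_to_physical["logical_vol_1"] = "physical_vol_9"
print(locate("chunk103939"))   # -> physical_vol_9
```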

The final problem to solve, then, is where to locate volumes. You want volumes to be allocated efficiently across drives to ensure performance; you don’t want to put all your volumes onto one drive, since that puts all the load on one physical drive while other drives spin idly (the distributed computing advantage is lost).

You could approach this problem in a number of ways:

i) Best fit solution
This involves examining the drives and working out exactly which one each volume should be placed on. It gives the best allocation but is expensive to compute.

ii) Random Allocation
Much less expensive and a reasonable solution, it involves randomly allocating volumes to drives.

It turns out that a variation of ii) is the most efficient and effective solution. Completely random allocation is bad in the sense that you will inevitably end up with some drives at maximum capacity while others are nearly empty. Better still is “random choose two” allocation.

This involves randomly selecting two drives and then placing the volume on the emptier of the two. This results in a fairly even distribution of volumes across all drives: sigma, the standard deviation of drive load, drops significantly. Interestingly enough, increasing the number of choices (random choose three, etc.) doesn’t improve the allocation much further, so it isn’t worth the extra cost.
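
Here’s a quick toy simulation (my own model, not Facebook’s actual allocator) that shows why two random choices flatten the distribution so dramatically, and why a third choice adds little:

```python
# Compare purely random placement with "random choose two": pick two drives
# at random and put the volume on the emptier one.
import random
import statistics

def place_volumes(num_drives: int, num_volumes: int, choices: int) -> list[int]:
    drives = [0] * num_drives
    for _ in range(num_volumes):
        candidates = random.sample(range(num_drives), choices)
        target = min(candidates, key=lambda d: drives[d])   # emptiest of the sampled drives
        drives[target] += 1
    return drives

random.seed(42)
for choices in (1, 2, 3):
    loads = place_volumes(num_drives=100, num_volumes=10_000, choices=choices)
    print(f"{choices} choice(s): std dev of drive load = {statistics.pstdev(loads):.2f}")

# Going from 1 to 2 choices cuts the spread dramatically; 2 -> 3 helps far less.
```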

And there you have it, a whistle-stop tour through cold storage. I hope I’ve given you a little taste of the problems in this domain because they are certainly fascinating ones.

And it being Facebook, the cold storage team got their project (implementing these principles) up and running in just 8 months.

Move fast, break things. I get that! Now that’s a company I’d like to work for.

Corrections to the technical stuff above on a postcard please (or just leave a comment below). I’ll work on taking better photos next time!
