The Best Laid Plans Gang Aft Agley (often go astray)

I started the year with the best of intentions for this blog and then let it slide completely – my bad. When I left you last, it was the new year and I had a couple of big goals – the most important of which was landing an internship for the summer at a top-tier company.

And wow, that was an adventure to say the least.

Applying for a summer internship

Being based outside of the USA but interested in internships there made the application process hard. It wasn’t always easy to figure out who would sponsor a J1 visa for the internship and who wouldn’t (apart from the really big companies – they all do) and so there was a lot of wasted time with applications.

Even before that, though – and this is the reason I took a break from writing this blog – I had an offer fall through after putting all my eggs in one basket, so to speak. At the end of December last year, I found a really exciting top-incubator-backed startup that I was interested in, so I pulled out all the stops to get an internship there. When they asked for an answer to a screening coding problem, I gave them solutions in three different programming languages. So yeah, I really went all in – even working on the take-home project on a sunny beach during winter break and fully testing everything in a new language and framework that I knew fit the task but which I hadn’t previously used.

Don’t count your chickens

And so… I was super stoked to hear in early January that I had an offer from them, and was pleased that my internship search had finished. But on the same day I received a congratulatory fruit basket, the company withdrew the offer because they didn’t want to sponsor my J1 intern visa. And that sucked. Mostly because I heard back via email rather than a phone call, but also because I was eating a chocolate-covered strawberry (from the aforementioned fruit basket) when I got the message. And above all, because I’d been given lots of reassurance throughout the interview process that getting a visa for the internship wouldn’t be a problem.

But hey, I get it. It’s a startup and things change. You’ve got to go with the flow. And so the real search began.

Starting the internship search again

At this stage, I had a couple of interviews with top-tier companies lined up, which I totally bombed because I’d spent so much time focusing on getting the job at the startup. Not great for morale by any means. Desperate times call for desperate measures, so I stopped going to a lot of my classes to focus entirely on interview prep (more on that in another post) and applications. I knew I had a really good GPA and could afford to let classes slide a bit.

I sent out a bunch of applications to companies I was interested in. Lots of replies said they weren’t hiring students from Canada just yet; there were also a few interview invitations and a couple of rejections. It was really interesting to see companies take very different approaches to resume screening. A few were genuinely open to non-American schools other than Waterloo (a top-tier CS school based in Ontario, Canada); UBC is a good school, maybe ranked 30 or 40 in the world (whatever that means), but not yet a great one. Equally, there were a few well-known companies that used keyword scanning and sent out automated rejections – there’s no way a human screened my resume at 11pm, within an hour of me submitting it! The same thing happened to all of my classmates, some of them more qualified than me I’m sure, and that’s just lame.

And so I kept plugging away. I got pretty far with a well-known consumer review company, but since it was late in the application process, they ended up adding extra rounds of interviews to the loop (without telling me!) and in the end it came down to headcount – I think. It felt like they started screening people out not on whether they passed the interviews, but on who was perceived as the “best” of the candidates remaining. My resume is interesting – I’ve been to a top-tier school, Cambridge, but not for CS – and it’s hard for a UBC CS student to go head-to-head on paper with one from a CS school like Carnegie Mellon. And that kind of sucked, because it felt like being objectively good enough to intern at the company wasn’t sufficient; you had to be better than the other candidates (or at least be perceived that way). But their loss, perhaps. In any event, my interest was dampened by the acqui-hire who interviewed me in the penultimate round and who really didn’t sound interested in working for the tech company at all! Not exactly a good sign.

But just when things started to look a little bleak, I received three offers in the space of a week and a bit. Yep. I had offers from a pre-IPO Canadian company, a real estate tech company, and… Facebook.

The big two year bet

It’s funny. I set out on this Bachelor of Computer Science program at UBC two years ago with the sole aim of being able to get a job at Facebook (an internship being a great route in), and I almost didn’t even apply there. I wasn’t sure if a big company would be a good fit, since I’d really enjoyed working at Axiom Zen for my first internship and wasn’t a huge fan of the bureaucracy of the big company I’d worked at before going into tech.

Facebook was the last internship offer I received, and it was actually the first offer, from the Canadian company, that motivated me – or maybe just gave me the confidence – to go for it. I mean, what better way to see what a company is like than to do an internship there, and well, it’s Facebook. And if you fail, so what? You have options. Nothing to fear here. Move along.

The resolution

I’ll save the application details for another post perhaps, but once I received the Facebook internship offer, I accepted it pretty much immediately. Truth be told, I was leaning towards the real estate tech company until the very end because of the small-team vibe it offered and because I could see real value in their product. But when you get the chance to see what software engineering is like at a tech giant – one whose products you use day in, day out, which you set out to work for two years ago, and which is the reason you realized you needed a CS degree in the first place (the Tower of Hanoi question I reference in this post was one I tried and failed to solve at the end of Makers Academy, and it turned out to be a sample interview question on Facebook’s career page) – you’ve got to take it.

What now? Well, I have a bunch of unpublished posts that I wrote over the last few months which I’d like to clean up and publish. A couple on how the rest of the academic term panned out, why a CS degree was necessary for getting an internship at Facebook (a bootcamp is DEFINITELY not enough, don’t believe what they tell you), and some things I’ve learned over the last 8 months or so about how to get an internship and then make the most of it.

I’d better get moving. My last term (at least for fulfilling my program requirements) starts at UBC in a few days.


Cold Storage in the context of Facebook

So one of the great things about returning to university, particularly university in North America, is the number of opportunities to connect with the leading tech companies. Giants like Facebook, Microsoft, Google, and EA all recruit here at UBC, which is great in giving you something to aim for once you graduate. In part that’s down to proximity: all of the above companies have their main engineering offices in North America and, inevitably, that’s where they recruit from. Last Tuesday, I attended a talk by one of Facebook’s engineers on cold storage. It’s a topic I knew very little about before, but it’s hugely topical given the rate of data generation and the cost of storage, so I thought it was worth a share.

Cold Storage talk by Facebook – I “happened” to be near all the food

Disclaimer: This is based on my perception of (what seemed to me) a fairly complex talk, so some of the details may not be correct and will almost certainly be a gross simplification of how they do things at Facebook, but hopefully the general ideas still hold.

Without further ado…

The Problem: What is Cold Storage?
Cold storage is, unsurprisingly, about storing data that is cold. We distinguish between hot and cold data by how quickly it needs to be accessible, which in turn comes down to how frequently it is used. In the context of Facebook, hot data will be things like the photos you uploaded earlier this week or that last status update you made. Cold data, on the other hand, might include those random photos from that night out (you know the one) that you uploaded three years ago and which no one ever commented on or viewed (how sad). Facebook has algorithms to determine whether data is hot or cold.
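To make that concrete, here’s a toy sketch of what such a classification could look like – purely my own illustration with made-up thresholds, not anything Facebook described in the talk:

```python
from datetime import datetime, timedelta

# Toy hot/cold classifier -- the thresholds are my own invention, not Facebook's.
HOT_WINDOW = timedelta(days=30)   # touched in the last month -> still hot
MIN_HOT_ACCESSES = 5              # or viewed at least this often recently

def is_cold(last_accessed, recent_access_count):
    """True if an object looks like a candidate for cold storage."""
    recently_used = datetime.utcnow() - last_accessed < HOT_WINDOW
    frequently_used = recent_access_count >= MIN_HOT_ACCESSES
    return not (recently_used or frequently_used)

# That photo from the night out three years ago that nobody ever views:
print(is_cold(datetime(2012, 3, 1), recent_access_count=0))  # True
```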

Since Facebook’s business revolves around their users’ data, and none of it ever gets deleted unless you ask them to (i.e. you choose to delete some of your data), how cold data gets stored is a critical question. Cold data has to be kept reasonably accessible (it doesn’t need to be as quickly accessible as hot data, since it’s used infrequently) and very redundant (i.e. backed up), and all of that has to be balanced against cost – especially when you hear that Facebook’s users create many millions of pieces of data each day, all of which has to be stored for as long as those users want it (i.e. indefinitely).

And add to that yet another consideration – speed. Or more specifically, the speed at which a failed drive can be rebuilt. When your operation is as big as Facebook’s, hard drives fail frequently, and ideally they need to be rebuilt faster than they fail. I think one of the facts we heard during the talk was that Facebook expects a hard drive to fail once every 18 minutes. And given how much data each drive holds, that’s a lot of data that needs to be rebuilt, and pretty quickly too. So efficiency is another consideration when dealing with the problem of cold storage.

A row in one of Facebook’s data centers

So how does a company like Facebook solve the problem of cold storage?

It gets a bit technical but try to stay with me. I’ll do my best to keep it as conversational as possible. But if you’re reading this, you probably want all the gory technical details!

The fundamental principles of data integrity and redundancy
We’re all familiar with the need to back up our data. Every day, software like Time Machine on OS X or System Restore on Windows handles the nuts and bolts of that for you.

Once you look closer, it turns out that there are a number of ways to ensure data is sufficiently redundant. One easy but expensive way is full-scale replication: you replicate all of the data bit for bit and store it on a different hard drive, computer, or location. But doing that for any given file means a cost increase of nx, where n is the number of copies of the file and x is the cost to store and maintain the file.

What’s more, storing a single complete duplicate of a file on another drive isn’t particularly redundant. If that drive and the original one holding the file both go down, that’s it. You’re done. No way of recovering your data.

It turns out to be much better to split files into chunks and store each chunk on a different drive. This also has the added advantage of improved read and write times, since lots of machines can perform a task quicker than just a few – the power of distributed computing.
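As a minimal sketch of the chunk-and-spread idea (again my own illustration, nothing Facebook-specific), splitting a file and dealing the pieces out across drives might look like this:

```python
# Split a file into fixed-size chunks and deal them out round-robin across
# drives -- a minimal sketch of the idea, not Facebook's actual scheme.
CHUNK_SIZE = 1024 * 1024  # 1 MB per chunk, an arbitrary choice

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def spread_across_drives(chunks, num_drives):
    """Assign chunk i to drive i % num_drives so no single drive holds the whole file."""
    placement = {}
    for i, chunk in enumerate(chunks):
        placement.setdefault(i % num_drives, []).append(chunk)
    return placement

data = b"x" * (3 * CHUNK_SIZE + 100)                    # a pretend ~3 MB file
chunks = split_into_chunks(data)
print(len(chunks))                                      # 4 chunks
print(sorted(spread_across_drives(chunks, 3).keys()))   # spread over drives 0, 1 and 2
```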

But how do chunks solve the issue of the nx cost associated with duplicate copies? That’s where Reed-Solomon encoding comes in. It breaks data up into chunks, where r chunks hold the actual data and k chunks hold parity (extra information that lets us check the file and reconstruct missing chunks). The idea is that you can lose up to k chunks and still be able to reconstruct the file. For example, with a Reed-Solomon encoding of (10, 4), you can lose any 4 chunks of the file, whether parity or data chunks, and still reconstruct it. That means full-scale replication, and the big costs associated with it, is no longer necessary. Reed-Solomon encoding provides significantly improved redundancy at a fraction of the price (the exact saving depends on the values chosen for r and k).
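To make the cost difference concrete, here’s a bit of back-of-the-envelope arithmetic comparing n-way replication with an (r, k) erasure code – my own numbers, not figures from the talk:

```python
# Back-of-the-envelope storage overhead: n-way replication vs. a
# Reed-Solomon-style (r data, k parity) encoding. My own arithmetic,
# not figures from the talk.

def replication_overhead(n):
    """Bytes stored per byte of real data with n full copies."""
    return float(n)

def erasure_overhead(r, k):
    """Bytes stored per byte of real data with r data + k parity chunks."""
    return (r + k) / r

print(replication_overhead(3))   # 3.0x -- three full copies, survives losing 2 of them
print(erasure_overhead(10, 4))   # 1.4x -- RS(10, 4), survives losing any 4 chunks
```

So for the (10, 4) scheme you pay roughly 1.4x the raw data size instead of 3x, and a file still survives losing any 4 of its 14 chunks.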

Another stumbling block
That’s all well and good, but if different chunks of files are stored on different drives to ensure redundancy, how do we know where the chunks of a file live when we need to rebuild it after a drive failure? Well, we’ll need databases to keep track of that. But if a 3MB file is split into 14 chunks and a drive holding an enormous amount of data fails, that’s a lot of database records to look up and change once the data on that drive is rebuilt from each file’s respective chunks on other drives.

The solution to that is volumes, which come in two kinds: logical and physical. Both are basically just groupings of chunks. This means that instead of updating a record in a database to say that chunk103939 is now in physical volume 1, you have one record saying that chunk103939 is in logical volume 1 (this lives in one table) and another saying that logical volume 1 is in physical volume 2 (this lives in a second table, and is the one you update when a physical drive fails).

So you have many orders of magnitude fewer records to update should a disk fail, i.e. you just have to change the records specifying which physical volume a logical volume is located on (remember that a logical volume represents many chunks of many files).
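Here’s a toy version of that two-level lookup – the IDs and table layout are made up by me to illustrate the indirection, not Facebook’s actual schema:

```python
# Toy two-level lookup: chunk -> logical volume -> physical volume.
# IDs are made up; this just illustrates the indirection described above.

# Table 1: chunk -> logical volume. Huge, but untouched when a drive fails.
chunk_to_logical = {
    "chunk103939": "logical-1",
    "chunk103940": "logical-1",
    "chunk205511": "logical-2",
}

# Table 2: logical volume -> physical volume. Tiny by comparison.
logical_to_physical = {
    "logical-1": "physical-2",
    "logical-2": "physical-7",
}

def locate(chunk_id):
    return logical_to_physical[chunk_to_logical[chunk_id]]

print(locate("chunk103939"))                    # physical-2

# A drive fails: rebuild logical-1 elsewhere and update ONE record,
# rather than one record per chunk it contained.
logical_to_physical["logical-1"] = "physical-9"
print(locate("chunk103939"))                    # physical-9
```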

The final problem to solve, then, is where to locate volumes. You want volumes to be allocated efficiently across drives to ensure performance, i.e. you don’t want to put all your volumes onto one drive, since that puts all the load on the physical drive they’re stored on while other drives spin idly (distributed computing is not being used effectively).

You could approach this problem in a number of ways:

i) Best fit solution
This involves examining every drive and working out which one each volume fits best on. It gives the best allocation but is expensive to compute.

ii) Random Allocation
Much less expensive and a reasonable solution, it involves randomly allocating volumes to drives.

It turns out that a variation of ii) is the most efficient and effective solution. Completely random allocation is bad in the sense that you will inevitably end up with some drives at maximum capacity while others are nearly empty. Better still is “random choose two” allocation.

This involves randomly selecting two drives and then placing the volume on the emptier of the two. The result is a fairly even distribution of volumes across all drives – sigma, the standard deviation of how full the drives are, decreases significantly. Interestingly enough, increasing the number of choices (random choose three, etc.) doesn’t improve the allocation enough to justify the higher cost involved.
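If you want to convince yourself, here’s a quick simulation (with arbitrary numbers of my own choosing) comparing purely random placement against picking the emptier of two random drives:

```python
import random
import statistics

# Quick simulation with arbitrary numbers: purely random placement vs.
# "pick the emptier of two random drives".
NUM_DRIVES = 100
NUM_VOLUMES = 10_000

def purely_random():
    drives = [0] * NUM_DRIVES
    for _ in range(NUM_VOLUMES):
        drives[random.randrange(NUM_DRIVES)] += 1
    return drives

def random_choose_two():
    drives = [0] * NUM_DRIVES
    for _ in range(NUM_VOLUMES):
        a, b = random.randrange(NUM_DRIVES), random.randrange(NUM_DRIVES)
        drives[a if drives[a] <= drives[b] else b] += 1
    return drives

print(statistics.stdev(purely_random()))      # around 10 on a typical run
print(statistics.stdev(random_choose_two()))  # much smaller, typically around 1
```

On a typical run, the two-choice version cuts the standard deviation of drive load by roughly an order of magnitude, which matches the intuition from the talk.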

And there you have it, a whistle-stop tour through cold storage. I hope I’ve given you a little taste of the problems in this domain, because they are certainly fascinating ones.

And it being Facebook, the cold storage team got their project (implementing these principles) up and running in just 8 months.

Move fast, break things. I get that! Now that’s a company I’d like to work for.

Corrections to the technical stuff above on a postcard please (or just leave a comment below). I’ll work on taking better photos next time!