The moment I realized I needed a CS Degree

A little more than two years ago, I was in London, UK and had just finished a 12-week hacker school. Rather than looking for a dev job during the course’s demo/graduation day (as is the norm at a hacker school), I was getting ready to move across the Atlantic to Vancouver, Canada to begin my Bachelor of Computer Science degree at the University of British Columbia. Here’s the story of when I realized a hacker school really wasn’t enough to get me where I wanted to go.

The story

It must have been something like Week 8 of the hacker school when I started thinking about applying for jobs and what my options would be at the end of it. I remember checking out a bunch of companies that used Rails in London and looking up sample interview questions on the careers section of their websites and Glassdoor. There were quite a few smaller independent firms that seemed open to hiring a hacker school grad, and their questions didn’t seem so bad.

At that point, I had just started to get comfortable with Sinatra and Rails and had probably built my first couple of web apps – a classifieds website for university students and a lowest-unique-bid auction site. And, as you do when you’ve just started to get the hang of a new field, I was thinking that I was getting pretty good – they call that the Dunning-Kruger Effect. I remember thinking that if things had worked out, I could have built a Facebook – I mean, how hard could it be? (Building a scalable, secure and efficient system is hard, as it turns out, and that’s ignoring the user acquisition and growth side of things.)

But being the sensible person I am – I’m trying not to cringe as I write this post – I figured I should probably work at Facebook first. You know, get an idea of the finer points. Besides, I’d heard great things about the company from people who worked there. I’d even visited the office once before (the one on Page Mill Road, I think) and it was pretty cool (free drinks, open-plan office, Herman Miller chairs). So I checked out the careers section of their website and started trying to solve a sample engineering problem – back then, Facebook had engineering challenges that you could solve to get contacted by a recruiter.

The sample challenge was to write a solution to the Tower of Hanoi.


Tower of Hanoi

As per Wikipedia, the objective of the puzzle is to move the entire stack of disks from one rod to another while obeying these rules:

  1. Only one disk can be moved at a time.
  2. Each move consists of taking the upper disk from one of the stacks and placing it on top of another stack, i.e. a disk can only be moved if it is the uppermost disk on a stack.
  3. No disk may be placed on top of a smaller disk.

If I remember correctly, the challenge was to generate the sequence of moves that solves the puzzle.

Come again?

Yep, I had no idea where to even begin – and this was the sample challenge. The one meant to show you how to handle input and output rather than to represent the difficulty of the real challenges, which are always a lot harder than the illustrative examples.

I remember doing some research on the Tower of Hanoi and discovering that it had to do with something called recursion – I only had a vague idea of what that was at the time. Most importantly though, I found out that this was a well-known problem taught in pretty much all CS degrees as part of their introduction to algorithms and data structures.

The more I read about the topic, the more I realized I didn’t know. My brief moment of (over)confidence in my abilities quickly evaporated as I figured out that knowing how to build a web application in something like Rails is not the same as being a Software Engineer – that building an MVP of a social network site is not the same as engineering one. The former can be done by following Michael Hartl’s famous Rails tutorial, while the latter requires knowledge of computer science and algorithms, hardware, growth, marketing, data science, and more, if the app is to stand any chance of success.

Doing the hacker school was great for figuring out that coding for 8, 9, 10, 11, 12 hours a day was something I really enjoyed, but most importantly, for learning how much more I still had to learn if I wanted to tackle really complex problems with code.

And so I looked around for CS degrees that took students with a previous degree. There were a couple of Master’s programs in the UK which seemed OK but a bit too short: most were only one year long, which didn’t seem enough time to get a really thorough grounding in fundamental CS theory, and they didn’t seem to have the top-tier tech companies recruiting from them.

The top CS schools in the US, on the other hand, generally didn’t offer second-degree programs. Some, like Stanford, didn’t admit students for second Bachelor’s degrees at all, while others, like Waterloo, required students to start from scratch with a full 4-year program.

And then I found out about the Bachelor of Computer Science (BCS) program at the University of British Columbia in Vancouver. It checked all the right boxes: it was shorter than a regular CS degree but not too short (advertised as 2 years, 2.5-3 years in practice), it had good links to American tech companies (Facebook, Google, Amazon and Microsoft all recruited its students), and it had a decent reputation (somewhere between 30 and 60 in world rankings, if you buy that sort of thing). And before I knew it, I was on a plane to Vancouver, Canada.

And that’s it. That’s how I ended up studying for a CS degree and how I kind of went full-circle this summer interning at Facebook – something I am certain I could not have done with just my experience at a hacker school, not by a long shot. It’s also been a while since I solved the Tower of Hanoi.

Here’s the code I wrote for it during my intro to algorithms class in the first year of my CS program. It’s probably a little rough around the edges, but I remember thinking at the time, wow, I’m really getting somewhere.


#include <cstdlib>   // for atoi
#include <iostream>
#include <string>

using namespace std;

void moveDisks(int, string, string, string);

int main(int argc, char* argv[])
{
  if( argc != 2 ) {
    cerr << "Usage: " << argv[0] << " n" << endl;
    return -1;
  }

  int n = atoi(argv[1]);
  moveDisks(n, "A", "B", "C");
  return 0;
}

// Move n disks from 'from' to 'to', using 'transition' as the spare peg.
void moveDisks(int n, string from, string transition, string to)
{
  if(n <= 0)   // nothing to move (also guards against bad input)
    return;
  moveDisks(n-1, from, to, transition);   // clear the top n-1 disks onto the spare peg
  cout << "Move disk from peg " + from + " to peg " + to << endl;
  moveDisks(n-1, transition, from, to);   // move them back on top of the largest disk
}

Hack the North

This time last weekend, I was on the other side of Canada in Waterloo, ON, hacking away. Maybe it has something to do with the distance traveled to get there, but it seems like a long time ago. I almost feel like I’m moving backward taking this weekend easy (and by easy, I mean getting more than 6 hours of sleep total from Friday to Sunday).

Hack the North is Canada’s largest student hackathon, and this year’s inaugural event, held last weekend, had about 1000 students attending from countries across the world. Generous sponsorship meant flights were reimbursed at 80%+ of their cost, so students from as far away as Amsterdam and Shanghai were in attendance.

There wasn’t any specific theme to the event, other than to build something cool that worked – no one said as much but I surmise that’s the aim of every hackathon. Lots of cool hardware was available to work on as well, from Myos to Pebbles and Oculus Rifts.

Getting to the event from Vancouver was an event in itself: a crazy early start, then waiting at the airport on the other side for the shuttle to Waterloo, and then the shuttle trip itself. Who knew the traffic in Toronto, where everyone flies into, would be so bad?

There was a little dinner left when we arrived, and then we got to hear some amazing speakers open the event. We had Chamath Palihapitiya, who was formerly in charge of growth at Facebook and is now in VC, having started the fund The Social+Capital Partnership. He spoke about startups and gave realistic advice about what life in tech is like. Chamath was extremely confident and charismatic, easily one of the best orators I’ve ever heard. Check out his talk here (the video is still not up but I’ll update this post when it is!).


The hacking started at midnight and my friend Daniel and I got to work building a travel optimization app. It uses Yelp’s API to get a list of attractions in a city and Routific’s route optimization API to provide an efficient route for visiting the attractions in a single day, if one exists. A quick way for tourists to work out how to get the most bang for their buck on vacation.

There were chips and soda aplenty to keep people powered through the event, and there was even a small sleeping area. Unfortunately, the sleeping area wasn’t well organized: it was brightly lit and far too small, so I guess they didn’t expect many hackers to sleep much during the event. Daniel and I ended up finding chairs in a random room on the Waterloo campus on Saturday morning for a couple of hours of sleep.

Saturday night/Sunday morning was probably the most interesting. People were definitely starting to feel the burn by that point, and the scene at 4am said it all.

Every team got a chance to pitch their app in 100 seconds in front of a selected group of the judges, so there was definitely something to aim for during the hackathon. It’s all a bit of a blur but I spent most of Sunday morning trying to finalize the general flow and look of the app – I am definitely not a designer.

In terms of the stack, we built our app using Angular and had a Rails back-end for interfacing with the various APIs. Neither of us had used Angular before, so that definitely made the hackathon interesting, to say the least. We figured, if you can’t try out new tech at a hackathon, when can you?

One of the things we found is that Angular, a JavaScript front end framework, is very domineering. You either use it completely for the front end, or prepare for some unexpected behavior if you try to mix in some regular JavaScript in there. We definitely ended up finding ourselves in the latter position, as we moved to get things working even if that meant doing things in less than the Angular way.

Another thing I learned this weekend was about OAuth, an authorization protocol which we used to communicate with the Yelp API. Since it uses tokens and secret keys, it wasn’t possible to have just a front end; we needed a back-end as well to avoid exposing our keys. This is different from something like the Google Maps API, which we used for the map and for geocoding (since Yelp doesn’t always return an attraction’s latitude and longitude), where client-exposed keys are OK.

By the time hacking stopped at 10am, I’d been up for more than 24 hours, with little sleep before that (!), so I managed to get a quick nap before we pitched in front of the judges.

Our app lacked the design polish of other teams, but the judges found it pretty interesting, as did the engineers from Yelp who we pitched to separately (main prize vs API-specific prizes, basically). I thought we did decently for a team of two using an unfamiliar stack, and it was great to find out later that Yelp had awarded us the prize for the best use of their API at the event!

We’re getting four Leap Motions between the two of us. I’ll make a post once those arrive in the mail! I’m definitely interested to see what can be done with the hardware; hardware hacks aren’t something I’ve really explored before.

But yeah, what an experience. Daniel and I work at the same company as coop software engineers, Axiom Zen, and it was great that we got the Friday off to travel to the hackathon and that we got so much support from them. They were pretty stoked that we managed to bring home a prize as well, using an API from a company they are helping accelerate. It’s great to work somewhere which gets as excited about things like hackathons as you are.

So what next? Work is super busy, there’s always lots to do but I’m enjoying it – more on that soon. I saw some interesting apps built in Node.js at the hackathon so that’s on the agenda this weekend! I’ll post if I build anything half decent 🙂

And if you’re wondering how you can get involved in hackathons, ChallengePost is a great website to start off with. They list hackathons going on around the world, both offline and online.

Ahhh summer school…

And by ahhh, I really mean AH! The first day of summer school started today and I’m already ready for the term to be over. It’s partly due to the insane number of cover letters I’ve been writing for internship applications, but also because lectures during summer school are two and a half hours each. Keeping your concentration for that long is difficult, to say the least. Even so, I’m pretty excited about the two courses I’m taking this term.


CPSC 310 – Introduction to Software Engineering 

The prof in my Introduction to Software Engineering class is really charismatic, and that definitely helps keep your attention when you’re trying to separate your waterfalls from your spirals (two types of software process). On the downside, I, and quite a lot of other people, failed to persuade the prof to let us code the class project in our language of choice. The project for the course involves parsing a publicly available dataset and plotting it in Google Maps (or equivalent), maintaining a database to which users can add and edit records, and integrating social media authentication. Sound simple? I thought so.

Having done quite a bit with Rails, I thought the class project would be a perfect excuse to get better acquainted with some newer, and highly desirable, technologies. Specifically, Node.js. Everyone loves Node these days because it is asynchronous rather than strictly sequential in how it executes code, and this makes for more scalable applications. With Node, the server doesn’t have to wait for a particular request to complete or a method to return; it can carry on with what it’s doing and handle the result as and when it comes in.

Oh Node! (that’s really meant to be Oh no. Get it?)

Unfortunately, the summer term is so short that the prof wasn’t buying what I was selling (it’s a great way to learn new technologies! I’ve done a hacker school! I have a portfolio!). At the end of the day, summer classes are a logistical nightmare. Trying to help a hundred or so students get their environments set up, teaching them software processes, and getting them to build a web app in a group without killing each other is tough. I get that. And so, instead of using Node, we’re all going to have to use... wait for it... Java with the Google Web Toolkit.

I hold nothing at all against the professor – she agreed that something like Django or Rails would be infinitely more fun, and of course there’s the irony that Google Web Toolkit just converts Java to JavaScript anyway – but the rules are the rules. I’m just happy that we’ve got someone to teach the course this summer! Finding good professors, make that any professors, who can teach software construction is HARD.

Anyway, I still get to learn how Java works on the web – and of course, I can always find time, somewhere, to learn Node on my own – and that’s OK. If anything, this is good preparation for the real world, where you have to use a specific technology no matter how inefficient 🙂


CPSC 221 – Basic Data Structures and Algorithms

My first lecture in this class was pretty interesting. It was mostly a high-level overview of why algorithms are important, with a general intro to arrays and (linked) lists, and how a data structure might be good for one thing – e.g. arrays are good for binary search compared to lists – but not for another – e.g. arrays do not handle insertion well since they are fixed-size data structures. I didn’t learn anything brand new today, but it was good to get an idea of where the course is heading.

One of the things I really liked was when the prof said that learning algorithms is the difference between you and some kid in high school who knows how to hack in Python. To take it further, I guess knowing how to get an array sorted (my_array.sort) is not the same as being able to sort an array (i.e. understanding the algorithms behind Quicksort and Merge Sort).

Who cares? An example: ArrayLists vs arrays

Choosing the wrong data structure can be catastrophic. The example we heard in class today was of an ArrayList and why you wouldn’t use one in a mission-critical, time-sensitive environment. An ArrayList may be a foreign concept if you’ve only ever coded in a language like Ruby. In Ruby, there’s no need to declare a size for your arrays (although you can, with the values set to a default); they simply resize themselves as items are inserted or deleted (the exact implementation, in the C that underpins Ruby, is fairly involved, so let’s leave it at that). This means that what Ruby calls arrays are really ArrayLists in a language such as Java, i.e. self-resizing arrays. Don’t forget, arrays in the traditional sense are of fixed size.

Back to our example: we wouldn’t want to use an ArrayList in a mission-critical system because resizing the underlying array of an ArrayList takes time. Time you don’t have. Imagine a really badly timed resize: your ArrayList holds 1 million elements, its underlying array (of size 1 million) is full, and you insert a new element. The underlying array doubles in size, and now you’ve spent a lot of time copying elements from your old, smaller array to your new, bigger one, just to insert a single element. Oh, and you’ve also used up a lot of memory in the process. Guess what? Your mission-critical system, let’s say a rocket ship, didn’t have that memory available for you to use. It crashed.

Or maybe the memory was available but in the time it took to resize the array, the rocket’s reverse thrusters were fired too slowly. Guess what? It crashed.

I am psyched for this class, not least because it’s going to be needed to get through a lot of the interview questions thrown at undergrad CS students applying for internships. And who doesn’t like solving complex problems?

One other thing: we need to do the labs for this class in C++, and seeing as most of us don’t know C++ and it isn’t going to be taught in class, I’d better go learn C++. This feels like the real world already.

Second Term of my CS Degree at UBC

Do you ever have those moments when you can’t decide whether time has passed quickly or agonizingly slowly? I’ve just finished the second term of my CS degree at UBC and am about to start summer school in a couple of days. That means 3-hour lectures, sometimes two in a day, and lots and lots of labs. I’ll be applying for co-ops/internships at the same time, so summer is definitely going to be busy! That said, I’m looking forward to getting down to the more technical and advanced CS classes. One complaint so far has been the heavy enforcement of prerequisites here at UBC, which has meant not being able to take as many CS classes as I’d have liked. C’est la vie. That hasn’t stopped me from learning things like algorithms on my own with Coursera and a textbook or two. Yes, I am keen.

Anyway, I thought I’d write a little about the courses I’ve just finished this term and how they’ve fitted in with my learning to code (better – I’d like to think I can code now!).

The classes I took this last term were:


CPSC 210 – Software Construction

One of the apparently easier CS classes here at UBC – deceptively so, I’d say. The course is taught in Java and is really the first time students are expected to pick up a new programming language on their own, with the focus primarily on object-oriented design principles. I say deceptively easy because it’s one of those potentially “waffly” courses, where you can test decently even without fully grasping the principles behind object-oriented design.

I’ll be the first to admit that I took a laid-back approach to the course initially. Having learned Ruby in the past meant things like instance variables and constructors weren’t new to me, and so away I went; the first couple of weeks were a breeze. As the course went on and I started learning about fundamental algorithms and data structures, I realized I wasn’t learning Java nearly well enough, especially if it was going to be the language I’d likely interview in when applying for co-ops/internships. Plus, Java and Ruby are pretty different as far as object-oriented languages go. Java is statically typed, which means once you declare the type of a variable, that’s it – the variable’s type can never change – and that introduces interesting issues like casting. There were also things I’d never thought about before, like dispatching and how it fits in with inheritance, and what interfaces and abstract classes (very un-Ruby things) are.

The course finished with an Android app project. While the finished product, an app to get bus times and locations, was pretty amazing – I actually use it day to day – it was disappointing that we only implemented a small part of its functionality. The Android skeleton of the app including the map function and its graphics were all provided to us, we just had to make the API calls to the bus company to get the data and then display it for the user as a list or as markers on the map. Not exactly trivial, but not the full package either. It was a shame that we didn’t get to learn more about Android and take full ownership of the project. To put it another way, while I might say in an interview that I’m more familiar with transport APIs and parsing data, I wouldn’t say that I “built” a shippable Android app in this class.

We also looked at some design patterns which are approaches to common programming scenarios (more on that in another post), and that laid the foundations for the software engineering course (CPSC 310) which I’m going to be doing this summer.

All in all, a decent course – one that’s easy not to get the most value out of, but I’d like to think having a solid grasp of things like interfaces, inheritance, polymorphism, and dispatching will stand me in good stead come interview time.


STAT 302 – Introduction to Probability

I really enjoyed the intro stats course I did in my first term, so I decided to go one step further with an elective in probability. Super interesting course: we got to learn about the key discrete and continuous distributions, and concepts like conditional probability. I had to dust off my double-integrals knowledge at one point, so it was definitely math-oriented. I ended up doing pretty well in the course, and it’s definitely something that will feed into my CS work in the future, particularly as I hope to go more into data mining and machine learning (think predictive capabilities).


MATH 221 – Matrix Algebra

Another course which is super related to computer science and a requirement for the machine learning and data mining course here at UBC. The idea is that you can represent a system of equations in matrix form and then solve it through a technique known as row reduction. It can be as simple as solving

2x + 6y = 14

x + 7y = 15

which you probably already know how to solve using the elimination technique of simultaneous equations, but a more systematic approach is taken in matrix algebra when you have many more variables.

We ended the course with the idea that you could model populations using matrices and figure out what happens to them as years go by. The examples were simple and involved a predator and a prey, the numbers of the two being related to each other by a set of equations.

Super interesting stuff, but unfortunately there was too much to cover on the syllabus. My prof took her time teaching the course (not a bad thing by any means, since she explained concepts thoroughly), but the consequence was rushing through the practical applications component, which turned out to be equally testable, and so the exam was pretty interesting (and heavily scaled!).


ECON 311 – Principles of Macroeconomics

My program lets me choose a couple of non-CS electives, and I thought I’d change it up with something a little more discursive and analytical. Economics and geopolitics are things I find pretty interesting (just me?), so it was a great opportunity to think a little deeper about the validity of different approaches to economic booms and busts/recessions. Not strictly related to CS – it’s not going to make me a better coder – but it is definitely going to help how I think about business and the role of government in managing economies. I’m not saying I can predict exactly how the government will deal with the next recession (economies are cyclical, so recessions are inevitable), but I can talk through the different approaches and which ones are favoured here in Canada.


And onwards to the summer term…

I know grades aren’t everything but I’m really happy with how things have gone this last term, and more than anything, am excited to get stuck into the more meaty CS courses. There are two summer terms and for the first summer term I’m doing:


CPSC 221 – Basic Algorithms and Data Structures

The interview course, as in the stuff you’ll need to know to pass interviews. It covers the building blocks of CS theory. Shame that I’ll be just beginning the class when I’m interviewing... Guess it’s time to read a little ahead so I don’t miss out on the more interesting co-op jobs!


CPSC 310 – Introduction to Software Engineering

More design patterns, with a focus on collaborating to deliver a software project using a formalized approach. It’ll be the first time CS students get to grips with web apps, databases and version control – most students use Java and Google Web Toolkit for their app – but since it’s not my first time around the block, I’m going to try to use the course as an excuse (if I needed one) to learn a new language. Node.js is really interesting because of its asynchronous nature, so why not!


EA Careers Event

Today was one of the days I really valued being back at university. Sure, the education process at university can be a bit slow sometimes, what with the system of pre-requisites and other formalities, but it makes up for that with the relationships it has with big name companies.

Electronic Arts is one of the big names in gaming. From casual games like The Sims and The Simpsons: Tapped Out to more hardcore games like Battlefield 4, EA has got everything. Tonight’s event was a chance for students in the Vancouver area to find out more about internships and jobs at EA’s Burnaby campus, which focuses mostly on sports titles (ever heard of a game called FIFA?).

Due to its location, EA is a big recruiter of students from UBC, where I’m currently a student, and it’s definitely somewhere I would love the chance to intern. We started the event with a tour around the campus and it was... cool! I was impressed by the sports facilities on site: an indoor basketball court, a gym filled with TechnoGym machines, a couple of yoga studios, and a full-sized football pitch.


EA’s Football Pitch

Sorry for the bad photo – we weren’t allowed to take photos anywhere else on campus besides a couple of minutes on the balcony. Even in the darkness, that’s one cool football pitch.

The talks were pretty interesting as well and I got one of the questions I’ve always wondered about answered.

How does EA engineer the same game for multiple consoles?

It turns out that there is a core EA tech team which develops software that sits on top of the various SDKs (software development kits) for the different consoles. This software acts as a common interface that abstracts away common tasks. For example, it might expose a feature for opening a file; to the user this is just a simple function call, but in the background the software performs the same task in different ways for each console’s SDK. This means that engineers working on a game can focus on the game itself, rather than making it work for each individual console.

There you go, you learn something new every day. And here are the two main languages used at EA:

Key languages for a career at EA

C++ and C#

Five things I’ve noticed about studying in Canada

Moving to a new country is always pretty exciting and it’s interesting comparing it to the places you’ve been before. So here are five things I’ve noticed about the Canadian education system compared to the British one, at least based on my experience at UBC.

    1. Professors let you ask questions here. A good thing, since it makes them more approachable and the lectures engaging. Not so good when you want to tell someone to just google the answer and not waste class time.
    2. Lectures are for reviewing material, not learning it. Throughout the whole of my first degree, I mostly went to lectures expecting the professor to teach me the material. Not going to happen. Teaching is for school. University lectures are for complementing your own self-study and consolidating your learning through another medium. OK, this is true at every university!
    3. Textbooks in Canada are expensive. Every time I buy a textbook, I think about the extension to the author’s home mansion which I’m part-funding. It probably has to do with the bigger geographical area and increased logistical costs.
    4. There are multiple knowledge checks during a course. There are so many quizzes, pre-labs, pre-lectures, assignments, midterms, and exams in my diary right now that I don’t even know where to begin. It’s definitely a new way of learning, where material is examined regularly and over a shorter period (versus the UK, where you’re mostly examined on the material in one go at the end of the year). A shorter feedback cycle is a good thing, but not when there isn’t time to regroup and learn from your mistakes, something which has happened in a couple of classes this year.
    5. UBC uses technology. There’s Blackboard, Piazza, Coursera, and a whole bunch of learning tools being used at UBC. That is definitely a good thing. I like that there are forums for every class and, even better, that the professors respond to them (OK, only the CS ones are staffed properly)! This is how education should be (although one of my classes is being run in parallel with a virtual course offering on Coursera – a free virtual course offering...). Certain classes also use iClickers, little devices that let you select an answer to a question posed by the teacher. Your performance on these in-class questions (which also captures the fact that you attended) is factored into your grade. Most lower-level classes are large, so professors don’t take attendance and just use iClickers instead.

Cold Storage in the context of Facebook

So one of the great things about returning to university, particularly a university in North America, is the number of opportunities to connect with the leading tech companies. Giants like Facebook, Microsoft, Google, and EA all recruit here at UBC, and that’s great in giving you something to aim for once you graduate. In part that’s down to proximity: all of the above companies have their main engineering offices in North America, and inevitably that’s where they’ll recruit from. Last Tuesday, I attended a talk by one of Facebook’s engineers on cold storage. It’s a topic I knew very little about before, but it’s hugely topical given the rate of data generation and the cost of storage, so I thought it was worth a share.

Cold Storage talk by Facebook – I “happened” to be near all the food

Disclaimer: This is based on my perception of (what seemed to me) a fairly complex talk so some of the details may not be correct and will almost certainly be a gross simplification of how they do things at Facebook but hopefully the general ideas still hold.

Without further ado…

The Problem: What is Cold Storage?
Cold storage is, as the name suggests, about storing cold data. We distinguish hot data from cold data by how readily accessible it needs to be, which in turn comes down to how frequently it is accessed. In the context of Facebook, hot data would be things like the photos you uploaded earlier this week or your latest status update. Cold data, on the other hand, might include those random photos from that night out (you know the one) that you uploaded three years ago and which no one ever commented on or viewed (how sad). Facebook has algorithms to classify data as hot or cold.

Since Facebook’s business revolves around its users’ data and none of it ever gets deleted unless you ask (i.e. you choose to delete some of your data), how cold data gets stored is a critical question. Cold data has to be stored so that it is reasonably accessible (it doesn’t need to be as quickly accessible as hot data, since it’s used infrequently) and very redundant (i.e. backed up), and all of that has to be balanced against cost – especially when you hear that Facebook’s users create many millions of pieces of data each day, all of which has to be stored for as long as those users want (i.e. indefinitely).

Add to that yet another consideration: speed. Or more specifically, the speed at which a failed drive can be rebuilt. When your operation is as big as Facebook’s, hard drives fail frequently, and they ideally need to be rebuilt faster than they fail. I think one of the facts we heard during the talk was that Facebook expects a hard drive to fail once every 18 minutes. Given how much data each drive holds, that’s a lot of data that needs to be rebuilt, and pretty quickly too. So efficiency is another consideration when dealing with the problem of cold storage.

A row in one of Facebook’s data centers

So how does a company like Facebook solve the problem of cold storage?

It gets a bit technical but try to stay with me. I’ll do my best to keep it as conversational as possible. But if you’re reading this, you probably want all the gory technical details!

The fundamental principles of data integrity and redundancy
We’re all familiar with the need to back up our data. Every day, software like Time Machine on OS X or System Restore on Windows handles the nuts and bolts of that for us.

Once you look closer, it turns out there are a number of ways to make data sufficiently redundant. One easy but expensive way is full-scale replication: replicate all of the data bit for bit and store it on a different hard drive/computer/location. But doing that for any given file means a cost of nx, where n is the number of copies of the file and x is the cost to store and maintain one copy.

And storing a complete duplicate of a file on another drive isn’t particularly resilient either. If that drive and the original one holding the file both go down, that’s it. You’re done. No way of recovering your data.

It turns out to be much better to split files into chunks and have each chunk stored on different drives. This also has the added advantage of improved read and write times since lots of machines can perform a task quicker than just a few machines – the power of distributed computing.

But how do chunks solve the nx cost of keeping duplicate copies? That’s where Reed-Solomon encoding comes in. It breaks data up into r + k chunks, where r chunks hold the actual data and k chunks hold parity information that lets you reconstruct missing chunks. The idea is that you can lose up to k chunks and still be able to rebuild the file. For example, with a (10, 4) Reed-Solomon encoding you can lose any 4 of the 14 chunks, whether parity or data chunks, and still reconstruct the file. That means full-scale replication, and the big costs associated with it, is no longer necessary. Reed-Solomon encoding provides significantly improved redundancy at a fraction of the price (the exact saving depends on the values chosen for r and k).
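Real Reed-Solomon coding runs over finite-field arithmetic, so here’s just a toy analogue in Python – a single XOR parity chunk protecting two data chunks, loosely a (2, 1) scheme – to show the reconstruction idea (the function names and sample data are my own inventions, not anything from the talk):

```python
# Toy erasure-coding sketch: 2 data chunks + 1 XOR parity chunk.
# Real Reed-Solomon uses Galois-field arithmetic and supports arbitrary
# (r, k), but the core idea is the same: a lost chunk is rebuilt from
# the surviving chunks rather than from a full copy of the file.

def make_chunks(data: bytes):
    half = len(data) // 2
    d1, d2 = data[:half], data[half:half * 2]
    parity = bytes(a ^ b for a, b in zip(d1, d2))  # XOR of the data chunks
    return d1, d2, parity

def recover(d1, d2, parity):
    # Rebuild whichever single data chunk is missing (passed as None).
    if d1 is None:
        d1 = bytes(a ^ b for a, b in zip(d2, parity))
    elif d2 is None:
        d2 = bytes(a ^ b for a, b in zip(d1, parity))
    return d1 + d2

d1, d2, parity = make_chunks(b"cold_storage_16B")
assert recover(None, d2, parity) == b"cold_storage_16B"  # lost d1, rebuilt it
assert recover(d1, None, parity) == b"cold_storage_16B"  # lost d2, rebuilt it
```

The same principle scales up: with k parity chunks on k different drives, any k losses are survivable, at a storage overhead of k/r instead of a whole extra copy per failure tolerated.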

Another stumbling block
That’s all very well and good, but if chunks of files are spread across different drives to ensure redundancy, how do we know where the chunks of a file live when we need to rebuild it after a drive failure? We’ll need databases to keep track of that. But if a 3MB file is split into 14 chunks and a drive holding an enormous amount of data fails, that’s a lot of database records to look up and change once the data on that drive is rebuilt from each file’s respective chunks on other drives.

The solution to that is volumes, which come in two kinds, logical and physical. Both are basically just groupings of chunks. Instead of updating a database record to say that chunk103939 is now in physical volume 1, you have one record saying that chunk103939 is in logical volume 1 (in one table) and another saying that logical volume 1 is in physical volume 2 (in another table – the one you update when a physical drive fails).

So you have many orders of magnitude fewer records to update should a disk fail: you just change the records specifying which physical volume a logical volume is located on (remember that a logical volume represents many chunks of many files).
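A minimal sketch of that two-level indirection in Python (the table and volume names are hypothetical, not Facebook’s actual schema):

```python
# Two-level volume mapping. chunk -> logical volume rarely changes;
# logical -> physical volume is the only "table" touched when a drive
# fails and its data is rebuilt elsewhere.

chunk_to_logical = {"chunk103939": "lv1", "chunk103940": "lv1"}
logical_to_physical = {"lv1": "pv2"}

def locate(chunk_id):
    return logical_to_physical[chunk_to_logical[chunk_id]]

assert locate("chunk103939") == "pv2"

# Physical drive pv2 fails and lv1 is rebuilt onto pv7:
# one record changes, no matter how many chunks lv1 holds.
logical_to_physical["lv1"] = "pv7"
assert locate("chunk103939") == "pv7"
```

The saving is exactly the indirection: every chunk record stays untouched, and only the handful of logical-to-physical rows for the failed drive get rewritten.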

The final problem to solve is where to place volumes. You want them allocated efficiently across drives to ensure performance; you don’t want all your volumes on one drive, since that puts all the load on that one physical drive while other drives spin idly (the power of distributed computing goes unused).

You could approach this problem in a number of ways:

i) Best fit solution
This involves examining every drive and placing each volume where it fits best. It yields the best allocation but is computationally expensive.

ii) Random allocation
Much less expensive and a reasonable solution: randomly assign each volume to a drive.

It turns out that a variation of ii) is the most efficient and effective solution. Completely random allocation is bad in the sense that you inevitably end up with some drives at maximum capacity while others sit nearly empty. Better still is “random choose two” allocation (sometimes called the power of two choices).

This involves randomly selecting two drives and placing the volume on the emptier of the two. The result is a fairly even distribution of volumes across all drives: sigma, the standard deviation of drive loads, decreases significantly. Interestingly enough, increasing the number of choices (random choose three, etc.) doesn’t improve the allocation enough to justify the extra cost.
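Here’s a quick simulation of that comparison in Python (the drive and volume counts are made up to illustrate the effect, not Facebook’s real numbers):

```python
import random
import statistics

random.seed(42)
DRIVES, VOLUMES = 100, 100_000

def random_alloc():
    # Fully random: each volume goes to a uniformly random drive.
    loads = [0] * DRIVES
    for _ in range(VOLUMES):
        loads[random.randrange(DRIVES)] += 1
    return loads

def two_choice_alloc():
    # Power of two choices: sample two drives, use the emptier one.
    loads = [0] * DRIVES
    for _ in range(VOLUMES):
        a, b = random.randrange(DRIVES), random.randrange(DRIVES)
        loads[a if loads[a] <= loads[b] else b] += 1
    return loads

print(statistics.stdev(random_alloc()))      # noticeable spread in drive loads
print(statistics.stdev(two_choice_alloc()))  # dramatically tighter spread
```

Running this, the two-choice strategy produces a far smaller standard deviation than pure random placement, for the price of just one extra random sample and one comparison per volume.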

And there you have it, a whistle-stop tour through cold storage. I hope I’ve given you a little taste of the problems in this domain, because they are certainly fascinating ones.

And it being Facebook, the cold storage team got their project (implementing these principles) up and running in just 8 months.

Move fast, break things. I get that! Now that’s a company I’d like to work for.

Corrections to the technical stuff above on a postcard please (or just leave a comment below). I’ll work on taking better photos next time!

Part II: Chapter 1.1 University, again.

It seems like only yesterday I was graduating from Makers Academy and yet two and a half months have passed since I was last enjoying the unending supply of soft drinks and ping pong games at their office in Shoreditch.

Since then, I’ve

  • been invited to the August cohort’s graduation day (time flies!)
  • moved to a new country
  • returned to university

Did that last one surprise you? Let me explain.

When I finished my first university degree (I studied Law), I told myself that I wasn’t ever going back to formal education. It always seemed too structured, too particular, and divorced from the real world. In a world where Google reigns supreme, does anyone really need to know the case in a footnote of a 1,576-page textbook, much less be judged on that knowledge (or the lack thereof, in my case)?

And yet, here I am, back in university; I’ve moved halfway across the world to a new city, in a new country (Canada), to study Computer Science. Depending on your point of view, that’s antithetical to the course I did at Makers Academy. After all, hacker schools are pitched as an alternative to studying for a CS degree.

So why, after graduating from one of these hacker bootcamps, did I decide to return to university?

In short, I went into Makers Academy with the desire to build things and I came out with the desire to build bigger things AND understand why they work.

Hacker bootcamps are all about learning how to deploy web apps, getting something up and running that can solve simple to moderately complex problems on a small scale. They can get novices from 0 to 60 pretty quickly simply because they focus on teaching exactly what you need for that and nothing more (there isn’t time). That’s inevitably going to come at the expense of an understanding of the underlying technology. Do you need to know how search algorithms work to implement search functionality for your PostgreSQL-powered Rails app? Nope. But you do need to understand them if you want to build more efficient programs; to bend the technology to do precisely what you need it to do and nothing more; to move away from being dependent on other people’s libraries/gems/work and create powerful tools of your own.

My current area of interest is data mining and search. The popular “big data fact” that often goes around is that 90% of the world’s data was produced in the last two years, which raises the question: what are we going to do with it, and how? This is something companies like Palantir are grappling with right now, on behalf of government organizations and retail conglomerates alike. Whether it’s interpreting the data to prevent a natural disaster or to turn a bigger profit, this is a field that is moving really quickly and has lots of opportunities for growth. And if people say there is a shortage of software engineers, then there is a drought of data scientists and software engineers who really understand data.

If big data is “the next big thing” (indeed, it’s probably already arrived), I want to be a part of it, and that’s why I’ve returned to university. You can’t begin to solve problems in that domain without understanding things like algorithms, data structures, and statistics. Sometimes you’ve got to bite the bullet, move halfway across the world, and do something you said you’d never do again, just so you can pursue an ambition, a dream maybe. So here goes: I am now a Bachelor of Computer Science student at the University of British Columbia, Vancouver, Canada. Things are getting serious.