To the experienced programmers out there, the subject of this article will be immediately apparent. To others who are a bit more … excitable, it may seem rather risque. To those who watch a lot of television, it may simply be a reference to a certain warehouse retailer (I won’t bother giving them a Google boost). But, there are those of us who don’t have Computer Science degrees, and have never been told what the Big O notation is, and the Wikipedia article isn’t much help unless you’re a mathematician (I’m not). Thankfully, I finally had a big “Oh!” about the Big O, and I thought I’d share how I see it, for those of you who may not get what everybody’s talking about.
You should only continue reading if you’re confused when you see something like this: “I just took a function from O(n^2) to O(n)!!” If that sentence made perfect sense, feel free to ignore the rest of this article. Otherwise, here’s what you’re probably thinking:
- It looks like a function, but what’s O()? What’s it do?
- What’s n? Where does it come from? What’s its value?
- What should that function return?
- How is it relevant to the other function you’re talking about?
Well, in a nutshell, it’s a measure of how well an operation performs against large sets of data. Imagine a music player like iTunes. Sure, everything is nice and fast when you only have one track in your library, but what about when you have a hundred? A thousand? A million??? (Don’t forget, September 19 is Talk Like a Pirate Day.) The Big O notation is a concise way to communicate how well something scales to large datasets like that. Things like adding and removing tracks, searching the library and even just loading the application can all be affected by large amounts of data. So it’s important to be able to identify how well each piece performs under pressure and to communicate that performance to others. If only more of us understood how it works.
First, O(). For whatever reason, it’s always called O, but that makes it easier to spot, and makes it handy to call it Big O notation. Whenever you see it, you can be fairly certain it means what I’m about to reveal. But what is it, really? It’s a kind of imaginary function that has something like the following definition:
def O(n): return t * n
Oh, great. Thanks, Marty, you just defined a function by adding another undescribed variable. Worse, you didn’t even have the decency to give it a docstring, so I still have no idea what it all means.
True, but what’s important so far is that you get that image in your head. I’ll describe it in more detail in the following list, but seeing that function definition should help you remember the whole thing. It certainly helped me when I finally visualized it like that. So now we have three things to define:
t—The time it takes for an operation to process one item. This may be different from one system to another, due to things outside your control, but for the purpose of analyzing its performance across different sets of data, consider it a constant. If a function takes one second to process one item, t=1. If it takes two seconds, t=2. Get it? Simple.
n—A number based on the number of items in the dataset. The number of tracks in your library, rows in your database, items in your cache, friends in your social network, etc. Notice that this is only based on the number of items. The actual value passed in may get modified beforehand, and I’ll explain more about that below.
The return value of O()—The time it takes to process a given number of items. Because you can pass in any number of items from one to infinity (at least theoretically), this return value tells you how well it will scale from small to large datasets.
Notice above where I said that n is only based on the number of items in your dataset. In reality, you’ll often see things like O(n^2), where n gets modified before being passed into O(). In these cases, the n you see in the function call is the true number of items in your database. It gets modified according to how well the application performs. This is why O() is really just an imaginary function: you can’t just pass it a number and have it tell you how well the function performs, you have to figure that out for yourself.
Again, it’s really all about communicating performance to others, rather than determining performance automatically. Thankfully, I believe there are some applications out there to test such performance against various types of datasets, but I don’t know what they are or how they work, so I’ll leave those for another article some other time. Instead, I’ll explain a bit about what the various modifications to n mean and how to use them.
The easiest form to understand is O(n). That means the performance of the function is directly proportional to the number of items in your dataset. If it takes one second to process one item, it’ll take 100 seconds to process 100 items, and so on. Depending on the operation you’re analyzing, this can be an acceptable mark to meet. It’s not the best score possible, but it might be the best that can be achieved for you particular case. No article can tell you the best you can do, unfortunately. That’s very dependent on the task a function is meant to perform.
The next step down from that is O(n^2). That is, n-squared. This means that if processing one item takes one second, it will take 10,000 seconds to process 100 items (that’s 100 squared, see?). That’s nearly seven hours! Clearly this is unacceptable in most situations, though hopefully it takes far less than a second to process a single item anyway. This score often occurs when a function iterates over a list (that is, a variable-size dataset) twice to do what it needs to do. That would mean that adding a single item to the list would not only add an item to be processed, but another whole iteration over the list. Bad news. Of course, O(n^3) is even worse, but you should get the idea about how the rest of the downward spiral goes.
Thankfully, there’s a great score to shoot for: O(1). Notice that n doesn’t even get passed in here (though you could look at it as n^0 if you like). That means that the function’s performance has no relation to the number of items in the dataset. If it takes one second to process one item, it’ll still take just one second to process a hundred items, a thousand, even a million, yeh filthy pirate. In theory, I suppose it might be possible to get a better score than this, but that would mean that the system actually gets faster with more records. There are likely very few cases where that would even be possible, and if you’re still reading this article, you probably won’t ever run into them (I certainly don’t expect to).
The article that finally gave me my realization on this subject was Eric Florenzano’s recent post on cache invalidation performance. He managed to come up with an O(1) approach that should scale extremely well on large systems. He does a good job of explaining how he analyzed his code, but I’ll describe a couple variations of his approach and how they would be scored. So, if you haven’t read his article yet, please do so now, so the rest of this section makes sense.
Really, go read it. I’ll wait.
Okay. First up, rather than tagging a set of keys so they can be invalidated together, imagine that you only set out to solve the first problem: not knowing what keys have been set. You set up a list of cache keys and append to it each time you add an item to the cache. Then, when you want to indvalidate, you look through the whole list for some that match the user. For whatever reason, you didn’t think to also check for the start value at the time, so you have to cycle over the list of matching users to find a matching start value. Then you look over that list for a matching end value. Even though you’re technically not scanning the whole list of cache keys every time, this would still likely be considered O(n^3) because it depends on the size of three different factors. Not good.
Now imagine that you’ve seen the light and you check for all three values on the first pass. Now the time is dependent solely on the size of the list, because checking each item takes the same amount of time no matter how the list is laid out. This is a significant improvement, but you still have to consider and ignore a lot of entries before you get to the one you want. Because this performance depends on the number of items in the cache, it gets an O(n) rating. Eric’s makes it all the way up to O(1) by using a known key to target to the exact entry required, without having to scan a list to find it.
The preceding explanation was intentionally simplified for the sake of understanding. In the real world, there are a few other things worth noting that can affect how you analyze and score various functions.
First, t isn’t always time. Really, it’s more about resources, so it could be memory, disk space, network bandwidth, etc. Time is a useful aggregate, though, because it takes many of those factors into account. Of course, Eric’s O(1) approach is a bit heavy on memory usage, relying on the cache system to clean things up on that end to maintain speed.
That works right into the next point: performance isn’t always completely under your control. In Eric’s example, his function is O(1), but if he’s using a cache server that doesn’t implement an O(1) keymap, it may still be O(n) (looking through a big list) when actually deleting the cache key. Cache systems are built for speed, so I doubt this would be the case, but it’s definitely a consideration whenever your functions rely on other code that you don’t control.
On a related note, keep an open mind about n, because a number of things can be considered “items”. Going back to the music player example, the length of an individual song can affect how long it takes to seek to a specific position within that track, as can the distance from the start that you’re trying to seek to. Basically, anything that can get bigger can be a candidate for n. Just don’t start mixing them. Take them one at a time to analyze performance properly.
Also, note that in the first faulty example I mentioned, calling it O(n^3) is a bit misleading, given my description of how O() could be defined. Because it doesn’t loop over the whole list three times, it would really be somewhere between n and n^2 for most cases, if you just look at the total time it takes on various datasets. But the reason it’s still marked as n^3 is that the score isn’t really a measure of speed—or any other resources—in the end. It’s about the complexity, and even though that generally impacts speed, it’s an important distinction. By looping over three different lists, each of which is based on the size of the entire list, there’s the potential to get as bad as n^3 depending on the dataset, so it’s marked as such.
Likewise, you won’t find scores like O(n^2+3) for functions that have a one-time setup penalty that’s unaffected by the size of the dataset, even if the rest of the function is. It would just be marked as O(n^2) to describe the overall complexity. In fact, to my knowledge, the Big O notation is only ever used with squares. The Wikipedia article linked above suggests that other modifications like the +3 above would be written off as “negligible”.
I certainly plan to start looking at my functions with a new eye in light of my recent understanding of this topic, and I suggest you do the same. Sure, optimization shouldn’t always be your primary concern, but if it’s on your mind while you’re writing (or updating) a function, you run good odds of finding a reasonably simple way to boost its score.