Micro-optimization is stupid

I tend to frequent the website stackoverflow.com. It’s a fantastic website; it allows knowledge to be shared in a unique way. The only problem is that some people have no idea what they are talking about. If enough people agree with these misguided notions, the incorrect answers get up-votes, and the cycle of misinformation repeats. It isn’t too dissimilar from the various kinds of incorrect information floating around about 32-bit machines and 4GB of RAM. This isn’t a problem with stackoverflow.com; it’s a problem with people who claim things to be the case without actually testing whether what they are saying is true. It happens time and time again.

You may be wondering why the title of this blog entry is “Micro-optimization is stupid” when I haven’t even mentioned that yet. Well, wait no more, here it is. Someone had asked a question on SO which had a simple answer: he was trying to insert a pointer to an object into a list which contained actual objects. I am guessing he is usually a java programmer (this isn’t an attack on java programmers, it’s just that java’s syntax for creating references to objects looks very much like the broken c++ he was trying to use). Someone took it upon themselves to not only give advice on how to correct this trivial error, but also to give some optimization advice for iterating over the list. He recommended something like this:

    for(std::list<Point>::iterator it = l.begin(), end = l.end(); it != end; ++it) {
        /* whatever */
    }

as opposed to this:

    for(std::list<Point>::iterator it = l.begin(); it != l.end(); ++it) {
        /* whatever */
    }

The reality is that there isn’t a huge amount of difference between the two. The goal of the poster’s code is to avoid calling std::list::end() after every iteration of the loop. I felt compelled to point out to the poster that the optimization recommendation was misguided, since any sane compiler will nicely optimize that loop so that std::list::end() is effectively only called once. In addition, std::list::end() is typically an O(1) operation, so even if it were called multiple times, it wouldn’t make a difference. You’re trading one memory fetch for another; no real difference.

Of course he argued; people who don’t know what they are talking about usually do :-P. He started claiming that since the compiler doesn’t know that the size is constant, it has to recalculate the size each time, and that this extra function call would cause cache misses and such. See the problem here? There is enough techno-babble to sound convincing! But don’t worry folks, I was determined to demonstrate my point, so I responded with a few corrections. Most importantly, size() != end(), so even if it did need to recalculate size() each time, the size has nothing to do with it. In addition, if the loop body only uses member functions which are const, the compiler certainly can know that the list hasn’t changed. (He of course retorted that while the end()/size() function is const, its result is not, which is just silliness: end() won’t change the list, so if all other functions don’t change the list, then end() will be the same next iteration!) Finally, I sent him a link to some code which looks like this:

    // program 1
    #include <list>
    #include <cstdio>

    int main() {
        std::list<int> l1;

        std::list<int>::iterator end = l1.end();
        for(std::list<int>::iterator it = l1.begin(); it != end; ++it) {
            printf("%d\n", *it);
        }
    }

    // program 2
    #include <list>
    #include <cstdio>

    int main() {
        std::list<int> l1;

        for(std::list<int>::iterator it = l1.begin(); it != l1.end(); ++it) {
            printf("%d\n", *it);
        }
    }

And I asked that he compile them both with g++ -O3 and look at the results, rather than just assuming that what some other jackass told him was correct. I had already done this, and knew that the results on the latest g++ are identical. At this point I had a good feeling, knowing I had prevailed… until he responded with this:

GCC 4.01 with g++ -O3 -S produces different assembly outputs and different binaries with vectors and lists. GCC 4.2 produces identical output for lists but different output for vectors. The OP uses a vector, my comment stands.

Seriously?! This guy just couldn’t admit to being even slightly wrong. To be clear, the OP used a list, not a vector, so at this point he was just making things up without taking the time to read. In addition, I checked, and the “difference” between the two was more or less cosmetic: the same work done in a slightly different order (but not repeatedly calling end(), which was the original concern). Finally, justice came and people started to down-vote his answer. He edited his post to remove the incorrect advice he gave, and the down-votes were removed. None of this would have happened if he hadn’t gotten fixated on silly little micro-optimizations.

When will people learn? Tricks like this may have made a difference in the early days of c++, but modern compilers are surprisingly good at optimization. Let them do their job so that you can do your job! At this point in time, every competent developer should know that it is best to do things the simplest, most obvious way first. This is better in many ways, most importantly in maintainability: everyone can agree that simple and obvious code is easier to understand. Then, if after profiling you find that a particular segment of code is not performant enough, feel free to go bananas with optimizing the little things in that particular section.

6 thoughts on “Micro-optimization is stupid”

  1. WRS

    After reading your post my first thought was, yeah you’re right but why not do it that jerk’s way in case it does matter. In fact I thought I remembered running into an instance where it did matter… map I think it was. So unlike the friend in our story above, I decided to code up the two examples and run them through gprof so I could see exactly how many times end() was called in each.

    Long story short, they were EXACTLY the same when using optimization. Without optimization, there was a penalty for calling end() each time… it is not an O(1) but an O(lg n) operation.

    All-in-all, Evan I only have one issue with the blog… you need to post more often. As always, good stuff!

  2. Evan Teran Post author

    Indeed without optimizations the compiler does the obvious thing which is terribly slow. In fact, much of the STL is often slow when there are no optimizations. Fortunately though current c++ compilers are pretty damn impressive with optimizations 🙂

  3. Chris

    I prefer this:

    std::list<Point>::iterator it = l.begin();
    const std::list<Point>::iterator end = l.end();
    while (it != end) { /* whatever */ it++; }

    I don’t do this much, but it is possible too:

    void PrintSomeItems(std::list<Point>::iterator &it, const std::list<Point>::iterator &end) {
        while (it != end) { /* whatever */ it++; }
    }

    PrintSomeItems(it, end);

    This function can be templated to accept a vector too for more awesomeness.

  4. Kevin

    There is an excessively tiny difference:

    std::list::end will return something from the std::list, it is inlined and in
    GNU world it looks like this:

    { return iterator(&this->_M_impl._M_node); }

    and underneath, without looking too closely, the type iterator is basically a
    wrapper over a pointer, so no big deal.

    I still have not gotten to the difference yet: if the std::list object is passed by
    pointer or reference, then the value looked up from std::list::end() is not
    on the stack, whereas

        std::list::iterator i, end;

    are on the stack. So in _theory_ what the poster said could avoid a cache miss.
    Because of side effects, the compiler cannot cache the return value of std::list::end
    unless the contents of the for-loop contain no function calls (or all function calls
    are inlined functions which themselves follow that rule and do not touch the
    std::list, but even then that is putting a lot of faith in the optimizing compiler).
    But even then, the memory location where the value of std::list::end is located will
    be put into the cache anyway after the first call, so unless the loop body does enough
    jazz to push it out of the cache, there won’t be a cache miss anyway.

    For std::vector, the story is actually the same: internally the end of the container is stored,
    not the size, so repeated calls to std::vector::end are the same story, and you _might_ get
    a cache miss in doing it, with the same jazz as noted for std::list::end.

    Lastly, the boost macro BOOST_FOREACH also saves the begin and end iterators,
    so with such loop constructs it is bad luck to add/remove elements from the
    container. Generally speaking, one usually cannot go wrong in emulating the behavior
    of boost.

  5. Kloster

    You mention Java programmers, but that’s the way most C++ programmers look to C programmers (who themselves may look bad in the eyes of asm programmers). And, by “programmer” I don’t mean those doing R&D by merely exercising their cut & paste abilities.

    Optimizing requires knowing how things work – in reality. And the more layers of abstraction you have (like the huge Java or C++ libraries), the less you can be sure of what is going on (implementation bugs or enhancements, compiler version, runtime library, OS kernels, etc., not to mention hypervisors).

    Being an old dog (35 years of coding), I remember the days when people were beating (poorly designed) asm code with (near-optimal) portable C code – those days are gone for real because of the complexity of the ever-changing underlying layers… and the fascination for “modern” technologies (Java, C++, Python, etc.) that make it impossible to predict how your code will run on your customers’ machines.

    Surprisingly, optimized code still rules – when done right: in the absence of certainty about how things really work in the floors below, write code which is as concise as possible… once generated as machine code. The idea is that the less a computer will have to do the faster the job will be done. Think CPU caches, cache synchronization, etc. (small is beautiful).

    That does not mean using huge STL runtime libraries. That means: do it all yourself when it is needed.

    Anybody telling another story is simply missing the point.

  6. Evan Teran Post author

    @Kloster, I agree with the majority of what you say. But there are some aspects which I think you get wrong.

    1. I was not slighting Java programmers at all. I was simply referencing how his syntax indicated that he typically developed in Java and was trying to apply what he knew (incorrectly) to C++.

    2. Comparing the difference between Java and C++ to the difference between C++ and C is a pretty large exaggeration. Anything you can write in C you can write in C++, pretty much exactly the same way. Additionally, with the exception of RTTI and exceptions, there is no “runtime” cost to using a feature of C++ (classes are just structs and functions with different syntax). C++ can be just as “bare metal” as C if you do it right.

    3. About abstraction layers. Sometimes but not always, they can actually make things more efficient. Let’s look at an example… std::swap in C++.

    Certainly this is an abstraction, but let’s look a little closer. It’s an abstraction that indicates purpose. That’s an important detail, because it allows library writers to put special sauce in there. Perhaps the library writers will specialize it so that when you are swapping ints, it will call a compiler intrinsic which boils down to simply doing an x86 xchg instruction between two registers. While this can be done in C, it’s easier in C++ because C++ allows specialization of a function based on the types given.

    If that’s done, std::swap will be just as fast as fine tuned assembly, but has much more expressiveness because it’s called swap. You can look at code like this: std::swap(a, b) and know with complete confidence what it is doing. Lower level languages lose this ability because of the lack of abstraction. So we end up with code that is more clear, easier to maintain and just as fast. Sounds like a “Win,Win,Win” :-).

    4. Finally, I don’t know if you misspoke, but when you say “That does not mean using huge STL runtime libraries,” that’s a very misleading statement. The STL is not a “runtime” library, and has no “runtime” cost at all beyond the cost of the algorithms themselves — literally zero. Its size is irrelevant because you just use what you want; the parts you don’t use have no cost at all, and the parts you do use have a small compile-time cost (which I would wager is less than the time cost of hand-rolling ASM).

    5 (bonus point). C++11 offers a new abstraction which avoids this issue in the first place. Now he can write:

        for(Point &point : l) { /* some code here */ }

    This is yet another example of how abstractions can result in code that is clearer and easier to read and write while sacrificing no speed at all.

    Don’t get me wrong. I love optimizing code. I love it when people spend the time to tune things as much as they can. But the reality is that modern compilers are extraordinarily good at optimizing languages like C and C++, to the point of being competitive with hand-written ASM.

    The moral of the story is the same as my original post: profile, then you’ll know where to spend your time, regardless of what language you are developing in.
