Should all callgrind bottlenecks be optimized?

Hey all,

I’d like to have some feedback from you. Consider this code:

    #include <iostream>
    #include <memory.h>
    using namespace std;
    struct List {
        List(int size) {
            begin = new int[size];
            memset(begin, 0, size);
            end = begin + size;
        }
        ~List() {
            delete[] begin;
        }
        int at(int i) const {
            return begin[i];
        }
        int size() const {
    //         std::cout << "size called" << std::endl;
            return end - begin;
        }
        int& operator[](int i) {
            return begin[i];
        }
        int* begin;
        int* end;
    };
    int main() {
        const int s = 1000000;
        for (int reps = 0; reps < 1000; ++reps) {
            List l(s);
            List l2(s);
            // version 1
            for ( int i = 0; i < l.size(); ++i ) {
            // version 2
    //         for ( int i = 0, c = l.size(); i < c; ++i ) {
                l2[i] = l.at(i);
            }
        }
        return 0;
    }

If you run this through callgrind, you’ll see quite some time being spent in l.size(); the compiler doesn’t seem to optimize that away. Now, fixing this “bottleneck” is simple: look at version 2. That way, l.size() will only be called once and you’ll save quite some instructions according to callgrind.

Now, my first impression was: yes, let’s fix this! On the other hand, this optimization is not really that noticeable in terms of user experience. So my question is: is it worth it? Should everything one sees in callgrind that is easily avoidable and optimizable (like the code above) be optimized?

I ask because QTextEngine, for example, doesn’t use the optimized version and I wonder whether I should create a merge request for that. According to callgrind the difference is noticeable: one of my testcases shows ~8% of the time being spent in QVector<QScriptItem>::size() (via QTextEngine::setBoundary()). In Kate the difference is even bigger, with ~16% of the time being spent in QList<QTextLayout::FormatRange>::size() via QTextEngine::format(). Hence I’d say: yes, let’s optimize that. I just wonder whether it’s noticeable in the end.


EDIT: See this comment thread for an answer.

Attachment | Size
kate.qtextengine.callgrind.bz2 | 1.02 MB
qtcreator.qtextengine.callgrind.bz2 | 286.25 KB


Want to comment? Send me an email!

Comment by Anonymous (not verified) (2011-01-31 16:24:00)

There are two kinds of debug builds: optimized and not optimized. The first is RelWithDebInfo, the other Debugfull, IIRC. Non-optimized builds are needed when debugging with gdb, as otherwise you will see funny jumps back and forth in the code instead of linear progress from one line to the next.

How is somebody who cannot read optimized code output in gdb allowed to code in something like C++? It’s like giving a baby a loaded weapon with a hair trigger.

(There are cases where it makes sense to also check non-optimized compiler output, but if one cannot easily debug one’s code in compiled form by deducing when gdb is showing inlined content and when shown values are bogus because the compiler didn’t need to retain the variable anymore, one should really use a language suitable for one’s level of competence. E.g. Ruby & Python cannot be much slower than non-optimized GCC output.)

Comment by Matt Vogt (not verified) (2010-12-13 05:27:00)

I believe you should optimize this, but not necessarily because of the optimization benefit. There are two slightly different iteration patterns you’re comparing here, and all programmers should be able to read the difference between them inside ‘for’ constructs: a) find the bounds of iteration, then loop; b) loop over a sequence, testing for termination.

The example you have chosen trivializes the difference somewhat. It would be more obvious if you used the following instead:

    for (iterator it = something.begin(); it != something.end(); ++it) {
    for (iterator it = something.begin(), end = something.end(); it != end; ++it) {

The second is very different from the first if the loop potentially mutates the ‘something’ object as it operates.

Given that these are different operations, it should not be encouraged to think of one form as having a readability penalty over the other; both are valid, and the difference between them should be recognised.

Comment by Andras Mantia (not verified) (2010-12-11 21:53:00)

I almost always optimize this kind of code. That is, I write the code in a way that caches the result. This is also recommended for the end iterator if you iterate in a loop (for a somewhat different reason, but in the end it boils down to the same kind of code). See also… .

Comment by TheBlackCat (not verified) (2010-12-10 19:31:00)

I have two thoughts on this:

  1. If it is easy and quick to do, doesn’t hurt code readability or maintainability, is good coding practice, and makes a potentially noticeable benefit, why wouldn’t you change it? Is there some reason changing it would be a bad thing? If there are reasons to change it, and no reasons not to, the decision seems clear to me.

  2. It may not make a big difference on its own, but a lot of small changes like this would likely add up to big changes overall. So even if this change on its own doesn’t have a very noticeable performance impact, having a consistent policy of making changes like this probably would.

Comment by Milian Wolff (2010-12-10 20:28:00)

As was investigated below: there is no difference in release mode after all. So the change would only benefit those (mostly developers) running un-optimized builds of the code.

Comment by Anonymous (not verified) (2010-12-10 11:05:00)

In my last experiment with callgrind on another KDE SC component, using this little (but very clear) optimization showed an improvement of only 1%. But there are over 400 instances of this code. Therefore, in the best scenario, it would show a 40% speed gain, and probably, in a more realistic scenario, only a 10% speed gain.

Also, let’s not forget the impact on the CPU cache, which makes it even more noticeable.

I say yes. It is not hard to do, clean, and a good practice in the Java world (according to PMD).

Comment by scroogie (not verified) (2010-12-10 08:47:00)

I always try to avoid calculations and function calls in loop conditions. Of course premature optimization is a risk, but this is so simple and doesn’t influence readability that it’s okay in my book.

Comment by Kevin Kofler (not verified) (2010-12-10 01:04:00)

I always use the optimized version, or some variant thereof. It’s really bad to call size() in each loop iteration when you can call it only once.

Comment by Dakon (not verified) (2010-12-09 21:39:00)

Just for the record: your constructor is wrong. It must be

    memset(begin, 0, size * sizeof(*begin));

Otherwise only the first size bytes get initialized, not size elements.

Comment by Slava (not verified) (2010-12-09 21:36:00)

The code posted could simply be called “bad practice”, because the current implementation does more work than it needs to. An easy optimization like this makes code faster and arguably easier to read/understand. So I would say keeping code “clean” in that sense helps maintain the code base in a better state (sometimes even if it doesn’t show up in the profiler). Algorithm- or hardware-specific optimizations are a totally different story, however.

Comment by Thomas McGuire (not verified) (2010-12-09 21:21:00)

I wouldn’t change this in all loops, but if it turns out to be a bottleneck that is visible to the user, then sure, go for it and make it faster!

@apol: QList::size() is defined in the header file, so a compiler can inline it. And once inlined, depending on the situation, the compiler might deduce that the size never changes, and thus do the optimization itself. I didn’t test whether the compiler actually does that in this case, though. At least in Milian’s example code, I would be surprised if the compiler did not optimize the call to size() away in release mode.

Comment by Milian Wolff (2010-12-09 21:36:00)

See below, according to -fdump-tree-optimized it doesn’t optimize that. And this:

… but if it shows to be a bottleneck that is visible to the user, …

is exactly why I blogged about this to begin with. Is it noticeable? Who knows? According to callgrind it is. The question is how much of that is visible to the user…

Comment by Robert Knight (not verified) (2010-12-09 20:47:00)

Hello - are you using optimized builds (-O2) of Qt, KDE and your application?

Comment by Milian Wolff (2010-12-09 21:34:00)

No, because you wouldn’t see the function in that case (since debug symbols are stripped and it’s inlined). The basic problem persists though; just compare the output of g++ -fdump-tree-optimized test.cpp using the file above. One version will call size() on every iteration, the other won’t.

Comment by JMN (not verified) (2010-12-09 22:02:00)

Still, if you compile using -O2, the generated assembly looks pretty optimized to me:

        xor  eax, eax   # i
        mov  edx, DWORD PTR [ebx+eax*4]    # tmp79,* D.21429
        mov  DWORD PTR [edi+eax*4], edx   #* D.21441, tmp79
        add  eax, 1   # i,
        cmp  eax, 1000000  # i,
        jne  .L4   #,
Comment by Christoph Bartoschek (not verified) (2010-12-09 22:00:00)

You cannot compare the functions if you do not optimize. There are two basic rules for using callgrind:

  1. Only run callgrind on optimized code (-O2 or -O3)
  2. Leave the debug symbols in the executable to see meaningful lines (-g).

For GCC, -g and -O2 are orthogonal. The code does not get slower by using -g; the option just adds debug symbols for the debugger or for tools like callgrind. It does not harm runtime.

Especially in C++ it makes absolutely no sense to profile without optimizations turned on.

Comment by Milian Wolff (2010-12-10 17:44:00)

I didn’t know you could use both debug symbols and optimization, thanks for that. I’ve compiled it this way and ran it through callgrind again, and indeed, both versions have the same performance.

This in turn raises the question: considering most developers use debug builds (not necessarily for Qt, but definitely for my own projects like KDevelop/Kate), wouldn’t it help to fix the code there? Sure, normal users won’t see a difference, but it should be noticeable for developers, right?

So the question can be rephrased: Is it ok to optimize for developers? ;-)

I see how this is probably not acceptable for Qt, which is usually built in release mode. I’ll keep it in mind for future code I write though. Thanks!

Comment by Robert (not verified) (2010-12-15 01:44:00)

Hi Milian,

If you wanted to truly optimize your loops, you shouldn’t use them at all! Each step needs a conditional branch (has the end of the loop been reached?), which breaks the parallel pipelining of your CPU’s instruction flow, afaik. So would it be optimal to unroll the loop? Of course not, because of readability. The developers don’t want a fast dev-build, they want to read your code!

I think best practice is to let the compiler do all the optimization and keep your own code simple. Semantics > optimization (in the cases the compiler can handle for you).

Grüße, Robert

Comment by Christoph Bartoschek (not verified) (2010-12-11 13:27:00)

In my opinion one should not optimize specially for developers. I would only use optimizations that are also useful for release builds. In all other cases the optimization target is code clarity. In this case, for example, you create code that does not follow the convention and needs slightly more brain power to fully understand.

There is also a potential bug hiding in this scheme. As soon as someone copies such a loop and deletes/adds items from/into the container, it breaks. In my opinion this danger weighs more than the slight optimization.

Comment by mat69 (not verified) (2010-12-10 19:28:00)


Stuff shouldn’t be added just to make the debug build faster. Debug builds aren’t supposed to be fast; if they turn out to be fast, that is nice, but it’s not a requirement.

What you’d end up with is implementing all the optimisations the compiler does on its own, and thereby making the compiler’s job harder. Many of the optimisation “tricks” for C++ and C should not be used anymore, as compilers do them a lot better.

I suppose in most programs there is a lot to improve speed-wise that would help both builds, release or debug. Be it by occasionally (only where it really improves things) using QStringBuilder at the library level (note some platforms can’t compile that then, IIRC), caching things, using threads (e.g. via QtConcurrent), and so on. Another easy way to speed things up a little is to use “QHash::find”: not constantly “myHash[aString]”, but instead once “it = myHash.find(aString)”. Yet that is only really useful if you’d call “myHash[aString]” a lot.

Still, imo the first target of optimisation should be finding bottlenecks. Don’t just optimise small things; in many cases that would only make the code uglier and cost you a lot of time while not improving anything noticeable. A bottleneck could be multiple operations that use a lot of CPU power, or operations waiting for IO (reading a file, sending lots of stuff via DBus, …). You can find the former via callgrind, but not the latter.

PS: In fact, that is just my opinion.

Comment by Milian Wolff (2010-12-10 20:38:00)

Well, I definitely don’t want to introduce micro-optimizations like shifting instead of multiplying or multiplying instead of dividing. But looping like the above is done quite a lot, and at least future code should imo be written in the “also-optimized-in-debug-builds” version if it is called that often. In QTextEngine this is the case. Looping over ten items won’t make a difference, of course. And generally, most of the time you’d use foreach anyway.

And I find it interesting that you mention Hash[]: couldn’t the compiler optimize that as well, looking the value up only once? Anyway, probably my next attempt :)

And to your last paragraph about bottlenecks: I only found the case above by using callgrind to find actual bottlenecks. 13% of the time spent in a size() call is imo just that: a bottleneck with an easy fix. And I know from my time working on KDevelop how hard it is to find real bottlenecks that are hidden behind system calls or thread locking ;-)

Thanks for this very useful input from you.

Comment by TheBlackCat (not verified) (2010-12-10 19:43:00)

Does this mean that debug builds of KDE applications do not currently have compiler optimizations turned on? I am not sure, but I was under the impression a lot of distributions use debug builds. I may be totally misunderstanding, though.

Comment by Andras Mantia (not verified) (2010-12-11 21:56:00)

There are two kinds of debug builds: optimized and not optimized. The first is RelWithDebInfo, the other Debugfull, IIRC. Non-optimized builds are needed when debugging with gdb, as otherwise you will see funny jumps back and forth in the code instead of linear progress from one line to the next.

Comment by Milian Wolff (2010-12-10 20:39:00)

They are usually built with RelWithDebInfo, which is afaik optimization including debug symbols (see above).

Comment by apol (not verified) (2010-12-09 20:44:00)

There’s no way the compiler can optimize that. You should be able to modify the result of size in the body of the loop, so this value can’t be cached blindly.

If it’s such a bottleneck I guess it’s fine to optimize as long as the semantic doesn’t change.

Comment by Milian Wolff (2010-12-09 21:35:00)

The compiler could, theoretically, optimize this, because I only call const methods on List l in the loop. It doesn’t, though.

Comment by Marcel (not verified) (2010-12-10 11:19:00)

I’d even say the compiler should optimize this. That said, if it really does not optimize with -O3 (some above argue it actually does), I propose: let’s not change the code, but improve the compiler. (In principle, of course. In practice, you can improve the code.)

Btw, it’s the point of const methods to allow for such optimizations. There are only very few legitimate uses for const_cast, usually lazy caching. If you use it to introduce real changes, well, you can also write static_cast<MyClass*>(0)->foo() in legal C++.

Comment by Slava (not verified) (2010-12-09 21:37:00)

what about const_cast ?

Comment by TNorthover (not verified) (2010-12-09 22:02:00)

I suspect all bets are off once you’ve called const_cast (and it’s your own fault rather than the compiler’s).

The path I see this taking in an optimizing compiler is: a couple of inlinings (::at and ::size) followed by some optimizations that hoist constants out of loops (partial redundancy elimination or common subexpression elimination, and something else).

It’s all completely reasonable, and just depends on what optimizations have been implemented in your compiler (and what order they’re executed in).

Comment by Anonymous (not verified) (2010-12-09 21:28:00)

Really? What would happen if someone removed an element from l after entering the loop for the last iteration?

Comment by TNorthover (not verified) (2010-12-09 22:05:00)

Then you couldn’t optimize. Fortunately inlining List::at would prove that’s impossible in this case with some reasonable mental gymnastics.

Comment by David (GNU/Linux Supporter) (not verified) (2010-12-09 20:26:00)

I would say yes, and emphatically so. My justification would be that each instance of such a situation as you describe might make little difference to the user experience, but, compounded with other instances, a noticeable effect may well exist. Sticking to a best practice of keeping all code as well optimised as possible minimises such interaction, and should a snippet of code be reused elsewhere, the derivative has clean code to work from.

An interesting topic and thank you for raising it.

Comment by Gunni (not verified) (2010-12-10 00:34:00)

That’s exactly my opinion. If every really small optimization is made, it may add up to something noticeable. So everything that can be optimized should be, if the fruit hangs as low as this one.