» VTune and KDE
Fri, 09/09/2011 - 21:09
Hey all,
it’s been some time since I last blogged. My TODO list keeps growing and I have taken up my day job at KDAB again. Among other things, I attended a marketing talk by Edmund Preiss. He actually made that marketing talk interesting, not least thanks to his huge knowledge of the business from ~20 years of working for Intel. Probably the most important info I got out of it is this:
VTune is available free-of-charge under a non-commercial license
Yes, you heard right. Take these links:
Intel’s non-commercial offering
note this entry from the FAQ:
What does noncommercial mean?
Non-commercial means that you are not getting compensated in any form for the products and/or services you develop using these Intel® Software Products.-
you’ll need the serial number that gets sent to you via email after registering for the license
install VTune and profile the hell out of KDE/FOSS software and improve it all!
speeding up KDevelop
Personally I did the latter for KDevelop over the last two days, and the results are astonishing. I just tested today’s state, and an unscientific time kdevelop -s lotsofprojects (wait until parsing has finished, then stop) showed a roughly 50% decrease in time, from ~12min to ~6min. Yes, a whopping 50%. Try it out for yourself and see how big the gain is. Don’t forget to wipe the DUChain cache though (i.e. by setting the environment variable CLEAR_DUCHAIN_DIR=1).
Why VTune rocks
I’m a huge fan of the Valgrind tool suite, but it is simply too slow for profiling some things, like opening ten medium to big sized projects in KDevelop and taking a look at the parsing speed. This can easily take a few minutes, but in Valgrind it would take ages. With VTune on the other hand, thanks to its sampling based approach, I don’t really notice the performance delay.
Then you might have heard of the new perf profiling utility in the Linux kernel. It is also sampling based, but sadly requires special compile options on 64 bit (-fno-omit-frame-pointer), and the UI is horrible. I haven’t found anything worthwhile with it so far…
VTune on the other hand has an incredible GUI, which makes profiling a joy. You can look at call stacks top-down or bottom-up, visualize locks and waits, easily find hotspots, … I’m blown away. Especially the utilities to look at multi threaded performance (of e.g. KDevelop) beat every single other performance tool I have ever tested. Oh and did I mention that you can attach to an app at runtime, analyze something, and detach again?
Seriously, Intel: You just found a new fan boy in me. Thanks for giving this tool away for free for us “I hack on this tool in my spare time, yet still want it to perform nicely” people :) And kudos to the VTune developers - I’m blown away by it!
I really hope more people in the KDE community will try out VTune and try to improve the performance of our apps, I bet there is lots of potential!
Pitfalls
There are some negative aspects to VTune though: First of all its UI sometimes freezes. I wonder whether the developers should maybe spend some time on analyzing the tool itself ;-)
The biggest gripe though is that VTune does not work everywhere. I tried to run it on my Arch box, but sadly Linux 3.0 is not supported by VTune yet. It worked like a charm on two Ubuntu boxes with some 2.6.X kernel though.
I also have no idea if, and how, VTune works on non-Intel CPUs. I think some of it works nicely. I did not install any of the kernel modules, for example, which would be required for hardcore low-level CPU profiling. I think the same feature set I praised so much above should hence be available on e.g. AMD CPUs. But well, this is left to be tested.
So, I’m now drinking a well deserved beer and looking positively into the future of a fast KDevelop/KDE :)
bye
» Should all callgrind bottlenecks be optimized?
Thu, 12/09/2010 - 19:12
Hey all,
I’d like to have some feedback from you. Consider this code:
    #include <iostream>
    #include <cstring>

    using namespace std;

    struct List {
        List(int size) {
            begin = new int[size];
            // memset() counts bytes, not elements
            memset(begin, 0, size * sizeof(int));
            end = begin + size;
        }
        ~List() { delete[] begin; }
        int at(int i) const { return begin[i]; }
        int size() const {
            // std::cout << "size called" << std::endl;
            return end - begin;
        }
        int& operator[](int i) { return begin[i]; }
    private:
        int* begin;
        int* end;
    };

    int main() {
        const int s = 1000000;
        for (int reps = 0; reps < 1000; ++reps) {
            List l(s);
            List l2(s);
            // version 1
            for ( int i = 0; i < l.size(); ++i ) {
            // version 2
            // for ( int i = 0, c = l.size(); i < c; ++i ) {
                l2[i] = l.at(i);
            }
        }
        return 0;
    }
If you run this through callgrind, you’ll see quite some time being spent in l.size(); the compiler doesn’t seem to optimize that away. Now, fixing this “bottleneck” is simple, look at version 2. That way, l.size() will only be called once and you’ll save quite some instructions according to callgrind.
Now, my first impression was: Yes, let’s fix this! On the other hand, this optimization is not really that noticeable in terms of user experience. So my question is: Is it worth it? Should everything one sees in callgrind that is easily avoidable and optimizable (like the stuff above) be optimized?
I ask because QTextEngine, for example, doesn’t use the optimized version and I wonder whether I should create a merge request for that. According to callgrind the difference is noticeable: One of my testcases shows ~8% of the time being spent in QVector<QScriptItem>::size() (via QTextEngine::setBoundary()). In Kate the difference is even bigger with ~16% of the time being spent in QList<QTextLayout::FormatRange>::size() via QTextEngine::format(). Hence I’d say: yes, let’s optimize that. I just wonder whether it’s noticeable in the end.
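To make the difference between the two loop forms visible without a profiler, here is a minimal sketch that simply counts the calls to size(). CountingList, copyVersion1 and copyVersion2 are names made up for this illustration, they are not part of the code above. Also keep in mind that a call count is not a cost: an optimizing compiler may well inline such a trivial size(), so take this as a picture of what callgrind counts, not as a benchmark.

```cpp
#include <vector>

// Hypothetical stand-in for List above: it counts how often size() is called.
struct CountingList {
    explicit CountingList(int size) : data(size, 0) {}

    int at(int i) const { return data[i]; }
    int& operator[](int i) { return data[i]; }

    int size() const {
        ++sizeCalls; // bookkeeping only, so the two loops can be compared
        return static_cast<int>(data.size());
    }

    mutable long sizeCalls = 0;

private:
    std::vector<int> data;
};

// version 1: size() is evaluated on every check of the loop condition
long copyVersion1(const CountingList& from, CountingList& to) {
    from.sizeCalls = 0;
    for (int i = 0; i < from.size(); ++i) {
        to[i] = from.at(i);
    }
    return from.sizeCalls;
}

// version 2: size() is evaluated once and cached in c
long copyVersion2(const CountingList& from, CountingList& to) {
    from.sizeCalls = 0;
    for (int i = 0, c = from.size(); i < c; ++i) {
        to[i] = from.at(i);
    }
    return from.sizeCalls;
}
```

For two lists of n elements, copyVersion1 reports n + 1 calls (one per element plus the final failing check), while copyVersion2 reports exactly one.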
Bye
EDIT: See this comment thread for an answer.
» Profiling Rocks - KDevelop CMake Support now 20x faster
Wed, 03/31/2010 - 01:24
I just need to get this out quickly:
We were aware that KDevelop’s CMake support was slow. Too slow actually. It was profiled months ago and after a quick look that turned up QRegExp, it was discarded in fear of having to rewrite the whole parser properly, without using QRegExp. Which btw. is still a good idea of course.
But well, today I felt like I should do some more tinkering. I mean I managed to optimize KDevelop’s Cpp support recently (parsing Boost’s huge generated template headers, e.g. vector200.hpp, is now 30% faster). I managed to make KGraphViewer usable for huge callgraphs I produce in Massif Visualizer. So how hard could it be to make KDevelop’s CMake at least /a bit/ faster, eh?
Yeah well, an hour and two commits later, I managed to find and fix two bottlenecks. Both were related to QRegExp. Neither was the actual parser, instead it was the part that evaluated CMake files, esp. the STRING(...) function. So even if we’d used a proper parser generator, this would still have been slow.
The first one was the typical “don’t reinvent the wheel” kinda commit which already made the CMake support two times faster for projects that used FindQt4.cmake, i.e. any Qt or KDE project. Not bad, right? Well, while I fixed that I saw that KDevelop tried to do some regular expression replacement on the output of qmake --help. This could not be right, could it? With the help of Andreas and Aleix we found the bug in the parser and that made the CMake support 10 times faster.
So yeah, CMake projects using Qt or KDE should now get opened a whopping 20 times faster in KDevelop :)
I really love KCacheGrind and Valgrind’s callgrind - again it proved to be the most awesome tool one can imagine! If you are interested in the callgrind files:
Note: with KCacheGrind from trunk you can open these compressed files transparently :)
» Massif Visualizer - now with user interaction
Sat, 03/13/2010 - 16:55
Just a quick status update: Massif Visualizer now reacts to user input. Meaning: You can click on the graph and the corresponding item in the treeview gets selected and vice versa. It’s a bit buggy since KDChart is not reliable in what it reports, but it works quite well already.
Furthermore the colors should be better now, peaks are labeled (better readable on bright color schemes, I’m afraid to say…), legend is shown, …
Now let’s see how I can make the treeview more useful!
» Transparent loading of compressed Callgrind files in KCacheGrind
Thu, 03/11/2010 - 23:43
Hey everyone!
I just committed an (imo) insanely useful feature for KCacheGrind: Transparent loading of compressed Callgrind files. Finally one does not have to keep those Callgrind files around uncompressed, hogging up lots of space. And what is even more important: It’s much easier to share these files now, as you can send or upload them as .gz or better yet .bz2 and open them directly. KDE architecture just rocks :) So in KDE 4.5 the best profiling visualizer just got better :D
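To give an idea of what “transparent” can mean in practice: a loader can peek at the first few bytes of a file and decide from there how to read it. The sketch below does that with the well-known gzip (0x1f 0x8b) and bzip2 (“BZh”) signatures. Compression and detectCompression are names invented for this example, and this is only an assumption about how one could do it, not the way KCacheGrind or the KDE libraries actually implement it.

```cpp
#include <string>

// Hypothetical names, for illustration only.
enum class Compression { None, Gzip, Bzip2 };

// Guess the compression format from the leading bytes of a file.
Compression detectCompression(const std::string& head) {
    // gzip streams start with the two bytes 0x1f 0x8b
    if (head.size() >= 2
        && static_cast<unsigned char>(head[0]) == 0x1f
        && static_cast<unsigned char>(head[1]) == 0x8b) {
        return Compression::Gzip;
    }
    // bzip2 streams start with the ASCII characters "BZh"
    if (head.compare(0, 3, "BZh") == 0) {
        return Compression::Bzip2;
    }
    // anything else is treated as plain, uncompressed data
    return Compression::None;
}
```

In a real loader one would read the first handful of bytes with e.g. std::ifstream in binary mode, pass them to detectCompression, and then pick the matching decompressor before handing the data to the parser.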
In related news: I’m currently spending my time as an intern at KDAB creating an application to visualize Massif data. If you are interested, check the sources out on gitorious: http://gitorious.org/massif-visualizer
It’s still pretty limited in what it offers, yet is probably already more useful than the plain ASCII graph that ms_print generates:
This is very WIP but the visuals are somewhat working now. I plan to make the whole graph react to user input, i.e. zoomable, click to show details about snapshots, show information about the heap items that make up the stacked part of the diagram, …
Also very high on my wish list is some kind of interaction with the KCacheGrind libraries, to reuse its nice features like callgraphs, cost maps, etc. pp. you name it :) All these features that make KCacheGrind such an insanely useful application.
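For the curious, the data behind such a graph is fairly easy to get at. The following is a rough sketch of how one could pull the per-snapshot heap size out of a Massif output file, assuming the plain key=value layout (snapshot=, time=, mem_heap_B=) found in massif.out files. Snapshot and parseSnapshots are hypothetical names for this example and not taken from the Massif Visualizer sources. The heap_tree details are ignored completely.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type for this sketch: one point of the memory graph.
struct Snapshot {
    long long time = 0;      // value of the "time=" line
    long long heapBytes = 0; // value of the "mem_heap_B=" line
};

// Collect time and heap size for every snapshot found in the stream.
std::vector<Snapshot> parseSnapshots(std::istream& in) {
    std::vector<Snapshot> snapshots;
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 9, "snapshot=") == 0) {
            // a new snapshot block begins
            snapshots.push_back(Snapshot());
        } else if (snapshots.empty()) {
            // header lines (desc:, cmd:, time_unit:) before the first snapshot
            continue;
        } else if (line.compare(0, 5, "time=") == 0) {
            snapshots.back().time = std::stoll(line.substr(5));
        } else if (line.compare(0, 11, "mem_heap_B=") == 0) {
            snapshots.back().heapBytes = std::stoll(line.substr(11));
        }
    }
    return snapshots;
}
```

Feeding it a std::ifstream opened on a massif.out file yields one (time, heapBytes) pair per snapshot, which is about the minimum needed to draw the total-heap curve. The stacked per-allocation areas would additionally require parsing the heap_tree entries.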
Oh and remember: Never do performance optimizations without checking the facts first ;-)

