
Profiling in C/C++

Using computers in the laboratory

All tools required for the tasks are installed in the Debian system available in the laboratory. To boot the proper system, select “DCE PXE → Debian stretch 4.9” in the boot menu. Log in with your CTU username and KOS password. Your home directories can be accessed remotely via ssh: “ssh username@postel.felk.cvut.cz”.

Task assignment

Imagine that you are a developer in a company which develops autonomous driving assistance systems. You are given an algorithm which doesn't run as fast as your manager wants. The algorithm finds ellipses in a given picture and will be used to detect the wheels of a neighbouring car while parking. Your task is to speed up this algorithm so that it runs smoothly.

You will probably need to do the following steps to achieve the desired speedup:

  1. Download the program from the git repository: git clone https://gitlab.fel.cvut.cz/matejjoe/ellipses.git
  2. Compile the program (simply run make in the ellipse directory)
  3. Run the program (./find_ellipse -h for help; example images are attached in the repository; press q to quit in GUI mode)
  4. Profile the program (hints below)
  5. Make changes which improve the speed (code & compiler optimizations)
  6. Upload a patch file into the upload system and pass the specified limit. The patch will be applied using the git apply command; therefore, the best way to generate it is the git diff command:
    git fetch origin && git diff origin/master > ellipse.diff.txt
    (the txt file type is required by the upload system).

You are not allowed to modify the number of iterations or other parameters of the RANSAC algorithm!

Program requirements – if you want to compile the program on your own machine, you will need the OpenCV library (libopencv-dev package) and the Boost library (libboost-all-dev package). If you don't want to install new libraries on your machine, you can connect to our server via ssh (ssh user@postel.felk.cvut.cz) and work remotely.

Basic profiling techniques

How can we evaluate the efficiency of our implementation? Run time gives a simple overview of the program. However, much more useful are finer-grained kinds of information, such as the number of executed instructions, cache misses, or memory references attributed to individual lines of code, which help us find the hot spots of the program.

Measuring execution time

The easiest program analysis is time measurement, which can be done using the C time library. More precise values can be obtained with high_resolution_clock from the chrono library (C++11) or the Linux function clock_gettime (man clock_gettime).





Profiling using GProf (Linux)

GProf is a GCC profiling tool based on statistical sampling (every 1 ms or 10 ms). It collects the time spent in each function and constructs a call graph. The program has to be compiled with a particular option (-pg), and all libraries which you want to profile have to be linked statically. Running the program then generates the profiling information. Note that the resulting data are not exact. Shared library profiling can be done with sprof (man sprof).
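A typical gprof session might look like the following sketch (the program name myprog is illustrative):

```shell
# Compile and link with profiling instrumentation
gcc -pg -g -o myprog myprog.c

# Run the program -- this writes gmon.out into the current directory
./myprog

# Print the flat profile and the call graph
gprof ./myprog gmon.out > report.txt
```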



Simulation using Cachegrind (Linux, Mac OS X)

Cachegrind is part of the Valgrind simulation tool. It uses processor emulation to run the binary program and records all executed instructions and memory accesses, together with their relationship to source lines and functions in the program. The program can be linked against shared libraries and doesn't need to be recompiled to be simulated. However, you will probably want to compile with debugging info (-g option) so that events are correctly matched to source code lines. In any case, simulation usually takes about 50 times longer than running on real hardware. Profiling data generated by Cachegrind and gprof can be visualised simply by opening the log file in kcachegrind.



If you are also interested in the relationships between function calls and the exact event counts spent in them, you can use Callgrind, which extends Cachegrind with this functionality.

Profiling using perf

Most modern processors contain performance counters, which can count various hardware events (clock cycles, executed instructions, cache reads/hits/misses, etc.). Linux perf is able to analyze a program using these counters.
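For instance, perf stat prints overall counter values for a whole run, while perf record/report attributes samples to individual functions (the program invocation is illustrative):

```shell
# Overall counts of selected hardware events
perf stat -e cycles,instructions,cache-misses ./find_ellipse image.png

# Sample the program and browse per-function results interactively
perf record ./find_ellipse image.png
perf report
```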


Moreover, you can use any of the hardware events listed in the proper reference manual. For Intel processors – Intel® 64 and IA-32 Architectures Software Developer's Manual: Volume 3B (Chapters 18 and 19) – available from https://software.intel.com/en-us/articles/intel-sdm.

If you get rubbish in perf report, try specifying the event in perf record (example: perf record -e cycles ./program). A call graph is also useful (perf record --call-graph dwarf ./program).

A few examples of perf usage: http://www.brendangregg.com/perf.html


Hotspot, the Linux perf GUI for performance analysis, offers a UI around Linux perf. You can download an AppImage from https://github.com/KDAB/hotspot/releases (don't forget to set permissions to run it – chmod +x file) or build it yourself (https://github.com/KDAB/hotspot).

Handling performance counters directly from C/C++ program

If you are interested in the performance counters, you can use them directly from your C/C++ program (without any external profiling tool). See the perf_event_open manual page, or use a helper library built on the kernel API (libpfm, the PAPI toolkit, etc.).




Windows alternatives

We have no experience with Windows tools; however, there are a few free tools, for example those listed on this page:


MS Visual Studio has a profiler:


Other Windows profilers:

https://sourceforge.net/projects/lukestackwalker/
http://www.codersnotes.com/sleepy/

Also, a Windows alternative to KCacheGrind – QCacheGrind:


If you do not want to install any of these tools, you can work remotely on our server via ssh. PuTTY and Xming are your friends.

How to optimize execution time?

There are many ways to optimize your programs; you can, for example, improve the algorithm and data structures, improve memory access patterns, or use compiler optimizations.

Several tips for optimization: https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf
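As a starting point, checking the compiler optimization level often pays off. A quick comparison might look like this (the source and image file names are illustrative):

```shell
# Build without and with optimizations, then compare run times
g++ -O0 -o find_ellipse_slow main.cpp
g++ -O2 -o find_ellipse_fast main.cpp
time ./find_ellipse_slow image.png
time ./find_ellipse_fast image.png
```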

courses/b4m36esw/labs/lab01.txt · Last modified: 2018/03/06 16:19 by matejjoe