Profiling in C/C++

Your own machine with Linux OS

This is probably the most comfortable option right now. You only need to install a few tools from your distribution's repository (perf, hotspot, libopencv, libboost). Mac users can complete this task as well: instead of perf and hotspot, you can use Instruments or CLion for profiling.

Remote access to our server with all required tools installed

Usually, all required tools can be accessed from the computers in the lab. This year, to support distance learning, we have installed the required software on our server, and there are two ways to work remotely with GUI applications:

Xpra-based access, which works over slower networks and from all operating systems. Additionally, it lets you reconnect to a running application if your computer disconnects for whatever reason.
X11 forwarding via SSH, which requires a fast, low-latency network connection and is well supported only on Unix-based systems.

Using Xpra

Xpra is an open-source multi-platform persistent remote display server and client for forwarding applications and desktop screens. It gives you remote access to individual applications or full desktops. It is available for Windows / Mac OS X / Linux.

First, run xpra_launcher from the command line:

xpra_launcher

Fill in the following configuration:

Mode: SSH
Server: <username>@ritchie.ciirc.cvut.cz:22
Server Password: <CVUT password>, or leave it empty if you have copied your public SSH key to the server first

Then connect to the server. You can save and load your configuration.

After a successful login, click the settings icon in the bottom right corner and then click Move. Move the window so that it is visible and press Default configuration. This should create the applications panel for you.

Computers in laboratory accessible over Xpra

If our server is overloaded, we will provide remote access to the lab computers; however, this is a bit more complicated.

Computers in laboratory (probably not this year)

All tools required for all tasks are installed in the Debian system available in the laboratory. To boot the right system, select “DCE PXE Menu → DCE Linux (first item)” in the boot menu. Log in to the system with your CTU username and KOS password. Your home directory can also be accessed remotely over ssh: “ssh username@postel.felk.cvut.cz”.

Task assignment

Imagine that you are a developer in a company which develops autonomous driving assistance systems. You are given an algorithm which doesn't run as fast as your manager wants. The algorithm finds ellipses in a given picture and will be used to detect the wheels of a neighbouring car while parking. Your task is to speed up this algorithm so that it runs smoothly.

You will probably need to do the following steps to achieve the desired speed-up:

Download the program from the git repository: git clone https://gitlab.fel.cvut.cz/esw/ellipses.git
Compile the program (simply run make in the ellipse directory)
Run the program (./find_ellipse -h for help; example images are included in the repository; press q to quit in GUI mode)
Do profiling (hints below)
Make changes which improve the speed (code and compiler optimizations)
Upload a patch file to the upload system and pass the specified limit. The patch will be applied using the git apply command; therefore, the best way to generate the patch is the git diff command:
```
git fetch origin && git diff origin/master > ellipse.diff.txt
```
(The .txt file extension is needed because of the upload system.)

You are not allowed to modify the number of iterations and other parameters of the RANSAC algorithm!

Program requirements – if you want to compile the program on your own machine, you will need the OpenCV library (libopencv-dev package) and the Boost library (libboost-all-dev package). If you don't want to install new libraries on your machine, you can connect to our server via ssh (ssh user@ritchie.ciirc.cvut.cz) and work remotely.

Basic profiling techniques

How can we evaluate the efficiency of our implementation? Run time gives only a coarse overview of the program. To find the hot spots of a program, finer-grained information is much more useful, such as the number of executed instructions, cache misses, or memory references attributed to individual lines of code.

Measuring execution time

The easiest way to analyze program performance is to measure its execution time. There are multiple ways to do this in a C/C++ program:

The C time library, specifically the clock function. Note that clock reports processor time used by the program rather than wall-clock time, and that on some systems (other than Linux) its resolution may not be as good as that of the options below.
More precise values can be obtained by using high_resolution_clock (or steady_clock) from the C++11 chrono library.
On Linux, you can directly use the clock_gettime function (and system call), which the above options use internally on Linux.
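The last two options can be sketched as follows. This is only a minimal illustration, not code from the assignment: sum_of_squares is a hypothetical stand-in for whatever you want to measure, and the helper names time_with_chrono and time_with_clock_gettime are made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <time.h>

// Hypothetical workload standing in for the code you want to measure.
std::uint64_t sum_of_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += i * i;
    }
    return total;
}

// Elapsed wall-clock time of one call in nanoseconds, using std::chrono.
// steady_clock is monotonic, so system clock adjustments do not affect it.
std::int64_t time_with_chrono(std::uint64_t n, std::uint64_t& result) {
    const auto start = std::chrono::steady_clock::now();
    result = sum_of_squares(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// The same measurement using the POSIX clock_gettime function directly.
std::int64_t time_with_clock_gettime(std::uint64_t n, std::uint64_t& result) {
    timespec start{};
    timespec stop{};
    int rc = clock_gettime(CLOCK_MONOTONIC, &start);
    assert(rc == 0);
    result = sum_of_squares(n);
    rc = clock_gettime(CLOCK_MONOTONIC, &stop);
    assert(rc == 0);
    (void)rc;  // silences the unused-variable warning when NDEBUG is defined
    return (stop.tv_sec - start.tv_sec) * 1000000000LL + (stop.tv_nsec - start.tv_nsec);
}
```

The result is returned through a reference so that the compiler cannot discard the measured call as dead code. For stable numbers, it is usually worth repeating the measurement several times and looking at the minimum or median.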

GProf

GProf is a GNU profiling tool based on statistical sampling (every 1 ms or 10 ms). It records the time spent in each function and constructs a call graph. The program must be compiled with a dedicated option (-pg for GCC), and all libraries you want to profile must be statically linked. When you then run the program, the profiling information is generated. Note that the resulting data is not exact. Shared libraries can be profiled with sprof.
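A typical GProf session might look like the sketch below. The functions sum, transform_sum and run_workload and the file name demo.cpp are hypothetical and exist only to give the profiler two functions with clearly different costs; the build and run commands are listed in the leading comment.

```cpp
// Assumed workflow (file and binary names are placeholders):
//   g++ -pg -g -O2 demo.cpp -o demo    (-pg adds the gprof instrumentation)
//   ./demo                             (writes gmon.out to the current directory)
//   gprof ./demo gmon.out              (prints the flat profile and the call graph)
// Functions that the compiler inlines do not show up as separate entries.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cheap function: a single pass over the data.
double sum(const std::vector<double>& v) {
    double total = 0.0;
    for (double x : v) {
        total += x;
    }
    return total;
}

// Expensive function: `repeats` passes, one square root per element.
double transform_sum(const std::vector<double>& v, int repeats) {
    assert(repeats >= 0);
    double total = 0.0;
    for (int r = 0; r < repeats; ++r) {
        for (double x : v) {
            total += std::sqrt(x);
        }
    }
    return total;
}

// Entry point of the workload; call it from main() with a large n and
// repeats so that the program runs long enough to collect samples.
double run_workload(std::size_t n, int repeats) {
    std::vector<double> data(n);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = static_cast<double>(i);
    }
    return sum(data) + transform_sum(data, repeats);
}
```

In the resulting flat profile you would expect transform_sum to dominate, which is the kind of signal to look for in the real program.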

Simulation using Cachegrind (Linux, Mac OS X)

Cachegrind is part of the Valgrind tool suite. It uses processor emulation to run the binary program and records all executed instructions and memory accesses, together with their relationship to source lines and functions in the program. The program can be linked with shared libraries and doesn't need to be recompiled to be simulated. However, you probably want to compile with debugging info (-g option) so that events are matched correctly to source code lines. In any case, simulation usually takes about 50 times longer than running on real hardware. Profiling data generated by Cachegrind and gprof can be visualised simply by opening the log file in KCachegrind.

http://valgrind.org/docs/manual/cg-manual.html

https://kcachegrind.github.io/

If you are also interested in the caller/callee relationships and the exact event counts attributed to function calls, you can use Callgrind, which extends Cachegrind with this functionality.
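To see what the cache simulation reports, you can try it on a small example such as the following sketch. The two functions are hypothetical and not part of the assignment: they compute the same sum over a matrix stored row by row, but the second one walks memory with a large stride, which should show up as more data cache misses on large matrices.

```cpp
// Assumed workflow (file and binary names are placeholders):
//   g++ -g -O1 demo.cpp -o demo
//   valgrind --tool=cachegrind --cache-sim=yes ./demo
//   cg_annotate cachegrind.out.<pid>
//   valgrind --tool=callgrind ./demo
//   kcachegrind callgrind.out.<pid>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row by row in a flat vector,
// visiting the elements in memory order (cache-friendly).
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

// Computes the same sum column by column, so consecutive accesses are
// `cols` elements apart (cache-unfriendly for large matrices).
long long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

Calling both functions from main() on a matrix that is much larger than the cache and comparing their per-line miss counts in cg_annotate or KCachegrind is a quick way to get familiar with the output.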

Profiling using perf

Most modern processors have performance counters that can count various hardware events (clock cycles, instructions executed, cache reads/hits/misses, etc.). Linux perf is able to analyze a program based on these counters.

You can also use any hardware event listed in the appropriate reference manual. For Intel processors, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B (Chapters 18 and 19), available at https://software.intel.com/en-us/articles/intel-sdm.

If you get rubbish in your perf report, try specifying the event explicitly in perf record (example: perf record -e cycles ./program). A call graph is also useful: perf record --call-graph dwarf -e cycles ./program.

A few examples of perf usage: http://www.brendangregg.com/perf.html

By default, using performance counters is not allowed without sudo privileges. You can enable access to the performance monitoring counters (PMC) for non-sudo users by running this command:

echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Hotspot

The perf tool has a GUI called Hotspot, which makes it easier to run the recording and to analyze and visualize the data. You can run it from an AppImage package (don't forget to make the file executable: chmod +x file) or build it yourself.

Handling performance counters directly from C/C++ program

If you are interested in performance counters, you can use them directly from your C/C++ program (without any external profiling tool). See the perf_event_open manual page, or use a helper library built on the kernel API (libpfm, PAPI toolkit, etc.).

http://man7.org/linux/man-pages/man2/perf_event_open.2.html

http://perfmon2.sourceforge.net

http://icl.cs.utk.edu/papi/index.html
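The sketch below follows the pattern from the perf_event_open manual page and counts user-space instructions around a piece of code. It is Linux-only, the workload function and the names sys_perf_event_open and count_instructions are hypothetical, and opening the counter can fail, for example when perf_event_paranoid is too restrictive or when running in a virtual machine without hardware counters.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// glibc provides no wrapper for perf_event_open, so it is called via syscall().
static long sys_perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

// Hypothetical workload standing in for the code you want to measure.
std::uint64_t workload(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += i;
    }
    return total;
}

// Runs workload(n) and returns the number of user-space instructions counted
// while it ran, or -1 if the counter could not be opened or read.
long long count_instructions(std::uint64_t n, std::uint64_t& result) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        // start disabled, enable explicitly below
    attr.exclude_kernel = 1;  // count user-space instructions only
    attr.exclude_hv = 1;

    // pid = 0 and cpu = -1: measure the calling thread on any CPU.
    const int fd = static_cast<int>(sys_perf_event_open(&attr, 0, -1, -1, 0));
    if (fd == -1) {
        result = workload(n);  // still run the workload, just without counting
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    result = workload(n);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    const bool ok = read(fd, &count, sizeof(count)) == static_cast<ssize_t>(sizeof(count));
    close(fd);
    return ok ? count : -1;
}
```

Other events can be selected by changing attr.type and attr.config. libpfm and PAPI hide most of this setup if you need many counters at once.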

Windows alternatives

We have no experience with Windows tools; however, there are a few free ones, for example those listed on this page:

https://wiki.qt.io/Profiling_and_Memory_Checking_Tools

MS Visual Studio has a profiler:

https://msdn.microsoft.com/en-us/library/mt210448.aspx

Other Windows profilers:

https://sourceforge.net/projects/lukestackwalker/

http://www.codersnotes.com/sleepy/

There is also a Windows alternative to KCachegrind, QCacheGrind:

https://sourceforge.net/projects/qcachegrindwin/

If you do not want to install any of these tools, you can work remotely on our server via ssh. PuTTY and Xming are your friends.

How to optimize execution time?

There are many ways to optimize your programs. You can

play with compiler optimizations (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html), e.g. the compiler is able to generate code tuned for your processor,
avoid dynamic memory allocations in computation (sometimes, you can reuse e.g. existing arrays),
try to precompute values, replace exact values by approximations, simplify equations if possible,
inline functions,
parallelize (not in this case),
…

Several tips for optimization: https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf
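Two of the points above, precomputing values and avoiding dynamic allocations, can be illustrated with the following sketch. It is not taken from the ellipse program: Point, AngleTable and sample_ellipse are hypothetical names, and the example only assumes that some loop repeatedly evaluates points on an axis-aligned ellipse at the same set of angles.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point {
    double x;
    double y;
};

// Precomputed cos/sin values for `steps` evenly spaced angles in [0, 2*pi).
// Built once and then shared by every ellipse that is sampled.
struct AngleTable {
    std::vector<double> cos_t;
    std::vector<double> sin_t;

    explicit AngleTable(std::size_t steps) : cos_t(steps), sin_t(steps) {
        assert(steps > 0);
        const double two_pi = 2.0 * std::acos(-1.0);
        for (std::size_t i = 0; i < steps; ++i) {
            const double angle = two_pi * static_cast<double>(i) / static_cast<double>(steps);
            cos_t[i] = std::cos(angle);
            sin_t[i] = std::sin(angle);
        }
    }
};

// Fills `out` with points on an axis-aligned ellipse centred at (cx, cy)
// with semi-axes a and b. The caller owns `out`, so its storage is reused
// across calls instead of allocating a new vector every time.
void sample_ellipse(const AngleTable& table, double cx, double cy,
                    double a, double b, std::vector<Point>& out) {
    const std::size_t n = table.cos_t.size();
    out.resize(n);  // reallocates only if the current capacity is too small
    for (std::size_t i = 0; i < n; ++i) {
        out[i].x = cx + a * table.cos_t[i];
        out[i].y = cy + b * table.sin_t[i];
    }
}
```

The inner loop now contains only multiplications and additions, and after the first call it performs no heap allocation. Whether such a change pays off in your program is something to confirm with the profiler rather than assume.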

Sample CMakeLists.txt for compilation in various IDEs

cmake_minimum_required(VERSION 2.8)

set(CMAKE_CXX_STANDARD 11)

set(CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS} -g -O0 -Wall")

project(find_ellipse)

find_package(OpenCV REQUIRED)
find_package(Boost 1.60 COMPONENTS filesystem REQUIRED )
include_directories( ${Boost_INCLUDE_DIR} )

aux_source_directory(. SRC_LIST)

add_executable(${PROJECT_NAME} ${SRC_LIST})

target_link_libraries(${PROJECT_NAME} ${OpenCV_LIBS})
target_link_libraries(${PROJECT_NAME} ${Boost_LIBRARIES})

g++ -g -O0 -Wall -std=c++11 -I/usr/include/opencv *.cpp -o find_ellipse -lboost_filesystem -lboost_system -lopencv_shape -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_aruco -lopencv_bgsegm -lopencv_bioinspired -lopencv_ccalib -lopencv_datasets -lopencv_dpm -lopencv_face -lopencv_freetype -lopencv_fuzzy -lopencv_hdf -lopencv_line_descriptor -lopencv_optflow -lopencv_video -lopencv_plot -lopencv_reg -lopencv_saliency -lopencv_stereo -lopencv_structured_light -lopencv_phase_unwrapping -lopencv_rgbd -lopencv_viz -lopencv_surface_matching -lopencv_text -lopencv_ximgproc -lopencv_calib3d -lopencv_features2d -lopencv_flann -lopencv_xobjdetect -lopencv_objdetect -lopencv_ml -lopencv_xphoto -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_photo -lopencv_imgproc -lopencv_core
