The Second Homework - Image Sharpening

Introduction

You were hired as part of the team developing a new image processing software (similar to Photoshop or Gimp). To get edge over existing programs and to attract professional users, your program is supposed to be super-fast. Your task is to develop one of the core operations - image convolution - as consequence your implementation must be as fast as possible. In simple terms the convolution replaces pixels in original image by weighted average of pixels in neighbourhood (averaging is done separately for each colour component).

To simplify and speed-up other operations, your software always keeps up-to-date brightness histogram for the image. Thus while you are applying convolution or immediately afterwards you have to recalculate brightness histogram.

Task

Your homework is to implement following convolution mask for sharpening the image. You are required to minimise time to process the image by utilising system caches in the most efficient way. The convolution mask for sharpening is following:

0	-1	0
-1	5	-1
0	-1	0

Pixels on the edges (first/last row and first/last column) are just copied from the original image. In other words if the convolution can not be computed due to mask being outside the image, the pixel is just copied.

The second part is to create a brightness histogram of output image and store values for the histogram into a file. For conversion from RGB to brightness use following formula:

Y = round(0.2126*R + 0.7152*G + 0.0722*B)

The bins for the histogram are defined by following table:

Bin:	values 0 to 50	values 51 to 101	values 102 to 152	values 153 to 203	values 204 to 255
Count:

Input Image

For simplicity, input image is in binary encoded portable pixmap format (PPM). Image in this file format always start with string P6 on the first line. On the next line there are two string values - width and height of the image in pixels. These values are separated by a white space (assume only space). On the next line there is maximal intensity recorded in the image (always 255). On the next line starts the pixel data - triplet of unsigned 8 bit integer for each pixel, each integer stands for red/green/blue component for this pixel in this order.

Two example images in PPM format follows:

vit_small.zip vit_normal.zip

The image file to process will be passed on command line (see parameters of main() function).

Program output

You will output two files from your program - a sharpened image in file named output.ppm and a histogram data in file output.txt. In histogram file, there will be values for individual bins (0-50, 51-102, …) separated by single space.

Evaluation Criteria and Points

The program you will submit will be evaluated based on

Number of accesses, cache hits and misses to L1 data cache
Number of accesses, cache hits and misses to L1 instruction cache
Number of accesses, cache hits and misses to L2 (common) cache

For evaluation purposes we will use following assumption on cache size: L1 data and instruction caches are 8-way set associative, 64 bytes per block of 32 KB each. L2 cache is 1MB in size, 16-way set associative with 64 bytes per block. In ideal case you program should be able to adapt to any cache parameters, but for homework it is OK to have these values fixed.

Points

1pt for submitting working program.
remaining 5 point will be awarded based on efficiency of your program. You can not get less than 0 points and more than 5 points here.
Final number of points is sum of above.

Points for efficiency will be awarded based on “Cost” of your program. Cost is calculated as follows:

Cost = AMAT_i*I_refs + AMAT_d*D_refs,

where

AMAT_i is average access time needed to read all the instructions: AMAT_i = HT_L1i + MR_L1i*(HT_L2 + MR_L2*HT_RAM)
I_refs is number of instructions read from instruction cache (memory)
HT_L1i = HT_L1d is time required (latency) to read data/instruction from L1 instruction/data cache - 0.6ns = 2 clock cycles.
HT_L2 is time needed (latency) to read data/instruction from L2 cache - 6.0ns = 20 clock cycles.
HT_RAM is time needed (latency) to read data/instruction from main memory - 143ns (63ns + 80ns) = 210 clock cycles + 80ns.
AMAT_d = HT_L1d + MR_L1d*(HT_L2 + MR_L2*HT_RAM)
D_refs is number of data read/written to data cache (memory)
Clock cycles above are calculated for CPU with 3.3 GHz frequency.

Give cost, your points will be determined based on following formula:

Efficiency_pts = 5*(1 - (Cost_Max - Cost_Min)/Cost)
where
Cost_Max=6
Cost_Min=1.4

Efficiency_pts will be limited to range 0 to 5 and rounded to nearest upper multiply of 0.25.

Way to submit your homework

The program has to be written as portable C or C + + code without use of external libraries. Input/output picture format is selected such way that reading and writing can be implemented in a few lines of C code. Code can be optimized and use Intel x86 architecture extensions to achieve better memory access pattern. Upload system compiles and run tests in a similar GNU/Linux operating system environment as is used during course seminaries.

To submit your homework, you have to use upload system: http://cw.felk.cvut.cz/upload/

You have to submit your source file (named main.c or main.cpp if you use C++ instead of plain C) and the source file has to be compressed using ZIP. All these requirements are due to limitations of upload system and you really have to put all your sources into one file.

To compile your program implemented in C, we will use following command:

gcc -mssse3 -g -O1 -Wall -Werror -std=c99 -lm -o main main.c

To compile your program implemented in C++, we will use following command:

g++ -mssse3 -g -O1 -Wall -Werror -lm -o main main.cpp

All homework will be tested for cheating (all cheating students will treated equally, even original authors).

Please note that we have strict rules for output.txt file (it will be compared character by character). It has to contain exactly five number separated by single space. After the last number, there must be no character (even no “new line”).

Deadline will be on April 22, 2016 at 11:29 PM CET.

Hints

Cost of your solution is fairly well reflected in the time needed to finish the convolution (the faster program means lower cost). To measure the time needed to finish your implementation, you can use C function called gettimeofday (see man gettimeofday). See the example below:

#include <unistd.h>
#include <sys/time.h>
...
struct timeval start, stop;
gettimeofday(&start, NULL);
...
gettimeofday(&stop, NULL);
double accum = ( stop.tv_sec - start.tv_sec )*1000.0 + ( stop.tv_nsec - start.tv_nsec )/ 1000000.0;
printf( "Time: %.6lf ms\n", accum );

Do not forget to link library rt during compilation by adding -lrt to the compiler command line (i.e. g++ main.cpp -g -lrt).

You can use cachegrind tool to evaluate how your program utilise the CPU caches. This tool outputs number of cache misses and cache hits during execution of your program. By default the cachegrind uses cache configuration of your computer. To set other parameters you have to specify them via command line arguments (–I1, –D1, –LL).

To run the cachegrind with the same parameters that will be used to evaluate your points, use following command line:

valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=1048576,16,64 ./a.out

You can also get more detailed analysis of your program's behaviour by using cg_annotate.

cg_annotate <filename> ~/hw2/main.c

where <filename> is path to the file with detailed output generated from cachegrind, please use absolute path (starting with / or ~) to your source code.

To verify output of your program, you can use this Testing Picture with source and transformed image and histogram.

Table of Contents