Warning
This page is located in archive. Go to the latest version of this course pages. Go the latest version of this page.

HW 05 - Data searching and processing

Your task is to implement a program that will search a database of people and their personal data. Each person and their personal details are considered as one database entry, and as it is a very simple database, each entry is a structured text file (see overview below). Thus, your program should search through those files and process the read entries in the following way.

  • [A] Calculate the average salary of the processed entries.
  • [B] Each entry contains the date of birth of a person, so you should calculate a histogram of months of birth.
  • [C] The program will search for a person that has an n-th birthday on a provided day and month. The first found database entry should be printed.

For more details, see details below.

Your program should use the producer-consumer scheme for the database query by utilizing tools from the library pthread.h. In your main thread, you should create two pools of threads. The first pool of threads should represent the producers. The producers are responsible for reading the database content and providing it to the consumers through a memory buffer of the size given by a macro BUFFER_SIZE. When the buffer is full, the producers should wait for it to have available space. The second pool of threads should represent the consumers that should read the entries from the buffer and analyse them. When the buffer is empty, the consumers should wait for an entry to appear. The number of producers is provided by the macro PRODUCER_COUNT, and the number of consumers is provided by the macro CONSUMER_COUNT.

When there are no files left to be read, the producer threads should terminate cleanly and notify the consumers about the event, so they will not expect more data in a buffer. When any consumer finds the sought database entry (see point C above), it should notify all other threads about it, and they should stop and terminate cleanly. The consumer thread that found the entry should provide the entry to the main thread of your program, which should print it and then terminate cleanly. In case there is no such entry found that would satisfy the requirements (set in point C above), the consumer threads terminate cleanly, and the main thread is notified about the fact that no entry was found. Then, the main thread prints on the standard error output a message No entries found!.

Data processing details

Each personal entry contains the field Salary. The consumers should be able to collectively calculate the average salary of all the persons’ entries processed by them so far. Here, we are interested in the field Date of birth with the structure YYYY-MM-DD (see overview below). The histogram (table with the number of occurrences of particular values) will be formed from the month part. The consumers should be able to collectively create the histogram. The reference n-th birthday will be the first command line argument, and the day and month for the birthday search will be the second command line argument in the form of DD-MM with a dash separating the fields. E.g. ./main 40 23-02 to search for a person who will be 40 years old this year (2022) on 23rd February. Once the first entry with the personal details fulfilling this query is found, all created threads (producers and consumers) should stop and terminate correctly while releasing their unnecessary resources. The values calculated in points A, B and C should be given to the main thread of the program, which will print them on the standard output.

Technical requirements

The threads should always handle the resources in a decently safe manner, thus always closing the opened files and deallocating all dynamic memory. Create a Makefile for compiling your program. Separate your program into modules with safe headers. Functions should be commented on in English about what they do, their inputs and outputs. The main program should then print the calculated values as follows:

#Average salary: <33333>
#Month histogram:
+-----+-----+------+-----+-----+-----+-----+-----+-----+-------+-----+-----+
| Jan | Feb |  Mar | Apr | May | Jun | Jul | Aug | Sep |   Oct | Nov | Dec |
+-----+-----+------+-----+-----+-----+-----+-----+-----+-------+-----+-----+
|   0 | 456 | 1000 |  88 |   0 |   1 |  75 |  64 |   3 | 94687 |   4 |   0 |
+-----+-----+------+-----+-----+-----+-----+-----+-----+-------+-----+-----+
#Found entry:
Name: <John Doe>
Address: <Street 123, City, Country>
Phone: <123456789>
Date of birth: <YYYY-MM-DD>
Occupation: <company, position>
Salary: <40000>
The print formatting of the histogram is like the above, as the columns are months, and the row is the value associated with the month. The table should have a header with the abbreviation of the months. The number and the months should be aligned to the right with a single space from the right between the value/word and the | symbol. The values and words should be left-padded with spaces if needed; see the example output. Be careful also about the position of + symbol in the table borders.

Database overview

The database has the form of a large number of simple text files. The files are located in a directory given by the macro DATABASE_DIR, and the number of files is given by the macro DATABASE_SIZE. Each contains exactly one database entry of a form:

Name: <John Doe>
Address: <Street 123, City, Country>
Phone: <123 456 789>
Date of birth: <YYYY-MM-DD>
Occupation: <company, position>
Salary: <40000>
Each line is terminated by the new line character. Name entries are always single words. The address is a combination of three parts separated by commas. Phone entry is a sequence of nine digits. The date of birth has a predefined format with a dash as a separator. Occupation entry is a combination of two parts separated by a comma.

All the files have the same name format, which is their sequence number starting with one, and it is left-padded with zeros, e.g. file names 313.dat and 002.dat for a database with the size 513.

You can find sample database here.

Evaluation

The maximum number of points is 25.

  • 20th December - feedback deadline
    • Your submission uploaded by this date will be evaluated and given feedback.
  • 2nd January - soft deadline (max 25 pts)
  • 6th January - hard deadline (max 7.5 pts)
  • bonus - analysis of execution time with respect to the number of producers and consumers (max 5 pts)
    • Submit a plain text file with you analysis.
courses/be5b99cpl/hw/hw05.txt · Last modified: 2022/12/21 15:21 by ulricji1