
Checkpoint 1 (max 20 pts)


Deadline 15. 11. 2020

Goal

Create a semi-automated data pipeline that transforms the data sources from Checkpoint 0 into RDF. Each data source will keep its own, separate data schema. A crucial part of the checkpoint is the data quality check and, if necessary, data cleaning.

Deliverable

  • source code for the data pipeline (Scrapy, SPARQL, s-pipes, OpenRefine, etc.) for creating RDF datasets out of data sources
  • the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in Checkpoint 0
  • a short description (a 1-2 page extension of the report from Checkpoint 0) of the data quality check and the data pipeline, including its limitations and benefits, together with a UML class diagram depicting the schema of each dataset.

Details

Data quality check

Before setting up a pipeline, double-check that the data contain all the information you need, and that they contain it in full. Try to check the credibility of the data: Who is its originator? Is it a reference (authoritative) source? Is it up-to-date? Answering these (and other) questions will tell you a lot about whether the data are usable for solving your problem.

Briefly describe the outputs of the quality check in the delivery.

If the data do not pass the check, try to clean them or find other data of higher quality.
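
For a tabular source, a quick completeness check can be scripted before any triplification. The following is a minimal sketch in Python; the file name "parking.csv" and its columns are hypothetical placeholders, not part of the assignment data.

  import csv
  from collections import Counter

  # Hypothetical input file; replace with one of your own data sources.
  missing = Counter()
  with open("parking.csv", newline="", encoding="utf-8") as f:
      rows = list(csv.DictReader(f))

  for row in rows:
      for column, value in row.items():
          if value is None or value.strip() == "":
              missing[column] += 1   # count empty cells per column

  print(f"{len(rows)} rows read")
  for column, count in missing.most_common():
      print(f"{column}: {count} missing values")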

Data pipeline

The data pipeline is a process that extracts an RDF dataset out of each data source. There are multiple tools designed for this purpose; alternatively, you can write a simple script in any programming language you are familiar with (see the sketch at the end of this section). However, in most cases one of the following two alternatives should be sufficient:

  • GraphDB (OpenRefine+SPARQL) for processing CSV files, triplifying them, and manipulating the resulting RDF,
  • Scrapy + GraphDB (OpenRefine+SPARQL) for scraping web pages, triplifying them, and manipulating the resulting RDF.

The resulting RDF datasets should contain all data relevant for the integration task in Checkpoint 2. This means that all the data to be used during the integration must be produced in this step at the latest.
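
If you go the scripting route instead of OpenRefine, the triplification step can look roughly like the sketch below (Python with rdflib). The file name, namespace, and property names are placeholders; adapt them to your own source and schema.

  import csv
  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import RDF, XSD

  EX = Namespace("http://example.org/ontology/")   # placeholder namespace

  g = Graph()
  g.bind("ex", EX)

  # Hypothetical CSV with columns "district" and "births"; adapt to your source.
  with open("births.csv", newline="", encoding="utf-8") as f:
      for row in csv.DictReader(f):
          district = EX[row["district"].replace(" ", "_")]
          g.add((district, RDF.type, EX.District))
          g.add((district, EX.numberOfBirths,
                 Literal(row["births"], datatype=XSD.integer)))

  g.serialize(destination="births.ttl", format="turtle")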

Data scheme

For every RDF dataset, create and deliver a UML class diagram of its schema. Use the schemas to check whether you really have all the data you need for the data aggregation. In the delivery, try to answer the following questions and justify the answers with the UML schemas (you will definitely be asked about them during the defense):

  1. How are you going to get the answers to your questions? In which attributes of which datasets can they be found?
  2. How are you going to aggregate the datasets? Which attributes will be used for the aggregation, and how will it be done (see the sketch below)?
Keep in mind that the goal of the whole semestral work is to integrate the schemas of the datasets you have (integration on a temporal/spatial extent alone is not enough). The purpose of the quality check is to catch unsuitable data early, although it may still happen that you discover low data quality only after the transformation to RDF. If you are still unsure about the suitability of your datasets for the semestral work, please let us know, e.g. by email, or come and consult.
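
To illustrate question 2: a schema-level integration typically joins the two RDF datasets on a shared attribute and derives a new value from them. The sketch below uses Python with rdflib; both Turtle files and all property names are hypothetical placeholders, not part of the assignment data.

  from rdflib import Graph

  g = Graph()
  g.parse("pensions.ttl", format="turtle")   # hypothetical dataset 1
  g.parse("reviews.ttl", format="turtle")    # hypothetical dataset 2

  # Join the two datasets on a shared year and derive a ratio.
  query = """
  PREFIX ex: <http://example.org/ontology/>
  SELECT ?year ((?pensions / ?reviews) AS ?ratio)
  WHERE {
    ?obs1 ex:year ?year ;
          ex:numberOfGrantedPensions ?pensions .
    ?obs2 ex:year ?year ;
          ex:numberOfReviews ?reviews .
  }
  """
  for row in g.query(query):
      print(row.year, row.ratio)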

Correct examples:

  • see tutorial 1 (Czech social security administration dataset integration). “Typ posudku - Invalidita - typ řízení zjišťovací” (type of assessment: disability, ascertainment proceedings) in one dataset determines “Počet nově přiznaných důchodů” (number of newly granted pensions) in another dataset. By integrating the two data properties (number of granted pensions and number of reviews of new pensions), new information arises: their ratio.
  • demographic statistics of the Statistical Office of the Slovak Republic (“Živonarodení v manželstve podle Veku matky”, live births within marriage by mother's age) vs. “Vybrané demografické údaje (1989-2017)” of the Czech Statistical Office (“Živě narozené děti podle věku matek při porodu”, live-born children by mother's age at birth). In CP2 you would decompose such categories (and find out that, in this case, the latter is a subset of the former).

Incorrect example:

  • A dataset about the number of parking places in Prague vs. a dataset about the number of births in Prague - they are related only along the geographical axis (e.g. they share the geospatial dimension - Prague districts), but are otherwise not connected.