
Checkpoint 1 (max 20 pts)

Goal

Create a semi-automated data pipeline that transforms the data sources from Checkpoint 0 into RDF. Each data source keeps its own, separate data schema.

Deliverable

source code of the data pipeline (Scrapy, SPARQL, s-pipes, OpenRefine, etc.) that creates RDF datasets from the data sources
the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in Checkpoint 0
a short description (a 1-2 page extension of the report from Checkpoint 0) of the data pipeline, its limitations and benefits, together with a UML class diagram depicting the schema of each dataset
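When preparing the UML class diagram, it can help to list which properties actually occur on instances of each class in a produced dataset. The following is a minimal sketch of that idea, not a required part of the deliverable. It assumes the triples have already been loaded as plain (subject, predicate, object) string tuples; the function name `summarize_schema` and the `ex:` identifiers are hypothetical and the sample triples are made up for illustration.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def summarize_schema(triples):
    """Map each class to the sorted list of properties used on its instances."""
    types = defaultdict(set)  # subject -> classes it is typed with
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    schema = defaultdict(set)  # class -> properties seen on its instances
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Subjects without any rdf:type are grouped under a placeholder class.
        for cls in types.get(s, {"(untyped)"}):
            schema[cls].add(p)
    return {cls: sorted(props) for cls, props in sorted(schema.items())}


# Made-up sample triples (hypothetical identifiers, illustration only).
TRIPLES = [
    ("ex:row1", RDF_TYPE, "ex:Pension"),
    ("ex:row1", "ex:year", "2017"),
    ("ex:row1", "ex:count", "12"),
    ("ex:row2", RDF_TYPE, "ex:Pension"),
    ("ex:row2", "ex:region", "Praha"),
    ("ex:row3", "ex:note", "x"),
]

if __name__ == "__main__":
    for cls, props in summarize_schema(TRIPLES).items():
        print(cls, props)
```

Each key of the result corresponds to a candidate class box in the diagram, and each listed property to a candidate attribute or association. Whether a property is an attribute or an association still has to be decided by looking at its values.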

Details

the data pipeline should extract an RDF dataset from each data source
choose any tools you like (e.g. any programming language you are familiar with) to create the data pipeline. In most cases, however, one of the following two alternatives should be sufficient:
- GraphDB (OpenRefine+SPARQL) for processing CSV files, triplifying them, and manipulating the resulting RDF
- Scrapy + GraphDB (OpenRefine+SPARQL) for scraping web pages, triplifying them, and manipulating the resulting RDF
the resulting RDF datasets should contain all relevant data for the integration task in Checkpoint 2
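For those who prefer a programming language over OpenRefine, the triplification step for a CSV source can be sketched as follows. This is only an illustration under stated assumptions, using the Python standard library and emitting N-Triples lines. The namespace `http://example.org/dataset/births/`, the `schema/Row` class, the column names, and the sample values are all hypothetical and made up; every cell is emitted as a plain string literal, so datatypes would still need to be added (e.g. later with SPARQL).

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace for one data source; each source gets its own.
BASE = "http://example.org/dataset/births/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value: str) -> str:
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def triplify(csv_text: str, key_column: str) -> list[str]:
    """Turn each CSV row into one resource with one triple per non-empty cell."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}row/{quote(row[key_column], safe='')}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{BASE}schema/Row> .")
        for column, value in row.items():
            # Skip surplus cells, the key column itself, and empty cells.
            if column is None or column == key_column or not value:
                continue
            predicate = f"<{BASE}schema/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Made-up sample input (illustration only, not real statistics).
SAMPLE = "id,district,births\n1,Praha 1,310\n2,Praha 2,\n"

if __name__ == "__main__":
    print("\n".join(triplify(SAMPLE, "id")))
```

Because the namespace is specific to the source, each dataset keeps its own schema, as the Goal requires. The output can then be loaded into GraphDB and reshaped further with SPARQL.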

Important note about selection of datasets

Remember that the goal of the whole semestral work is to integrate the schemas of your datasets (integration on the temporal/spatial extent alone is not enough). Now that you understand your data better, if you are still unsure whether the datasets are suitable for the semestral work, please let us know, e.g. by email, or come for a consultation.

Correct examples:

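see tutorial 1 (Czech Social Security Administration dataset integration). “Typ posudku - Invalidita - typ řízení zjišťovací” (assessment type: disability, ascertainment procedure) in one dataset determines “Počet nově přiznaných důchodů” (number of newly granted pensions) in another dataset.
demografická statistika Štatistického úradu SR (demographic statistics of the Statistical Office of the Slovak Republic), category “Živonarodení v manželstve podle Veku matky” (live births within marriage by age of mother), vs. ČSÚ (Czech Statistical Office) “Vybrané demografické údaje (1989-2017)” (selected demographic data, 1989-2017), category “Živě narozené děti podle věku matek při porodu” (live-born children by age of mother at birth). In CP2 you would decompose such categories (and find out that the latter is a subset of the former in this case).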

Incorrect example:

A dataset about the number of parking places in Prague vs. a dataset about the number of births in Prague. They are related only by the geospatial axis (e.g. both are broken down by Prague districts); otherwise they are not connected.
