Checkpoint 1 (max 20 pts)


Deadline: 15. 11. 2020

Goal

Create a semi-automated data pipeline that transforms the data sources from Checkpoint 0 into RDF. Each data source keeps its own, separate data schema. A crucial part of this checkpoint is the data quality check and, if necessary, data cleaning.

Deliverable

Details

Data quality check

Before setting up the pipeline, double-check that the data contain all the information you need, and that they contain it in full. Try to assess the credibility of the data. Who is the originator? Are the data an authoritative (reference) source? Are they up to date? Answering these (and other) questions will tell you a lot about how usable the data are for solving your problem.

Briefly describe the results of the quality check in the deliverable.

If the data do not pass the check, try to clean them or find other data of higher quality.
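
Part of such a check can be scripted. Below is a minimal sketch, assuming a hypothetical CSV source source.csv with a key column municipality_code; the file name, column name, and expected value format are illustrative only and should be replaced with those of your own data.

```python
# A minimal quality-check sketch; "source.csv" and "municipality_code"
# are hypothetical placeholders for one of your own data sources.
import pandas as pd

df = pd.read_csv("source.csv")

# Completeness: how many values are missing in each column?
print(df.isna().sum())

# Uniqueness: does the expected key actually identify the rows?
print(df["municipality_code"].duplicated().sum(), "duplicated key values")

# Validity: do key values match the expected format (here, a 6-digit code)?
invalid = df[~df["municipality_code"].astype(str).str.fullmatch(r"\d{6}")]
print(len(invalid), "rows with a malformed code")
```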

Data pipeline

The data pipeline is a process that extracts an RDF dataset out of each data source. There are multiple tools designed for this purpose; alternatively, you can write a simple script in any programming language you are familiar with (a minimal sketch of this option is shown below). However, in most cases, the following two alternatives should be sufficient:

The resulting RDF datasets should contain all data relevant for the integration task in Checkpoint 2. This means that all the data to be used during the integration must be produced at the latest in this step.
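
If you choose the scripting option, the conversion can be as simple as the following sketch using Python and rdflib. The file name, its columns code and name, the base IRI, and the choice of SKOS are illustrative assumptions, not a required vocabulary.

```python
# A minimal CSV-to-RDF sketch; "source.csv", its columns "code" and "name",
# the base IRI, and the SKOS vocabulary are illustrative assumptions.
import csv

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

EX = Namespace("https://example.org/municipality/")

g = Graph()
g.bind("skos", SKOS)

with open("source.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        subject = EX[row["code"]]  # mint an IRI from the key attribute
        g.add((subject, RDF.type, SKOS.Concept))
        g.add((subject, SKOS.prefLabel, Literal(row["name"], lang="cs")))

g.serialize(destination="source.ttl", format="turtle")
```

A script like this keeps the transformation repeatable, which is the point of a semi-automated pipeline.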

Data scheme

For every RDF dataset, create and deliver a UML diagram of its schema. Use the schemas to verify that you really have all the data you need for the data aggregation. In the deliverable, try to answer the following questions and justify the answers using the UML schemas (I will definitely ask about this during the defense):

  1. How are you going to get the answers to your questions? In which attributes of which datasets are they found?
  2. How are you going to aggregate the datasets? Which attributes will be used for the aggregation, and how is it going to be done? (A sketch below illustrates one way to verify this on the RDF data.)

Remember that the goal of the whole semestral work is to integrate the schemas of the datasets you have (integration on temporal/spatial extent alone is not enough). The purpose of the quality check is to catch unsuitable data early, although you may still discover low data quality only after the transformation to RDF. If you are unsure about the suitability of your datasets for the semestral work, please let us know, e.g. by email, or come and consult.
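
One way to justify the answer to question 2 is to verify, directly on the RDF outputs, that the chosen attributes really join. The sketch below assumes two hypothetical Turtle files and an illustrative ex: vocabulary with a shared code attribute; substitute your own files and properties.

```python
# A minimal join-check sketch; the file names, the ex: namespace, and the
# properties municipalityCode/code/name are illustrative assumptions.
from rdflib import Graph

g = Graph()
g.parse("accidents.ttl", format="turtle")
g.parse("municipalities.ttl", format="turtle")

# Join the two datasets on the shared attribute and count matches.
query = """
PREFIX ex: <https://example.org/>
SELECT ?name (COUNT(?accident) AS ?accidents)
WHERE {
  ?accident ex:municipalityCode ?code .
  ?municipality ex:code ?code ;
                ex:name ?name .
}
GROUP BY ?name
"""
for name, count in g.query(query):
    print(name, count)
```

If such a query returns no results, the datasets do not join on the chosen attributes, and you should reconsider the aggregation strategy before Checkpoint 2.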

Correct examples:

Incorrect example: