Deadline 15. 11. 2020
Create a semi-automated data pipeline that transforms the data sources from Checkpoint 0 into RDF. Each data source keeps its own, separate data schema. A crucial part of this checkpoint is the data quality check and, where necessary, data cleaning.
Before setting up the pipeline, double-check that the data contain all the information you need and that they contain it in full. Also try to assess the credibility of the data. Who is the originator? Are the data authoritative (reference) data? Are they up to date? Answering these (and other) questions will tell you a lot about how usable the data are for solving your problem.
Briefly describe the results of the quality check in the delivery.
If the data do not pass the check, try to clean them or find other data of higher quality.
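For a tabular source, even a few lines of scripting can surface most quality problems. The following is a minimal sketch only, assuming a hypothetical CSV file data/source.csv and the pandas library; the column names are placeholders, not part of the assignment.

```python
# Minimal data-quality check for one tabular source.
# The file name and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("data/source.csv")

print("rows:", len(df))
print("duplicate rows:", int(df.duplicated().sum()))
print("missing values per column:")
print(df.isna().sum())

# Example of a simple domain check on an assumed 'year' column.
if "year" in df.columns:
    out_of_range = df[(df["year"] < 1900) | (df["year"] > 2020)]
    print("records with a suspicious year:", len(out_of_range))
```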
The data pipeline is a process that extracts an RDF dataset from each data source. There are multiple tools designed for this purpose. Alternatively, it is possible to write a simple script in any programming language you are familiar with (see the conversion sketch below). However, in most cases the following two alternatives should be sufficient:
The resulting RDF datasets should contain all data relevant to the integration task in Checkpoint 2. This means that all data to be used during the integration must be produced in this step at the latest.
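If you go the scripting route, the conversion itself can be quite short. The sketch below uses Python and rdflib and assumes the same hypothetical CSV source as above; the namespace, file names, and property names are illustrative only, and each source should keep its own schema as required by this checkpoint.

```python
# Minimal CSV-to-RDF conversion sketch (hypothetical source, namespace, and properties).
import csv
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("https://example.org/source-a/")   # assumed per-source namespace
g = Graph()
g.bind("ex", EX)

with open("data/source.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        subject = EX[f"record/{row['id']}"]                   # assumed 'id' column
        g.add((subject, RDF.type, EX.Record))
        g.add((subject, EX.name, Literal(row["name"])))       # assumed 'name' column
        g.add((subject, EX.year, Literal(row["year"], datatype=XSD.gYear)))

g.serialize(destination="source-a.ttl", format="turtle")      # one RDF dataset per source
```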
For every RDF dataset, create and deliver a UML diagram of its schema. Use the schemas to check whether you really have all the data you need for the integration; a small query sketch supporting this check is given at the end of this section. In the delivery, try to answer the following questions and justify the answers using the UML schemas (I will definitely ask about this during the defense):
Correct examples:
Incorrect example:
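Aside from the examples above, one way to cross-check a UML schema against the generated data is to query which classes and properties actually occur in the dataset. The sketch below assumes rdflib and the hypothetical source-a.ttl file from the conversion sketch; it is a sanity check, not a required deliverable.

```python
# List the classes and properties that actually occur in a generated dataset
# (assumes the hypothetical source-a.ttl produced by the conversion sketch above).
from rdflib import Graph

g = Graph()
g.parse("source-a.ttl", format="turtle")

classes = {row.c for row in g.query("SELECT DISTINCT ?c WHERE { ?s a ?c }")}
properties = {row.p for row in g.query("SELECT DISTINCT ?p WHERE { ?s ?p ?o }")}

print("classes used:", *sorted(classes), sep="\n  ")
print("properties used:", *sorted(properties), sep="\n  ")
```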