===== Checkpoint 1 (max 20 pts) =====

----

==== Goal ====

Create a semi-automated data pipeline that transforms the data sources from [[cp0|Checkpoint 0]] into RDF. Each data source keeps its own, separate data schema.

==== Deliverables ====

  * source code of the data pipeline ([[https://scrapy.org/|Scrapy]], [[https://www.w3.org/TR/sparql11-query/|SPARQL]], s-pipes, [[http://openrefine.org/|OpenRefine]], etc.) for creating RDF datasets out of the data sources
  * the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in [[cp0|Checkpoint 0]]
  * a short description (a 1-2 page extension of the report from [[cp0|Checkpoint 0]]) of the data pipeline, its limitations and benefits, together with a UML class diagram depicting the schema of each dataset

==== Details ====

  * the data pipeline should extract an RDF dataset out of each data source
  * choose any tools you like (e.g. any programming language you are familiar with) to create the data pipeline; in most cases, however, one of the following two alternatives should be sufficient (a minimal example of the triplification step is sketched after this list):
    * GraphDB (OpenRefine + SPARQL) for processing CSV files, triplifying them, and manipulating the resulting RDF
    * [[https://scrapy.org/|Scrapy]] + GraphDB (OpenRefine + SPARQL) for scraping web pages, triplifying them, and manipulating the resulting RDF
  * the resulting RDF datasets should contain all relevant data for the integration task in [[cp2|Checkpoint 2]]
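To illustrate what the triplification step may look like, here is a minimal Python sketch using the [[https://rdflib.readthedocs.io/|rdflib]] library (a scripted alternative to the GraphDB/OpenRefine route). The input file ''museums.csv'', its ''id''/''name''/''city'' columns, and the ''ex:'' namespace are hypothetical placeholders, not part of the assignment.

<code python>
import csv

from rdflib import RDF, RDFS, Graph, Literal, Namespace

# Illustrative namespace for this dataset's own schema; each data source
# keeps a separate schema, so each pipeline would use its own namespace.
EX = Namespace("http://example.org/museums#")

g = Graph()
g.bind("ex", EX)

# Hypothetical input: a CSV file with 'id', 'name' and 'city' columns.
with open("museums.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        museum = EX[f"museum-{row['id']}"]
        g.add((museum, RDF.type, EX.Museum))
        g.add((museum, RDFS.label, Literal(row["name"])))
        g.add((museum, EX.city, Literal(row["city"])))

# The triplified data can then be manipulated with SPARQL: here an
# in-process CONSTRUCT query; in GraphDB the same query would be issued
# against the repository instead.
derived = g.query("""
    PREFIX ex: <http://example.org/museums#>
    CONSTRUCT { ?m ex:locatedIn ?city . }
    WHERE     { ?m a ex:Museum ; ex:city ?city . }
""")
for triple in derived:
    g.add(triple)

g.serialize(destination="museums.ttl", format="turtle")
</code>

The same pattern extends to the scraping alternative: a Scrapy spider yields one dictionary per scraped item, and the triplification loop above consumes those dictionaries instead of CSV rows.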