
Checkpoint 1 (max 20 pts)


Goal

Create a semi-automated data pipeline that transforms the data sources from Checkpoint 0 into RDF. Each data source keeps its own, separate data schema.

Deliverables

  • source code of the data pipeline (Scrapy, SPARQL, s-pipes, OpenRefine, etc.) that creates RDF datasets out of the data sources
  • the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in Checkpoint 0
  • a short description (a 1-2 page extension of the report from Checkpoint 0) of the data pipeline, its benefits and limitations, together with a UML class diagram depicting the schema of each dataset.

Details

  • the data pipeline should extract an RDF dataset out of each data source
  • choose any tools you like (e.g. any programming language you are familiar with) to create the data pipeline. However, in most cases one of the following two alternatives should be sufficient:
    • GraphDB (OpenRefine+SPARQL) for processing CSV files, triplifying them, and manipulating the resulting RDF (see the first sketch below)
    • Scrapy + GraphDB (OpenRefine+SPARQL) for scraping web pages, triplifying the scraped data, and manipulating the resulting RDF (see the second sketch below)
  • the resulting RDF datasets should contain all relevant data for the integration task in Checkpoint 2
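
For the CSV route, the sketch below illustrates the triplification idea in Python with rdflib. It is an illustration only, not the required tooling; the file name monuments.csv, the columns id/name/year, the class Monument, and the http://example.org/ namespace are hypothetical placeholders to be replaced by your own Checkpoint 0 data and schema.

    import csv

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    # Hypothetical namespace for this dataset's own schema.
    EX = Namespace("http://example.org/monuments/")

    g = Graph()
    g.bind("ex", EX)

    # monuments.csv and its columns (id, name, year) stand in for one of
    # the Checkpoint 0 data sources.
    with open("monuments.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            monument = EX[row["id"]]  # one resource per CSV row
            g.add((monument, RDF.type, EX.Monument))
            g.add((monument, EX.name, Literal(row["name"])))
            g.add((monument, EX.yearBuilt, Literal(row["year"], datatype=XSD.gYear)))

    # The Turtle file can be loaded into GraphDB and further manipulated
    # there with SPARQL (e.g. CONSTRUCT or UPDATE queries).
    g.serialize(destination="monuments.ttl", format="turtle")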
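
For the scraping route, a minimal Scrapy spider sketch follows; the start URL and CSS selectors are hypothetical and must be adapted to the actual pages from Checkpoint 0. The spider emits flat items, which can then be triplified with a script like the one above or in GraphDB.

    import scrapy

    class MonumentSpider(scrapy.Spider):
        # Hypothetical spider: the URL and selectors are placeholders.
        name = "monuments"
        start_urls = ["https://example.org/monuments"]

        def parse(self, response):
            # Assume a listing table with one monument per row;
            # skip the header row.
            for row in response.css("table.listing tr")[1:]:
                yield {
                    "name": row.css("td:nth-child(1)::text").get(),
                    "year": row.css("td:nth-child(2)::text").get(),
                    "url": response.urljoin(row.css("a::attr(href)").get(default="")),
                }

Run it with scrapy runspider spider.py -o items.csv and feed the resulting CSV into the triplification step above.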