
Checkpoint 1 (max 20 pts)


Goal

Create a semi-automated data pipeline that transforms the data sources from Checkpoint 0 into RDF. Each data source keeps its own, separate data schema.

Deliverables

  • source code of the data pipeline (Scrapy, SPARQL, s-pipes, OpenRefine, etc.) that creates RDF datasets out of the data sources
  • the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in Checkpoint 0
  • a short description (a 1-2 page extension of the report from Checkpoint 0) of the data pipeline, its benefits and limitations, together with a UML class diagram depicting the schema of each dataset.

Details

  • the data pipeline should extract an RDF dataset out of each data source
  • choose any tools you like (e.g. any programming language you are familiar with) to create the data pipeline. However, in most cases one of the following two alternatives should be sufficient:
    • GraphDB (OpenRefine+SPARQL) for processing CSV files, triplifying them, and manipulating the resulting RDF (see the first sketch below)
    • Scrapy + GraphDB (OpenRefine+SPARQL) for scraping web pages, triplifying the scraped data, and manipulating the resulting RDF (see the second sketch below)
  • the resulting RDF datasets should contain all relevant data for the integration task in Checkpoint 2
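
For the CSV route, the sketch below illustrates the triplification idea in Python with rdflib. It is an illustration only, not the required tooling; the file name monuments.csv, the columns id/name/year, the class Monument, and the http://example.org/ namespace are hypothetical placeholders to be replaced by your own Checkpoint 0 data and schema.

    import csv

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    # Hypothetical namespace for this dataset's own schema.
    EX = Namespace("http://example.org/monuments/")

    g = Graph()
    g.bind("ex", EX)

    # monuments.csv and its columns (id, name, year) stand in for one of
    # the Checkpoint 0 data sources.
    with open("monuments.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            monument = EX[row["id"]]  # one resource per CSV row
            g.add((monument, RDF.type, EX.Monument))
            g.add((monument, EX.name, Literal(row["name"])))
            g.add((monument, EX.yearBuilt, Literal(row["year"], datatype=XSD.gYear)))

    # The Turtle file can be loaded into GraphDB and further manipulated
    # there with SPARQL (e.g. CONSTRUCT or UPDATE queries).
    g.serialize(destination="monuments.ttl", format="turtle")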
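
For the scraping route, a minimal Scrapy spider sketch follows; the start URL and CSS selectors are hypothetical and must be adapted to the actual pages from Checkpoint 0. The spider emits flat items, which can then be triplified with a script like the one above or in GraphDB.

    import scrapy

    class MonumentSpider(scrapy.Spider):
        # Hypothetical spider: the URL and selectors are placeholders.
        name = "monuments"
        start_urls = ["https://example.org/monuments"]

        def parse(self, response):
            # Assume a listing table with one monument per row;
            # skip the header row.
            for row in response.css("table.listing tr")[1:]:
                yield {
                    "name": row.css("td:nth-child(1)::text").get(),
                    "year": row.css("td:nth-child(2)::text").get(),
                    "url": response.urljoin(row.css("a::attr(href)").get(default="")),
                }

Run it with scrapy runspider spider.py -o items.csv and feed the resulting CSV into the triplification step above.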