
Checkpoint 2 (max 25 pts)


Deadline 21. 11. 2021

Goal

Based on the model and data gathered in Checkpoint 1, students will create an OWL ontology for answering the question and RDFize the data sources. The output of the checkpoint is an interconnected set of ontologies describing the domain and the specific datasets, together with RDFized data annotated by those ontologies.

Deliverable

  • OWL ontologies describing the domain (and, where applicable, the data sources),
  • source code for the data pipeline (Scrapy, SPARQL, s-pipes, OpenRefine, etc.) for creating RDF datasets out of the data sources,
  • the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in Checkpoint 1,
  • a short description (a 1-2 page extension of the report from Checkpoint 1) describing the creation of the ontologies and the data pipeline, its limitations and benefits, together with an E-R model describing each dataset.

Details

OWL ontologies for domain and data sources

Create an OWL ontology describing the domain in an appropriate tool. Use the E-R model as a basis for the ontology. Start with the creation of a SKOS glossary – a list of terms appearing in the domain with attributes such as preferred label, description, source of the term, etc. Then add OWL stereotypes and create relations between the terms. The next step may be modelling the structure of the datasets and their mapping to the domain ontology (using specialization or related properties). The final model shall describe all the data sources and interconnect their classes and properties based on the domain model.
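
As an illustration, the following minimal sketch shows how one glossary term with a preferred label, definition, source, and an OWL class stereotype could be built programmatically. It assumes the rdflib Python library; the namespace and the term itself are made up for the example.

  from rdflib import Graph, Literal, Namespace, URIRef
  from rdflib.namespace import OWL, RDF, RDFS, SKOS

  # Hypothetical namespace of your domain ontology.
  ONT = Namespace("http://example.org/ontology/accidents/")

  g = Graph()
  g.bind("skos", SKOS)
  g.bind("owl", OWL)
  g.bind("ex", ONT)

  term = ONT["traffic-accident"]

  # SKOS glossary entry: preferred label, description, and source of the term.
  g.add((term, RDF.type, SKOS.Concept))
  g.add((term, SKOS.prefLabel, Literal("Traffic accident", lang="en")))
  g.add((term, SKOS.definition, Literal("An unintended collision involving at least one vehicle.", lang="en")))
  g.add((term, RDFS.isDefinedBy, URIRef("https://example.org/source-dataset")))

  # OWL stereotype: the same term is also modelled as an OWL class in the domain ontology.
  g.add((term, RDF.type, OWL.Class))

  print(g.serialize(format="turtle"))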

Data check

Before setting up a pipeline, double-check that the data contain all the information you need, and that they contain it in full. Try to check the credibility of the data. Who is its originator? Are the data referential? Are they up-to-date? Answering these (and other) questions will tell you a lot about how usable the data are for solving your problem.

Briefly describe the outputs of the quality check in the deliverable.

If the data do not pass, try to clean them or find other data of higher quality.
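
A quick completeness check can be scripted before any RDFization. The following sketch assumes a hypothetical accidents.csv file and column names; adapt it to your own data sources.

  import csv
  from collections import Counter

  # Hypothetical columns that must be filled for the question to be answerable.
  REQUIRED_COLUMNS = ["id", "date", "municipality", "latitude", "longitude"]

  missing = Counter()
  rows = 0
  with open("accidents.csv", newline="", encoding="utf-8") as f:
      for row in csv.DictReader(f):
          rows += 1
          for col in REQUIRED_COLUMNS:
              if not (row.get(col) or "").strip():
                  missing[col] += 1

  print(f"{rows} rows checked")
  for col in REQUIRED_COLUMNS:
      print(f"{col}: {missing[col]} missing values ({missing[col] / max(rows, 1):.1%})")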

RDFization pipeline

The data pipeline is a process that extracts an RDF dataset from each data source. There are multiple tools intended for this purpose. Alternatively, it is possible to write a simple script in any programming language you are familiar with (a minimal sketch of such a script follows the list below). For simple cases, one of the following two alternatives should be sufficient:

  • GraphDB (OpenRefine+SPARQL) for processing CSV files, triplifying them, and manipulating the resulting RDF,
  • Scrapy + GraphDB (OpenRefine+SPARQL) for scraping web pages, triplifying them, and manipulating the resulting RDF.
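
If you prefer a plain script over the tools above, the following sketch shows a minimal CSV-to-RDF conversion. It assumes the rdflib Python library and a hypothetical accidents.csv; the namespaces, class, and property names are illustrative and should come from your own ontology.

  import csv

  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import RDF, XSD

  # Hypothetical namespaces; replace them with the ones defined in your ontology.
  ONT = Namespace("http://example.org/ontology/accidents/")
  RES = Namespace("http://example.org/resources/accidents/")

  g = Graph()
  g.bind("ex", ONT)

  with open("accidents.csv", newline="", encoding="utf-8") as f:
      for row in csv.DictReader(f):
          accident = RES[row["id"]]
          g.add((accident, RDF.type, ONT.TrafficAccident))
          g.add((accident, ONT.date, Literal(row["date"], datatype=XSD.date)))
          g.add((accident, ONT.municipality, Literal(row["municipality"])))

  g.serialize(destination="accidents.ttl", format="turtle")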

Data in JSON format can be RDFized by adding a JSON-LD context, but keep in mind that only some triple stores support JSON-LD (and even fewer support its current version).
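
The sketch below illustrates the idea: a JSON record gets a JSON-LD @context mapping its fields to ontology terms and can then be loaded as RDF. It assumes an rdflib version with built-in JSON-LD support; the context and field names are illustrative.

  import json

  from rdflib import Graph

  # Illustrative JSON record with an added @context mapping fields to (hypothetical) ontology terms.
  doc = {
      "@context": {
          "@vocab": "http://example.org/ontology/accidents/",
          "id": "@id",
      },
      "id": "http://example.org/resources/accidents/123",
      "date": "2021-10-01",
      "municipality": "Praha",
  }

  g = Graph()
  g.parse(data=json.dumps(doc), format="json-ld")
  print(g.serialize(format="turtle"))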

Keep in mind that handling spatial data (coordinates) is more complicated than it looks. It is better to consult the lecturer beforehand.
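
One common option, shown in the sketch below, is to store coordinates as GeoSPARQL WKT literals; whether this fits your triple store and queries is exactly the kind of thing to discuss with the lecturer. The sketch assumes rdflib, and the resource names are illustrative.

  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import RDF

  # GeoSPARQL vocabulary, defined explicitly to avoid depending on a particular rdflib version.
  GEO = Namespace("http://www.opengis.net/ont/geosparql#")
  RES = Namespace("http://example.org/resources/accidents/")

  g = Graph()
  g.bind("geo", GEO)

  accident = RES["123"]
  geometry = RES["123/geometry"]

  g.add((accident, GEO.hasGeometry, geometry))
  g.add((geometry, RDF.type, GEO.Geometry))
  # Note: WKT uses (longitude latitude) order, which is easy to get wrong.
  g.add((geometry, GEO.asWKT, Literal("POINT(14.4378 50.0755)", datatype=GEO.wktLiteral)))

  print(g.serialize(format="turtle"))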

The resulting RDF datasets should contain all data relevant for the integration task in Checkpoint 3. This means that all data to be used during the integration must be produced in this step at the latest.

Data scheme

For every RDF dataset, create and deliver a UML or E-R diagram of its schema. Use the schemas to check whether you really have all the data you need for the data aggregation. In the deliverable, justify how you are going to answer the question using the data you have gathered and which attributes you are going to use.

Remember that the goal of the whole semestral work is to combine data, information, and knowledge based on the datasets you have. The purpose of the quality check is to catch unsuitable data early, although it may still happen that you discover low data quality only after transforming the data to RDF. If you are still unsure about the suitability of the datasets for the semestral work, please let us know, e.g. by email, or come for a consultation.