Checkpoint 2 (max 25 pts)


Deadline 20. 11. 2022

Goal

Based on the model and data gathered in Checkpoint 1, students will create an OWL ontology for answering the question and RDFize the data sources. The output of the checkpoint is a formalized ontology describing the domain, a set of ontologies describing the specific datasets, and RDFized data annotated by those ontologies (the formalized dataset models). Done right, this checkpoint amounts to about 70 % of all the work.

Deliverable

The second checkpoint has two main deliverables: the ontologies (describing the conceptual model and the dataset models) and the RDFization pipeline, which transforms the data into an RDF serialization corresponding to the dataset ontologies.

For the ontology creation, deliver:

  • an ontology of the conceptual model, containing unambiguously defined concepts (with definitions, sources etc., using SKOS, RDF(S) or another high-level ontology) and their interconnections,
  • ontologies of the dataset models, each describing the content of a specific dataset. Deliver one ontology per used dataset.

For the RDFization pipeline deliver:

  • the source code of the data pipeline (Scrapy, a Python/Java/Ruby program, an OntoRefine config file, SPARQL queries…) that creates RDF datasets out of the data sources,
  • the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in Checkpoint 1.

Besides that, take the PDF delivered for Checkpoint 1 and extend it with

  • a description of the ontology creation,
  • a description of the data pipeline, its limitations and benefits, and a foolproof tutorial on how to run it.

At the next tutorial, everyone gives a quick (3-minute) presentation of their pipeline.

Details

OWL ontologies for conceptual model describing the domain

Create an OWL ontology describing the domain in an appropriate tool. Use the E-R (UML, or other) model as the basis of the ontology. Start by creating a SKOS glossary: create a concept scheme as a container for all the domain concepts and include all of the concepts in it. Describe each concept properly with an unambiguous definition, a prefLabel and altLabels, and a source (e.g. a legislative document, technical standard…). Then use RDF(S) and OWL to stereotype the concepts (classes, properties, objects, relators, events…) and to create relations between terms (defining domains and ranges of object/data properties, creating sub-classes and/or sub-properties).
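
The following is a minimal sketch of such a glossary in Python with rdflib (one of the libraries mentioned later on this page); the namespace and the concepts ("Parking lot", "capacity") are illustrative placeholders, not part of the assignment.

  from rdflib import Graph, Namespace, Literal
  from rdflib.namespace import SKOS, RDF, RDFS, OWL, DCTERMS

  EX = Namespace("https://example.org/ontology/")   # placeholder namespace
  g = Graph()
  g.bind("skos", SKOS)
  g.bind("ex", EX)

  # Concept scheme acting as the container for all domain concepts
  scheme = EX["glossary"]
  g.add((scheme, RDF.type, SKOS.ConceptScheme))
  g.add((scheme, SKOS.prefLabel, Literal("Domain glossary", lang="en")))

  # One concept with a definition, labels and a source
  parking = EX["ParkingLot"]
  g.add((parking, RDF.type, SKOS.Concept))
  g.add((parking, SKOS.inScheme, scheme))
  g.add((parking, SKOS.prefLabel, Literal("Parking lot", lang="en")))
  g.add((parking, SKOS.altLabel, Literal("Car park", lang="en")))
  g.add((parking, SKOS.definition, Literal("An area designated for parking vehicles.", lang="en")))
  g.add((parking, DCTERMS.source, Literal("placeholder for a legislative document or standard")))

  # Stereotype the concept as an OWL class and relate it to another term
  g.add((parking, RDF.type, OWL.Class))
  g.add((EX["capacity"], RDF.type, OWL.DatatypeProperty))
  g.add((EX["capacity"], RDFS.domain, parking))

  print(g.serialize(format="turtle"))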

OWL ontologies for dataset models

To formalize the dataset models, create a SKOS glossary for each dataset, again with unambiguous definitions. Concepts from a single dataset will likely (but not necessarily) share the same source. Stereotype the concepts using OWL and RDF(S). For the formal description of the datasets themselves you may use an existing formal ontology, e.g. an ontology for dataset description.
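
As an illustration, a dataset-model glossary built the same way might look like the sketch below (again Python + rdflib); the dataset, its concepts and the use of DCAT are only assumed examples, not requirements.

  from rdflib import Graph, Namespace, Literal
  from rdflib.namespace import SKOS, RDF, RDFS, OWL

  EX = Namespace("https://example.org/ontology/")           # domain ontology from above (assumed)
  DS = Namespace("https://example.org/dataset/parking/")    # model of one dataset (assumed)
  DCAT = Namespace("http://www.w3.org/ns/dcat#")            # one possible dataset-description vocabulary

  g = Graph()
  g.bind("ds", DS)
  g.bind("dcat", DCAT)

  # Dataset-level concept with its own glossary entry...
  record = DS["ParkingRecord"]
  g.add((record, RDF.type, SKOS.Concept))
  g.add((record, SKOS.prefLabel, Literal("Parking record", lang="en")))
  g.add((record, SKOS.definition, Literal("One row of the parking occupancy table.", lang="en")))

  # ...stereotyped as an OWL class and tied to the domain concept
  g.add((record, RDF.type, OWL.Class))
  g.add((record, RDFS.subClassOf, EX["ParkingLot"]))

  # Formal description of the dataset itself
  g.add((DS["dataset"], RDF.type, DCAT.Dataset))
  g.add((DS["dataset"], DCAT.keyword, Literal("parking")))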

Data check

The first essential part of the task is a data check. Before setting up a pipeline, double-check that the data contain all the information you need, and that they contain it to the full extent. Also check the credibility of the data: Who is the originator? Are the data referential? Are they up to date? Answering these (and other) questions will tell you a lot about the usability of the data for solving your problem.

If the data do not pass the check, try to clean them or find other data of higher quality.
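
For a tabular source, even a few lines of Python can make the check systematic; this rough sketch assumes pandas and a hypothetical file parking.csv with the columns listed below.

  import pandas as pd

  REQUIRED = ["id", "name", "capacity", "latitude", "longitude"]  # assumed required columns

  df = pd.read_csv("parking.csv")   # hypothetical data source

  # Are all required columns present at all?
  missing_cols = [c for c in REQUIRED if c not in df.columns]
  print("Missing columns:", missing_cols)

  # How complete is each required column that is actually present?
  for col in [c for c in REQUIRED if c in df.columns]:
      filled = df[col].notna().mean()
      print(f"{col}: {filled:.0%} filled")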

RDFization pipeline

The data pipeline is a process that extracts an RDF dataset out of each data source. There are multiple tools designed for this purpose; alternatively, you can write a simple script in any programming language you are familiar with. Feel free to use any tools you are used to, but take into account the complexity of the data you are processing. Here is a list of some tools:

  • GraphDB + OpenRefine + SPARQL – suitable for tabular data without more complex relations (with complex relations it is probably doable, but not really simple). Do not forget that SPARQL has an UPDATE statement.
  • Scrapy – a perfect tool for getting data from the web. Beware that some administrative bodies do not like their pages being scraped (which is in most cases the only thing they can legally do about it – but politeness is good, so try to write them an email),
  • Python, Java, Ruby, or any other programming language you like – use libraries for processing data as RDF triples (rdflib, rdf-pandas, RDF4J, ruby-rdf etc.). This is probably the most efficient way to handle the data if you are not sure how complex it may become. It requires some skills that are not part of this course; a minimal sketch follows this list.
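
As an example of the last option, the following sketch (assuming pandas, rdflib, the hypothetical parking.csv and the dataset model sketched earlier) turns each row of a CSV file into triples:

  import pandas as pd
  from rdflib import Graph, Namespace, Literal
  from rdflib.namespace import RDF, XSD

  DS = Namespace("https://example.org/dataset/parking/")    # assumed dataset ontology
  RES = Namespace("https://example.org/resource/parking/")  # assumed namespace for the data itself

  g = Graph()
  g.bind("ds", DS)

  df = pd.read_csv("parking.csv")
  for _, row in df.iterrows():
      subject = RES[str(row["id"])]
      g.add((subject, RDF.type, DS["ParkingRecord"]))
      g.add((subject, DS["name"], Literal(row["name"])))
      g.add((subject, DS["capacity"], Literal(int(row["capacity"]), datatype=XSD.integer)))

  g.serialize(destination="parking.ttl", format="turtle")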

Feel free to serialize the data into any RDF format, but keep in mind that specific formats have specific usages (e.g. JSON-LD is nice for building web apps, but not really supported by triple stores, etc.). Make sure that the RDFized data corresponds to the dataset models delivered in the first part of the checkpoint.
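
With rdflib (version 6 or newer, where the JSON-LD serializer is built in), switching serializations of the graph g built above is a one-liner per format:

  g.serialize(destination="parking.ttl", format="turtle")      # readable, good default for triple stores
  g.serialize(destination="parking.nt", format="nt")           # N-Triples, easy to stream and merge
  g.serialize(destination="parking.jsonld", format="json-ld")  # handy for web apps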

If your data contain any spatial information, be extra careful (better consult the lecturer) and use a correct spatial representation.
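
One common representation (shown here only as a hedged example, not as the required one) is a GeoSPARQL WKT literal attached to the resource; the coordinates and resource names below are placeholders.

  from rdflib import Graph, Namespace, Literal, BNode
  from rdflib.namespace import RDF

  GEO = Namespace("http://www.opengis.net/ont/geosparql#")
  RES = Namespace("https://example.org/resource/parking/")  # assumed resource namespace

  g = Graph()
  g.bind("geo", GEO)

  geometry = BNode()
  g.add((RES["p1"], GEO.hasGeometry, geometry))
  g.add((geometry, RDF.type, GEO.Geometry))
  # WKT in the default CRS84 uses longitude-latitude order
  g.add((geometry, GEO.asWKT, Literal("POINT(14.4208 50.0880)", datatype=GEO.wktLiteral)))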

The resulting RDF datasets should contain all relevant data for the integration task in Checkpoint 3.

Remember that the goal of the whole semestral work is to combine data, information and knowledge based on the datasets you have. The purpose of the quality check is to prevent unsuitable data from undermining this goal, although it may happen that you only discover the low quality of the data after its transformation to RDF. If you are still unsure about the suitability of your datasets for the semestral work, please let us know, e.g. by email, or come and consult.