===== Checkpoint 2 (max 25 pts) =====


==== Goal ====
Based on the model and data gathered in [[cp1|Checkpoint 1]], students will create an OWL ontology for answering the question and RDFize the data sources. The output of the checkpoint is a formalized ontology describing the domain, a set of ontologies describing the specific datasets, and RDFized data annotated by those ontologies (formalized dataset models). Done right, this checkpoint accounts for about 70 % of all the work.

==== Deliverable ====

The second checkpoint has two main deliverables: the ontologies describing the conceptual model and the dataset models, and the RDFization pipeline that transforms the data into RDF conforming to the dataset ontologies.

For the ontology creation, deliver:
  * an ontology of the conceptual model, containing unambiguously defined concepts (with definitions, sources, etc., using SKOS, RDF(S), or another high-level ontology) and their interconnections,
  * ontologies of the dataset models, describing the content of the specific datasets. Deliver one ontology per dataset.

For the RDFization pipeline, deliver:
  * source code for the data pipeline (Scrapy, Python/Java/Ruby program, OntoRefine config file, SPARQL queries, etc.) for creating RDF datasets out of data sources,
  * the RDF datasets (outputs of the data pipeline) obtained from the data sources defined in [[cp1|Checkpoint 1]], corresponding to the dataset models.

Besides that, take the PDF delivered for [[cp1|Checkpoint 1]] and extend it with
  * a description of the ontology creation,
  * a description of the data pipeline, its limitations and benefits, and a foolproof tutorial on how to run it.

**At the next tutorial, everyone will give a short (4-minute) presentation of their pipeline.**

==== Details ====

== OWL ontologies for conceptual model describing the domain ==

Create an OWL ontology describing the domain in an appropriate tool. Use the model (E-R, UML, other) as a basis for the ontology. Start by creating a SKOS glossary, with a concept scheme as a container for all the domain concepts. This glossary shall contain all of the concepts with proper annotations (prefLabels and altLabels, definitions, sources, etc.). Then use RDF(S) and OWL to add stereotypes to the concepts (classes, properties, objects, relators, events...) and create relations between the terms (defining domains and ranges for object and data properties, creating sub-classes and/or sub-properties). The output shall be an RDF file describing the real-world knowledge of the domain, allowing the question to be answered on the abstract level (e.g. any instance of a class is connected through the relations to instances of other classes).

== OWL ontologies for dataset models ==

To formalize the dataset models, create a SKOS glossary for each dataset, with unambiguous definitions. Concepts from a single dataset will likely have the same source (but do not necessarily have to). Stereotype the concepts using OWL and RDF(S). For the formal description of the datasets, you may use some formal ontologies, e.g. {{ https://github.com/kbss-cvut/popis-dat-ontology | ontology for dataset description }}. The output shall be a set of RDF files, one per dataset, each formally describing the dataset schema. Do not interconnect them with each other or with the formalized conceptual model ontology.

== Data check ==

The first essential part of the task is a data check. Before setting up a pipeline, double-check that the data contains all the information you need and that it contains it to the full extent. Try to check the credibility of the data. Who is its originator (is it an official authority)? Is it reference data? Is it up-to-date? Answering those (and other) questions will tell you a lot about how usable the data is for solving your problem.

If the data does not pass the check, try to clean it or find other data of higher quality.
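
Part of the check (completeness and uniqueness of identifiers) can be automated. The following is a minimal sketch using only the Python standard library; the function ''check_table'', the column names and the sample rows are hypothetical, and credibility or currency of the data still has to be judged by hand.

```python
import csv
import io
from collections import Counter


def check_table(rows, required, key):
    """Count empty required values and report duplicated key values."""
    missing = Counter()
    keys = Counter()
    for row in rows:
        for column in required:
            # Treat absent, None and whitespace-only values as missing.
            if not (row.get(column) or "").strip():
                missing[column] += 1
        keys[row.get(key)] += 1
    duplicates = sorted(k for k, n in keys.items() if n > 1)
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicate_keys": duplicates,
    }


# Hypothetical sample standing in for a real data source.
SAMPLE = """id,name,municipality
1,First School,Prague
2,,Brno
2,Third School,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = check_table(rows, required=["name", "municipality"], key="id")
print(report)
```

For the sample above, the report shows one missing ''name'', one missing ''municipality'' and the duplicated identifier ''2''.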

== RDFization pipeline ==

The data pipeline is a process that extracts an RDF dataset from each data source. Several tools are designed for this purpose. Alternatively, you can write a simple script in any programming language you are familiar with. Feel free to use any tools you are used to, and take into account the complexity of the data you are processing. Here is a list of some tools:

  * GraphDB + OpenRefine + SPARQL -- suitable for tabular data without complex relations (more complex relations are probably doable, but not really simple). Do not forget that SPARQL also has UPDATE operations,
  * [[https://scrapy.org/|Scrapy]] -- a great tool for getting data from the web. Beware that some administrative bodies do not like their pages being scraped (disliking it is in most cases the only thing they can legally do about it, but politeness is good, so try writing them an email),
  * Python, Java, Ruby, or any other programming language you like -- use libraries for processing data as RDF triples (rdflib, rdf-pandas, RDF4J, ruby-rdf, etc.). This is probably the most efficient way to handle the data if you are not sure how complex it may become. It requires some programming skills that are not taught in this course, but that students at your level should have.

Feel free to serialize the data into any RDF format, but keep in mind that specific formats have specific uses (e.g. JSON-LD is nice for building web apps, but not really supported by triple stores, etc.). **Make sure that the RDFized data corresponds to the dataset models delivered in the first part of the checkpoint**.

If your data contains any spatial information, be extra careful (preferably consult the lecturer) and use a correct spatial representation.

The resulting RDF datasets should contain all relevant data for the integration task in [[cp3|Checkpoint 3]]. Take into account that interconnecting the data is part of the 3rd checkpoint.

<note important>Remember that the goal of the whole semestral work is to **combine data, information, and knowledge** based on the datasets you have. The purpose of a quality check is to prevent future problems, although you may still discover low-quality data after its transformation to RDF. If you are still unsure about the suitability of the datasets for the semestral work, please let us know, e.g. by email, or come for a consultation.</note>