Deadline 21. 11. 2021
Based on the model and data gathered in Checkpoint 1, students will create an OWL ontology for answering their question and RDFize the data sources. The output of the checkpoint is an interconnected set of ontologies describing the domain and the specific datasets, together with RDFized data annotated with these ontologies.
Create an OWL ontology describing the domain in an appropriate tool. Use the E-R model as the basis for the ontology. Start with the creation of a SKOS glossary: a list of terms appearing in the domain with attributes such as preferred label, description, source of the term, etc. (a minimal sketch is shown below). Then add OWL stereotypes and create relations between the terms. The next step may be modelling the structure of the data sets and their mapping to the domain ontology (using specialization or related properties). The final model shall describe all the data sources and interconnect their classes and properties based on the domain model.
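For illustration only, the following sketch shows how a glossary entry with an OWL stereotype and one relation might be built in Python with rdflib. The namespace and the terms (Accident, Municipality, occurredIn) are hypothetical placeholders, not part of the assignment; adapt them to your own domain.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, OWL, RDF, RDFS, SKOS

    EX = Namespace("http://example.org/ontology/")  # hypothetical namespace

    g = Graph()
    g.bind("ex", EX)
    g.bind("skos", SKOS)

    # Glossary entry: a term with preferred label, description and source
    g.add((EX.Accident, RDF.type, SKOS.Concept))
    g.add((EX.Accident, SKOS.prefLabel, Literal("Traffic accident", lang="en")))
    g.add((EX.Accident, SKOS.definition, Literal("An event in road traffic ...", lang="en")))
    g.add((EX.Accident, DCTERMS.source, Literal("Police open data documentation")))

    # OWL stereotypes: the terms are also modelled as classes
    g.add((EX.Accident, RDF.type, OWL.Class))
    g.add((EX.Municipality, RDF.type, OWL.Class))

    # Relation between terms
    g.add((EX.occurredIn, RDF.type, OWL.ObjectProperty))
    g.add((EX.occurredIn, RDFS.domain, EX.Accident))
    g.add((EX.occurredIn, RDFS.range, EX.Municipality))

    print(g.serialize(format="turtle"))

The same result can of course be produced directly in an ontology editor; the point is that every term carries its label, description and source before any OWL axioms are added.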
Before setting up a pipeline, double-check that the data contain all the information you need, and that they contain it in full. Also try to check the credibility of the data. Who is their originator? Are they reference (authoritative) data? Are they up to date? Answering these (and other) questions will tell you a lot about how usable the data are for solving your problem.
Briefly describe the outputs of the quality check in the delivery.
If the data do not pass, try to clean them or find other data of higher quality.
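A basic completeness check of this kind is easy to script. A minimal sketch in Python with pandas follows; the file name and the required columns are hypothetical placeholders for your own dataset.

    import pandas as pd

    # Hypothetical source file and columns; adjust to your actual dataset
    df = pd.read_csv("accidents.csv")

    print("Rows:", len(df))
    print("Missing values per column:")
    print(df.isna().sum())
    print("Duplicated rows:", df.duplicated().sum())

    # Check that the attributes needed for your question are really present
    required = ["date", "municipality_code", "severity"]
    missing = [c for c in required if c not in df.columns]
    print("Missing required columns:", missing or "none")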
The data pipeline is a process that extracts an RDF dataset out of each data source. There are multiple tools intended for this purpose; alternatively, it is possible to create a simple script in any programming language you are familiar with. For simple cases, the following alternatives should be sufficient:
JSON is RDFized by adding a JSON-LD context, but keep in mind that only some triple stores support JSON-LD (and even fewer its current version); a minimal sketch of this approach is shown after this list.
Keep in mind that handling spatial data (coordinates) is more complicated than it looks; it is better to consult the lecturer beforehand.
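As an illustration of the JSON-LD approach, the sketch below adds a context to a small JSON record and loads it with rdflib (which ships JSON-LD support from version 6 onward). The record, the field names and the example.org vocabulary IRIs are hypothetical; only the WGS84 geo vocabulary is a real, commonly used one.

    import json
    from rdflib import Graph

    # Hypothetical JSON record as it might come from a source API
    record = {
        "id": "http://example.org/resource/accident/1",
        "severity": "light",
        "lat": 50.0755,
        "long": 14.4378,
    }

    # A JSON-LD context maps the plain JSON keys to IRIs from the ontology
    context = {
        "@context": {
            "id": "@id",
            "severity": "http://example.org/ontology/severity",
            "lat": "http://www.w3.org/2003/01/geo/wgs84_pos#lat",
            "long": "http://www.w3.org/2003/01/geo/wgs84_pos#long",
        }
    }

    doc = {**context, **record}

    g = Graph()
    g.parse(data=json.dumps(doc), format="json-ld")
    print(g.serialize(format="turtle"))

Note that whether you keep separate geo:lat/geo:long values or a single GeoSPARQL WKT literal affects how the spatial data can be queried later, which is one more reason to consult the lecturer first.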
The resulting RDF datasets should contain all data relevant for the integration task in Checkpoint 3. This means that all the data to be used during the integration shall be created in this step at the latest.
For every RDF dataset, create and deliver a UML or E-R diagram with its schema. Use the schemas to verify that you really have all the data you need for the data aggregation. In the delivery, justify how you are going to answer the question using the data you have gathered and which attributes you are going to use (a sketch of such a justification is shown below).
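One way to make the justification concrete is to sketch the query you expect to run over the RDFized data. A minimal example in Python with rdflib follows; the file name, the ex: vocabulary and the "accidents per municipality" aggregation are hypothetical and stand in for your own question and attributes.

    from rdflib import Graph

    g = Graph()
    g.parse("accidents.ttl", format="turtle")  # hypothetical RDFized dataset

    # Hypothetical query: which attributes answer "accidents per municipality"?
    query = """
    PREFIX ex: <http://example.org/ontology/>
    SELECT ?municipality (COUNT(?accident) AS ?total)
    WHERE {
        ?accident a ex:Accident ;
                  ex:occurredIn ?municipality .
    }
    GROUP BY ?municipality
    ORDER BY DESC(?total)
    """

    for row in g.query(query):
        print(row.municipality, row.total)

If you cannot write such a query down, that is usually a sign that an attribute or a link between datasets is still missing.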