Based on the model and data gathered in Checkpoint 1, students will create an ontology for answering the question and RDFize the data sources. The output of this checkpoint is a formalized ontology describing the domain, a set of ontologies describing the specific datasets (the formalized dataset models), and RDFized data annotated with these ontologies, together with the RDFization pipeline. Done right, this checkpoint represents about 70 % of all the work.
The second checkpoint has two main deliverables: the first is the set of ontologies describing the conceptual model and the dataset models; the second is the RDFization pipeline, which transforms the data into an RDF serialization corresponding to the dataset ontologies.
For the ontology creation, deliver:
For the RDFization pipeline, deliver:
Besides that, take the PDF delivered for Checkpoint 1 and extend it by:
At the next tutorial, everyone gives a quick (4-minute) presentation of their pipeline.
Create an ontology describing the domain in an appropriate tool. Use the model (E-R, UML, or other) from Checkpoint 1 as the basis for the ontology. Start with the creation of a SKOS glossary: create a concept scheme as a container for all the domain concepts. This glossary shall contain all of the concepts with proper annotations (prefLabels and altLabels, definitions, sources, etc.). Then use RDF(S) and OWL basics to add stereotypes to the concepts (classes, properties, objects, relators, events, …) and to create relations between the terms (defining domains and ranges for object/data properties, creating sub-classes and/or sub-properties). The output shall be an RDF file describing the real-world knowledge of the domain that allows the question to be answered on the abstract level (e.g. any instance of a class is connected through the relations to instances of other classes).
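As an illustration only, the following Python sketch (using the rdflib library) shows how a SKOS concept scheme, an annotated concept, and basic RDF(S)/OWL stereotyping might be produced programmatically. The namespace, concept names, and file name are hypothetical placeholders, and a dedicated ontology editor is an equally valid way to produce the same RDF file.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, SKOS, XSD

# Hypothetical namespace for the domain ontology -- replace with your own.
EX = Namespace("https://example.org/ontology/domain/")

g = Graph()
g.bind("skos", SKOS)
g.bind("owl", OWL)
g.bind("ex", EX)

# Concept scheme acting as the container for all domain concepts.
scheme = EX["glossary"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((scheme, SKOS.prefLabel, Literal("Domain glossary", lang="en")))

# An example concept with its annotations (prefLabel, altLabel, definition).
accident = EX["Accident"]
g.add((accident, RDF.type, SKOS.Concept))
g.add((accident, SKOS.inScheme, scheme))
g.add((accident, SKOS.prefLabel, Literal("Traffic accident", lang="en")))
g.add((accident, SKOS.altLabel, Literal("Crash", lang="en")))
g.add((accident, SKOS.definition,
       Literal("An event in which a road vehicle collides with another "
               "vehicle, a person, or an obstacle.", lang="en")))

# Stereotype the concept as an OWL class and attach a data property
# with an explicit domain and range.
g.add((accident, RDF.type, OWL.Class))
severity = EX["severity"]
g.add((severity, RDF.type, OWL.DatatypeProperty))
g.add((severity, RDFS.domain, accident))
g.add((severity, RDFS.range, XSD.string))

g.serialize("domain-ontology.ttl", format="turtle")
```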
To formalize the dataset models, create a SKOS glossary per dataset, with unambiguous definitions. Concepts from a single dataset will likely share the same source (but do not necessarily have to). Stereotype the concepts using RDF(S) and OWL basics. For the formal description of the datasets, you may use an existing formal ontology, e.g. an ontology for dataset description. The output shall be a set of RDF files, one per dataset, formally describing the dataset schemas. Do not interconnect them with each other nor with the formalized conceptual model ontology; this will be one of the goals of the 3rd checkpoint.
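For the dataset-level files, a similarly hedged sketch follows. It assumes VoID (http://rdfs.org/ns/void#) as one possible vocabulary for dataset description; the dataset title, namespaces, column names, and URLs are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, OWL, RDF, SKOS

# VoID is used here only as an example of a dataset-description vocabulary.
VOID = Namespace("http://rdfs.org/ns/void#")
DS = Namespace("https://example.org/dataset/accidents/")

g = Graph()
g.bind("void", VOID)
g.bind("dcterms", DCTERMS)
g.bind("ds", DS)

# Formal description of the dataset itself (title and source are made up).
dataset = DS["dataset"]
g.add((dataset, RDF.type, VOID.Dataset))
g.add((dataset, DCTERMS.title, Literal("Traffic accidents 2023 (CSV export)", lang="en")))
g.add((dataset, DCTERMS.source, URIRef("https://example.org/downloads/accidents-2023.csv")))

# Per-dataset SKOS glossary with a stereotyped concept mirroring one CSV column.
scheme = DS["glossary"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))
severity = DS["severity"]
g.add((severity, RDF.type, SKOS.Concept))
g.add((severity, SKOS.inScheme, scheme))
g.add((severity, SKOS.prefLabel, Literal("Accident severity", lang="en")))
g.add((severity, SKOS.definition,
       Literal("Value of the 'severity' column in the CSV export.", lang="en")))
g.add((severity, RDF.type, OWL.DatatypeProperty))

g.serialize("accidents-dataset-model.ttl", format="turtle")
```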
Before we start working with the datasets and before setting up a pipeline, double-check that the data contain all the information you need and that they contain it to the full extent. Try to check the credibility of the data. Who is the originator (some official authority)? Are the data referential? Are they up to date? Answering those (and other) questions will tell you a lot about how usable the data are for solving your problem.
If the data do not pass these checks, try to clean them or find other data of higher quality.
The data pipeline is a process that extracts an RDF dataset out of each data source. There are multiple tools designed for this purpose. Alternatively, it is possible to write a simple script in any programming language you are familiar with (a minimal sketch of this option is shown after the serialization note below). Feel free to use any tools you are used to, and take the complexity of the data you are processing into account. Here is a list of some tools:
Feel free to serialize the data into any RDF format, but keep in mind that specific formats have specific uses (e.g. JSON-LD is nice for building web apps, but not really supported by triple stores, etc.). Make sure that the RDFized data corresponds to the dataset models delivered in the first part of the checkpoint. The best check is a working SPARQL query.
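Here is the promised sketch of the scripting option, again in Python with rdflib: it converts a hypothetical CSV file (the file name, columns, and namespaces are placeholders) into Turtle and then verifies the result with a simple SPARQL query.

```python
import csv

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Hypothetical namespaces and input file -- align them with your dataset model.
DS = Namespace("https://example.org/dataset/accidents/")
RES = Namespace("https://example.org/resource/accident/")

g = Graph()
g.bind("ds", DS)

# Extract: turn every CSV row into an instance described by the dataset ontology.
with open("accidents.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        subject = RES[row["id"]]
        g.add((subject, RDF.type, DS.Accident))
        g.add((subject, DS.date, Literal(row["date"], datatype=XSD.date)))
        g.add((subject, DS.severity, Literal(row["severity"])))

# Serialize into Turtle (any RDF format works; mind its intended usage).
g.serialize("accidents.ttl", format="turtle")

# The best check is a working SPARQL query over the produced data.
results = g.query("""
    PREFIX ds: <https://example.org/dataset/accidents/>
    SELECT (COUNT(?a) AS ?n) WHERE { ?a a ds:Accident . }
""")
for row in results:
    print("Accidents converted:", row[0])
```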
If your data contain any spatial information, be extra careful (better to consult it with the lecturer) and use a correct spatial representation.
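If, for instance, your sources carry coordinates, one common representation is a GeoSPARQL geometry with a WKT literal. The sketch below assumes that approach (the resource names and coordinates are made up); the exact modelling should still be confirmed with the lecturer.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# GeoSPARQL vocabulary; the resources and coordinates below are hypothetical.
GEO = Namespace("http://www.opengis.net/ont/geosparql#")
RES = Namespace("https://example.org/resource/accident/")

g = Graph()
g.bind("geo", GEO)

feature = RES["a-123"]
geometry = RES["a-123/geometry"]

# Attach a geometry carrying a WKT literal; note that WKT in the default
# GeoSPARQL CRS (CRS84) uses longitude-latitude order.
g.add((feature, GEO.hasGeometry, geometry))
g.add((geometry, RDF.type, GEO.Geometry))
g.add((geometry, GEO.asWKT, Literal("POINT(14.4208 50.0880)", datatype=GEO.wktLiteral)))

print(g.serialize(format="turtle"))
```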
The resulting RDF datasets should contain all relevant data for the integration task in Checkpoint 3. Take into account that interconnecting the data is part of the 3rd checkpoint.