This page is located in the archive.

Checkpoint 0 (max 5 pts)

Deadline 4. 10. 2020


The goal of this checkpoint is to define the topic of your data integration task, formulate the question you need answered, and select datasets containing that information.


Deliverable: a PDF of 1–2 pages describing

  1. the topic you chose, your motivation for it, and possible use-cases and beneficiaries
  2. a list of data sources which you found relevant for the topic, with basic info for each (how complex the data schema is, how much data the source contains, what kind of data it contains, which organization created it, etc.)
  3. a selection of data sources from this list that you will use for your semestral work. It must contain at least three datasets from different providers (a catalogue is not a source – if you find data in a catalogue, check who actually authored it).

At the next tutorial, every student has 2–3 minutes to present their topic, motivation, question and selected datasets. (Keep it quick – everything within 60 minutes in total.)



When selecting a topic, try to think globally. A problem at the global level may be solved differently at lower granularity levels. Students often discover the low quality of their chosen data only in the later stages of the project. With a global topic, it is easier to find similar data at another level, e.g. from a different country.

It is crucial to define the problem you want to solve. From the problem follows the question, the selection of datasets and everything else. Think twice about the topic – it should be something with visible use-cases and benefits, ideally something you are personally interested in.

You should create a topic (e.g. “Effectiveness of precautions against various diseases”) and briefly describe its motivation, use-case and purpose. Your topic might span (but is not limited to) the following areas:

  • Covid-19 (its spread, economic impact, effectiveness of precautions, etc.),
  • Ecology and climate change (anthropogenic influence on the environment, floods, ice melting, the greenhouse effect, etc.),
  • Economic crisis (central banks, fiat money, inflation, etc.) and/or
  • combinations of the above (influence of Covid-19 precautions on the economy, inflation and ecology, etc.).

While selecting the datasets, think about the value they give you for solving your problem. Even very rich data are useless if they do not contain the one piece of information you need. For every dataset, try to answer the following questions (and include the answers in the deliverable):

  1. How does this dataset enrich other datasets? What kind of information does it bring to the cocktail?
  2. How are you planning to link this dataset to the other datasets? Is there any other connection than spatial or temporal?
  3. What is the question you are going to ask this data? Which information are you going to aggregate?
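The linking question above usually boils down to a join on a shared spatial or temporal key. A minimal sketch in pandas, where the dataset names and columns (country, week, cases, gdp_change) are purely hypothetical placeholders and not part of the assignment:

```python
# Sketch of linking two datasets on a shared spatial + temporal key.
# All names and values below are illustrative placeholders.
import pandas as pd

covid = pd.DataFrame({
    "country": ["CZ", "CZ", "DE"],
    "week": ["2020-W10", "2020-W11", "2020-W10"],
    "cases": [120, 340, 900],
})
economy = pd.DataFrame({
    "country": ["CZ", "DE"],
    "week": ["2020-W10", "2020-W10"],
    "gdp_change": [-0.5, -0.7],
})

# Spatial + temporal link: join on (country, week).
linked = covid.merge(economy, on=["country", "week"], how="left")
print(linked)
```

Rows without a match end up with missing values – exactly the kind of gap worth spotting at this checkpoint, before committing to a dataset.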

It is also recommended to check not only the schema but also the data content. A data schema with 70+ attributes may look nice, but those attributes must actually be filled in the data itself.
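One quick way to check this is to compute the fill rate per attribute. A sketch, assuming the candidate dataset has been loaded into a pandas DataFrame (the columns here are made-up examples):

```python
# Sketch of checking how well schema attributes are actually filled.
# The DataFrame below stands in for a real candidate dataset.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["a", "b", None, "d"],
    "altitude": [None, None, None, 300.0],
})

# Fraction of non-empty values per column: an attribute that is almost
# never filled adds little value, no matter how rich the schema looks.
fill_rate = df.notna().mean()
print(fill_rate.sort_values())
```

Attributes with a fill rate near zero are a warning sign worth mentioning in the per-dataset description of your deliverable.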

Try to look for alternatives. Are similar data provided elsewhere, e.g. for another country or another domain (not Covid, but cholera)?

Relevant public data sources include existing datasets, existing ontologies, books, web pages, etc. Choose at least three related datasets that are provided by different parties (organizations). Two datasets are related if they overlap in topic but are not technically integrated (they do not share the same identifiers). An example of two related datasets is the Safety Accident Database by the Air Investigation Institute and the Aircraft register by the Czech Civil Aviation Authority.

courses/osw/cp0.txt · Last modified: 2020/09/24 17:34 by medmicha