Deadline 4. 10. 2020
The goal of this checkpoint is to define the topic of your data integration task, formulate the question you want answered, and select datasets containing the information needed to answer it.
A PDF consisting of 1-2 pages describing
At the next tutorial, every student has 2-3 minutes to present their topic and motivation, question, and selected datasets. (Keep it quick: all presentations must fit within 60 minutes.)
When selecting a topic, try to think globally. A problem at the global level may be solved differently at lower levels of granularity. Students often discover only in the later stages of the project that their chosen data are of low quality. With a global topic, it is easier to find similar data at another level, e.g. from another country.
It is crucial to define the problem you want to solve. The question, the selection of datasets, and everything else follow from the problem. Think twice about the topic: it should have visible use cases and benefits and, ideally, be something you are personally interested in.
You should create a topic (e.g. “Effectiveness of precautions against various diseases”) and briefly describe its motivation, use case and purpose. Your topic might span (but is not limited to) the following areas:
While selecting the datasets, think about the value they bring to solving your problem. Even very rich data are of no use to you if they do not contain the one piece of information you need. For every dataset, try to answer the following questions (and include the answers in the delivery):
It is also recommended to check not only the schema but also the data content. A schema with 70+ attributes may look nice, but those attributes must actually be filled in the data itself.
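One possible way to check how well the attributes are filled is to compute, for each column, the share of rows with a non-empty value. The following is only a minimal sketch using the Python standard library; the function name `fill_rates` and the inline sample data (column names and values) are hypothetical and stand in for a real dataset file.

```python
import csv
import io


def fill_rates(rows, fieldnames):
    """Return the share of non-empty values for each attribute."""
    total = 0
    filled = {name: 0 for name in fieldnames}
    for row in rows:
        total += 1
        for name in fieldnames:
            value = row.get(name)
            # Treat missing values and blank strings as "not filled".
            if value is not None and value.strip() != "":
                filled[name] += 1
    if total == 0:
        return {name: 0.0 for name in fieldnames}
    return {name: filled[name] / total for name in fieldnames}


# Hypothetical sample standing in for a real dataset file.
SAMPLE = """country,date,cases,vaccinated
CZ,2020-09-01,500,
SK,2020-09-01,,
AT,2020-09-01,300,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rates = fill_rates(reader, reader.fieldnames)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} filled")
```

For a real file, the `io.StringIO(SAMPLE)` part would be replaced by an opened CSV file. An attribute that appears in the schema but has a fill rate close to zero, like the hypothetical `vaccinated` column here, is a warning sign that the dataset may not contain the information you need.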
Try to look for alternatives. Are similar data provided elsewhere, e.g. for another country or another domain (not covid, but cholera)?
Relevant public data sources include existing datasets, existing ontologies, books, web pages, etc. Choose at least three related datasets that are provided by different parties (organizations). Two datasets are related if they overlap on a topic but are not technically integrated (they do not share the same identifiers). An example of two related datasets is the Safety Accident Database by the Air Investigation Institute and the Aircraft register by the Czech Civil Aviation Authority.