# Final Assignment 2023

## Summary

For the final assignment, you will form groups, find a suitable dataset (or several) and perform a statistical analysis to answer a complex question using available data. Ideally, start by posing the question and then go on to find appropriate datasets (online or offline).

The goal is for you to acquire an understanding of the whole statistical process from the very beginning to the very end. You should understand that, as statisticians, you will be the ones to formalize real-world questions, you will never achieve clear and perfect results, and there will always be things out of your control. Nevertheless, you have to do your best to apply formal methods to real and important problems and convince both expert and lay audiences of your conclusions.

Conceptually, you should go through the following steps to complete the assignment:

1. formulate a question Q,
2. formulate a plan to answer Q,
3. perform necessary data manipulations (cleaning, …),
4. gather arguments to answer Q,
5. interpret the results (formulate the answer),
6. judge shortcomings of your work,
7. create a report to communicate your findings and
8. present your findings.

Throughout the process, there are several checkpoints where you will receive feedback, so you are not left on your own. The first checkpoint is finding your question (step 1), where we will try to calibrate the difficulty of the assignment with you. The second checkpoint is formulating a plan (step 2), which is a substantial part of your work and which you will turn in as a standalone deliverable. Following this, you should have a team consultation with a tutor. The third checkpoint is your report (step 7), which will be read and reviewed by another team of your peers. In turn, you will be asked to review another team's report. Finally, using the feedback on your report, you will prepare a final presentation (step 8).

## What Could Be a Suitable Problem?

Apart from going through the whole statistical process yourself, we want you to try applying techniques from SAN. So, think a bit about what you have learned. Do not limit yourself to the obvious methods such as linear models and classifiers: you can also try power analysis to judge how much data you need, outlier analysis to see if some data might be wrong, robust methods to deal with noisy information, etc.
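To make two of these techniques concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a prescribed method. The function names `sample_size_two_groups` and `mad_outliers` are hypothetical. The sample size formula is the usual normal approximation for a two-sided comparison of two group means with a standardized effect size, and the outlier rule uses the commonly cited modified z-score with a cut-off of 3.5. Your own project may call for different choices.

```python
import math
from statistics import NormalDist, median


def sample_size_two_groups(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for comparing two means.

    Normal approximation (an assumption of this sketch):
        n = 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2
    where d is the standardized effect size (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)          # desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    which are far less sensitive to extreme values than mean and variance.
    The constants 0.6745 and 3.5 are conventional choices, not requirements.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


if __name__ == "__main__":
    # Observations per group needed to detect a medium effect (d = 0.5).
    print(sample_size_two_groups(0.5))
    # Which measurements look suspicious?
    print(mad_outliers([10, 11, 9, 10, 12, 10, 50]))
```

A calculation like the first one is useful while writing your plan, because it tells you early whether a candidate dataset is large enough to answer your question at all.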

Some examples of suitable types of questions for this assignment are:

• a general statement about a group (answered by observing individuals), e.g. “How do individual components of bad lifestyle (diet, lack of exercise, sleeping habits, etc.) interact in causing health complications later in life?”,
• describing unobservable properties of some process (latent variables), e.g. “What is the rate of requests for a datacenter? Does it vary with the weather and time of the week? Could the results be used for optimizing its operation?”,
• creating a causal model of some process, e.g. “How does the effect of socioeconomic factors propagate with children throughout the years and levels of education?”,
• providing new insight, e.g. “Can we identify specific cohorts in visitors of a website, create a predictive model and use it to tailor the UI to the specific user's behaviour?”.

Some ideas for where to look for topic inspiration:

• try to resolve some controversial public disputes (surrounding climate change, ecological policies, access to education, financing of science in ČR, …),
• provide some analysis that could serve as the basis for public policy-making, maybe even a recommendation (planning hospital capacities, different kinds of public transport and closing the city centre to automobile traffic, …),
• provide an analysis that could help individual decision-making (features of success of academic publications based on citation network data, …),
• try to solve some open data-related questions/problems proposed in scientific literature, or ask people who could have such a problem (small company, your colleagues, …),
• open statistical competitions/hackathons (e.g. https://statistics-awards.eu/nowcasting/, https://www.hackhealth.eu…)

If you really do not have any idea or personal interest, you can look for inspiration at the following places, where you can find many interesting datasets:

We encourage students to be creative. Even alternative ways to pass the assignment are possible as long as the ideas of the assignment are preserved. For example, participation of the whole team in a statistics-oriented challenge of the HackHealth hackathon would fulfil the requirements. Note: it is hard to tell from the brief descriptions how statistics-oriented a challenge will be in reality. Out of this year's challenges, only “Automating Albuminuria Screening for CKD” looks to be statistically complex enough, but you would have to show that your work was in the spirit of this assignment.

## Teams

Students should form teams of 4 people (or 3 when the number of students in a class is not divisible by 4). The team organizes the work among its members and reports contributions, including a % share of work by individual members, as part of the individual work items. The expected workload per person is about 20 hours, so the total per team could be up to 80 hours, which is enough for a really nice piece of work. Make it count!

## Work Outcomes ("Deliverables")

The zeroth submission we want from you (see the Steps section) is a text file with a few sentences stating your general question, so that we can quickly give you feedback before you start working on your plan. The rest of the outcomes should be submitted as PDF documents and are as follows:

1. a proposal of the project (question and plan) (max. 2 pages)
Your proposal should contain your research question, the datasets you have identified and want to use, and a plan for your work. The plan should contain the specific steps you want to take, the techniques you will need and an outline of the formal arguments you will use to answer your question. Don't forget to think about the problems you could encounter, including those you might not even be aware of (some confounding, errors in data, …?).
2. a report of the work (about 8-15 pages)
You should document your progress, partial results, the decisions you had to make in the course of the work and your reasoning. We expect much of your report to consist of informative figures and the like rather than dense text. However, the report should have a proper form (e.g. be structured, all figures should be described and have labelled axes, text should be reasonably formatted, etc.), be easy to read (include problem statements, summaries) and contain all necessary detail. You can think of this report as a presentation of your work to an expert audience (your reviewers). The report should also contain a contribution statement, where you describe who did what in a few sentences, including the % work share.
3. a review report on the work of another team (1 page)
You will review the work of another team. Your review should contain a summary of the problem, approach and results. Then, you should judge the overall quality of the work: whether the approach was sound, whether you identified any error in the realization of the plan or in the final argumentation, etc. Give three or four critical remarks or questions for the team to answer in their presentation.
4. a presentation given by the team, including response(s) to the review(s).
You will present your findings to your peers. You should prepare a 10-minute presentation in which you introduce your problem in natural language (very important), briefly say what you did to solve it (not so important), and finally present your findings, arguments and conclusions (very important). Think of this part as a presentation to a lay audience interested in the problem itself and its solution rather than the method. Focus, therefore, on communication, not technicalities. You should prepare a few slides and/or (ideally and) an A1/A2 poster (you do not have to print it!).

After the presentation, you will report in BRUTE the amount of work per member in the whole project, quantified in percentages. This will act as a basis for the assignment of points.

## Timeline

### Steps

1. Team forming.
2. A few sentences proposing the work topic: we will provide feedback to calibrate its difficulty before you start working on a plan. (submitted as txt in BRUTE)
3. You will hand in your plan (D1) at least a day before a consultation. Consultation will be per team with a tutor. We will specify some timeslots, and teams will sign up individually.
4. You will hand in your report (D2) one week before the review deadline.
5. You will hand in your review (D3) a few days before the presentation.
6. You will present your findings and then submit the presented slides and poster (D4).

### Steps in Time

1. lab, week starting 13.11.
2. 19.11. (three weeks before consultation)
3. 10.12.
4. 1.1. (two weeks before 5.)
5. 5.1. (a few days before 6.)
6. last tutorial

### Time Allocation

Generally, you can allocate your time as you wish. The table below shows the time allocation we expect, which you can use as inspiration. The planning part is so large because we expect you might want to inspect your dataset a bit, so some data manipulation may already take place at this stage.

| Deliverable | Time per team | Time per student (4-person team) |
|---|---|---|
| Plan | 20 h | 5 h |
| Work & Report | 34 h | 8.5 h |
| Review | 10 h | 2.5 h |
| Presentation | 16 h | 4 h |
| Total | 80 h | 20 h |