HW 0 – Topic description and list of analytical tasks
(5 points)
Choose the project topic (for example, an educational, commerce, or social online platform) (0.2 p.).
The topic does not have to be unique. The same subject domain across teams is allowed; however, each team’s dataset and task set must be original and not identical to others.
Tip for choosing a topic:
Your topic must naturally have Users, Objects, Events (time-stamped actions), and Relationships/graph (e.g., user↔user, user↔object).
You can produce ≥5k users, ≥10k objects, ≥100k events, ≥10k relationships, with ≥90 days of timestamps, diurnal/weekly seasonality, and Zipf/heavy-tail skew.
You can define ≥10/15/20 original queries (solo/pair/trio) covering time filters, aggregates, top-N, timelines, graph; each with goal, inputs/period, outputs, granularity, sorting/top-N, frequency, acceptance.
Describe the domain and the participants (1 p.):
Describe the type of platform and the key user roles (for example, student/instructor; buyer/seller; reader/author).
Describe the types of objects (courses, products, posts/groups, etc.).
Describe the events/actions (view, purchase, comment, subscription, course completion, etc.).
Draw an ER diagram (or an equivalent UML Class/Relationship model) with the key entities and relationships.
Formulate in words the analytical tasks that may be relevant for your platform (2 p.).
Each team’s task set must be original (not identical to other teams). The same subject domain is permitted.
For each task, provide the following template:
Business goal,
Input (conditions/filters/period),
Output (fields/metrics/aggregates),
Granularity (by days/users/objects),
Sorting / top-N,
Execution frequency (one-off/daily),
Acceptance criterion (what “done/correct” means; include the expected order of magnitude of result cardinality).
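One way to keep every task in the same shape is to record the template as a small data structure. The sketch below is only an illustration, not a required format: the class name `AnalyticalTask`, its field names, and the example values (including the result size in the acceptance text) are assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticalTask:
    """One analytical task, one field per template item (names are illustrative)."""
    business_goal: str
    inputs: str          # conditions / filters / period
    outputs: str         # fields / metrics / aggregates
    granularity: str     # by days / users / objects
    sorting_top_n: str
    frequency: str       # one-off / daily
    acceptance: str      # what "done/correct" means, incl. expected result size

    def missing_fields(self) -> list[str]:
        """Return the names of template items that were left empty."""
        return [name for name, value in vars(self).items() if not value.strip()]


# Hypothetical example task; the wording and numbers are placeholders.
example = AnalyticalTask(
    business_goal="Track the popularity of a single object over time",
    inputs="object O, last 30 days",
    outputs="date, view count",
    granularity="by day",
    sorting_top_n="date ascending, no top-N",
    frequency="daily",
    acceptance="exactly 30 rows, one per day; counts match the raw events",
)
print(example.missing_fields())
```

Calling `missing_fields()` on each task before submission is a quick way to spot a template item that was skipped.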
Examples of acceptable task formulations:
Find users from country X who performed action Y during the last calendar month; return user_id, name, number of actions, ordered by the number of actions in descending order.
Get the chronology of actions for user U for the last 7 days; return a timeline with action type and metadata.
Count daily views of object O for the last 30 days; return date and count.
Identify users related to object Z through a joint project/participation; return the list of users and the type of relationship.
Build the top-10 most active users for the week; return user_id, name, number of actions, type of activity.
Find community clusters among active users; return the cluster lists and numerical centrality metrics.
There must be at least 10 / 15 / 20 tasks (for teams of 1 / 2 / 3 participants), covering time filtering, aggregates, top-N, action histories, and relationships/graph relations.
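To check that a formulation is precise enough, it can help to prototype it against a handful of events. The sketch below does this for the "daily views of object O for the last 30 days" example. It assumes events are dictionaries with `object_id`, `action`, and `timestamp` keys, that "last 30 days" means a window ending on a given day inclusive, and that days without views are reported with a count of 0. All of these are choices made here and should be stated explicitly in your own task.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def daily_views(events, object_id, today, days=30):
    """Return [(day, count)] of 'view' events on object_id for the window ending at today."""
    start = today - timedelta(days=days - 1)
    counts = Counter(
        e["timestamp"].date()
        for e in events
        if e["object_id"] == object_id
        and e["action"] == "view"
        and start <= e["timestamp"].date() <= today
    )
    # Emit every day of the window, including days with zero views.
    window = [start + timedelta(days=i) for i in range(days)]
    return [(day, counts[day]) for day in window]


# Tiny hand-made sample (hypothetical data).
sample = [
    {"object_id": 7, "action": "view", "timestamp": datetime(2025, 10, 12, 9, 0)},
    {"object_id": 7, "action": "view", "timestamp": datetime(2025, 10, 12, 21, 0)},
    {"object_id": 7, "action": "comment", "timestamp": datetime(2025, 10, 12, 10, 0)},
]
rows = daily_views(sample, object_id=7, today=date(2025, 10, 12))
print(rows[-1])
```

Writing the prototype forces the granularity, the period boundaries, and the expected result cardinality (here always `days` rows) to be spelled out, which is what the acceptance criterion asks for.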
Define the data structure (without DBMS terms) (0.6 p.):
Users: user_id, name, country, registration_date, …
Objects: object_id, type, name/title, attributes, …
Events: event_id, user_id, object_id, action, timestamp, details.
Relationships: user_id1, user_id2, relation_type, date.
Specify identifiers and natural uniqueness rules; briefly state key cardinalities and typical value distributions.
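The four record types and their uniqueness rules can be written down as plain data classes plus two small checks. This is a minimal sketch: the field lists follow the outline above with the elided attributes left out, and the helper names `duplicate_keys` and `unknown_user_ids` are assumptions introduced here, not part of the assignment.

```python
import datetime as dt
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: int
    name: str
    country: str
    registration_date: dt.date


@dataclass(frozen=True)
class Object:
    object_id: int
    type: str
    title: str


@dataclass(frozen=True)
class Event:
    event_id: int
    user_id: int
    object_id: int
    action: str
    timestamp: dt.datetime
    details: str = ""


@dataclass(frozen=True)
class Relationship:
    user_id1: int
    user_id2: int
    relation_type: str
    date: dt.date


def duplicate_keys(rows, key):
    """Return the key values that occur more than once in rows."""
    counts = Counter(key(r) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def unknown_user_ids(events, users):
    """Return user_ids referenced by events that do not exist in users."""
    known = {u.user_id for u in users}
    return sorted({e.user_id for e in events} - known)
```

For example, `duplicate_keys(users, lambda u: u.user_id)` should return an empty list if `user_id` is a valid identifier. For relationships, one possible natural key (an assumption) is the triple `(user_id1, user_id2, relation_type)`.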
Generate/collect the dataset (0.9 p.):
Prepare CSV/JSON files:
The generator must be parameterizable (seed, N_users, N_objects, N_events, date range, activity/popularity skew).
Base cardinalities: Users ≥ 5,000; Objects ≥ 10,000; Events ≥ 100,000; Relationships ≥ 10,000.
Stretch cardinalities: Users ≥ 20,000; Objects ≥ 50,000; Events ≥ 500,000; Relationships ≥ 50,000.
Time coverage ≥ 90 days with realistic seasonality; use skewed distributions (Zipf/power-law).
Reproducibility: provide a one-command HOWTO; fix the exact parameters and seed in the report.
Same domain allowed; datasets and task sets must be original (not identical).
Describe the generation rules/sources, value ranges, and the share of “active”/“passive,” etc.
You may use generators or open datasets (cite sources).
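The requirements above (seed, sizes, date range, skew, seasonality) can be met with the standard library alone. The following is a minimal sketch of an event generator, not a reference implementation: the function names, the action mix, the weekend weight of 0.6, and the hourly weights are all assumptions chosen for illustration. It expects `start` to be midnight of the first day.

```python
import csv
import random
from datetime import datetime, timedelta


def zipf_weights(n, s):
    """Unnormalised Zipf weights: rank r gets weight 1 / r**s."""
    return [1.0 / rank ** s for rank in range(1, n + 1)]


def generate_events(seed=42, n_users=5_000, n_objects=10_000, n_events=100_000,
                    start=datetime(2025, 1, 1), days=90, skew=1.1):
    """Generate event rows with heavy-tailed users/objects and weekly/diurnal seasonality."""
    rng = random.Random(seed)  # private RNG: the same seed gives the same dataset
    users = rng.choices(range(1, n_users + 1), weights=zipf_weights(n_users, skew), k=n_events)
    objects = rng.choices(range(1, n_objects + 1), weights=zipf_weights(n_objects, skew), k=n_events)
    actions = rng.choices(["view", "comment", "purchase"], weights=[90, 8, 2], k=n_events)
    # Weekly seasonality (assumed): weekends carry 60% of a weekday's activity.
    day_w = [0.6 if (start + timedelta(days=d)).weekday() >= 5 else 1.0 for d in range(days)]
    # Diurnal seasonality (assumed): quiet nights 0-6h, evening peak 18-22h.
    hour_w = [0.2] * 7 + [1.0] * 11 + [2.0] * 5 + [0.5]
    day_offsets = rng.choices(range(days), weights=day_w, k=n_events)
    hours = rng.choices(range(24), weights=hour_w, k=n_events)
    events = []
    for i in range(n_events):
        ts = start + timedelta(days=day_offsets[i], hours=hours[i], seconds=rng.randrange(3600))
        events.append({"event_id": i + 1, "user_id": users[i], "object_id": objects[i],
                       "action": actions[i], "timestamp": ts.isoformat(), "details": ""})
    return events


def write_csv(rows, path):
    """Write a non-empty list of dict rows to a CSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `write_csv(generate_events(seed=42), "events.csv")` would produce the base-size events file. Rows are not sorted by timestamp here; sort them before writing if your tasks rely on chronological `event_id` values.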
Ensure reproducibility (0.3 p.):
Fix the version of the generator/scripts and the random seed.
Provide a short “how to reconstruct the data” section and a single command/script to regenerate all files end-to-end.
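One optional way to demonstrate that regeneration really reproduces the data is to publish a checksum for each generated file. The sketch below is a suggestion rather than a stated requirement. It assumes the generated files are CSVs in a single folder, and the function names `sha256_of` and `manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def manifest(folder):
    """Map each CSV file name in folder to its SHA-256 digest."""
    return {p.name: sha256_of(p) for p in sorted(Path(folder).glob("*.csv"))}
```

If the digests listed in the report match those computed after rerunning the generator with the same seed and parameters, the dataset was rebuilt byte for byte.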
Submit to the BRUTE system:
Hw0.docx file.
datasets folder (archive) — generation scripts and the datasets themselves (e.g., <login>_hw0_datasets.zip).
Deadline: Sunday 12. 10. 2025 until 23:59.
EXAMPLE of the data structure and tasks description
courses/be4m36ds2/homework/hw0.txt · Last modified: 2025/10/02 17:46 by prokoyul