====== HW 0 – Topic description and list of analytical tasks ======

**(5 points)**

  - Choose the project topic (for example, an educational, commerce, or social online platform) (0.2 p.):
    * The topic does not have to be unique. The same subject domain across teams is allowed; however, each team's dataset and task set must be original and not identical to other teams'.
    * **Tip for choosing a topic:**
      * Your topic must naturally have **Users**, **Objects**, **Events** (time-stamped actions), and **Relationships**/graph (e.g., user↔user, user↔object).
      * You should be able to produce ≥5k users, ≥10k objects, ≥100k events, and ≥10k relationships, with ≥90 days of timestamps, diurnal/weekly seasonality, and Zipf/heavy-tail skew.
      * You should be able to define ≥10/15/20 original queries (for solo/pair/trio teams) covering time filters, aggregates, top-N, timelines, and graph queries, each with a goal, inputs/period, outputs, granularity, sorting/top-N, frequency, and an acceptance criterion.
  - Describe the domain and the participants (1 p.):
    * Describe the type of platform and the key user roles (for example, student/instructor; buyer/seller; reader/author).
    * Describe the types of objects (courses, products, posts/groups, etc.).
    * Describe the events/actions (view, purchase, comment, subscription, course completion, etc.).
    * Draw an ER diagram (or an equivalent UML Class/Relationship model) with the key entities and relationships.
  - Formulate in words the analytical tasks that may be relevant for your platform (2 p.):
    * Each team's task set must be original (not identical to other teams'). The same subject domain is permitted.
    * Describe each task using the following template:
      * Business goal,
      * Input (conditions/filters/period),
      * Output (fields/metrics/aggregates),
      * Granularity (by days/users/objects),
      * Sorting / top-N,
      * Execution frequency (one-off/daily),
      * Acceptance criterion (what “done/correct” means; include the expected order of magnitude of the result cardinality).
    * Examples of acceptable task formulations:
      * Find users from country X who performed action Y during the last calendar month; return user_id, name, and the number of actions, ordered by the number of actions in descending order.
      * Get the chronology of actions of user U for the last 7 days; return a timeline with action type and metadata.
      * Count daily views of object O for the last 30 days; return date and count.
      * Identify users related to object Z through a joint project/participation; return the list of users and the type of relationship.
      * Build the top-10 most active users of the week; return user_id, name, number of actions, and type of activity.
      * Find community clusters among active users; return the cluster lists and numerical centrality metrics.
    * There must be at least 10 / 15 / 20 tasks (for teams of 1 / 2 / 3 participants), covering time filtering, aggregates, top-N, action histories, and relationships/graph relations.
  - Define the data structure (without DBMS terms) (0.6 p.):
    * Users: user_id, name, country, registration_date, …
    * Objects: object_id, type, name/title, attributes, …
    * Events: event_id, user_id, object_id, action, timestamp, details.
    * Relationships: user_id1, user_id2, relation_type, date.
    * Specify identifiers and natural uniqueness rules; briefly state the key cardinalities and the typical value distributions.
  - Generate/collect the dataset (0.9 p.); an illustrative generator sketch covering these points is given after this list:
    * Prepare CSV/JSON files:
      * The generator must be parameterizable (seed, N_users, N_objects, N_events, date range, activity/popularity skew).
      * Base cardinalities: Users ≥ 5,000; Objects ≥ 10,000; Events ≥ 100,000; Relationships ≥ 10,000.
      * Stretch cardinalities: Users ≥ 20,000; Objects ≥ 50,000; Events ≥ 500,000; Relationships ≥ 50,000.
      * Time coverage ≥ 90 days with realistic seasonality; use skewed distributions (Zipf/power-law).
      * Reproducibility: provide a one-command HOWTO; fix the exact parameters and the seed in the report.
      * The same subject domain is allowed; datasets and task sets must be original (not identical to other teams').
    * Describe the generation rules/sources, the value ranges, the share of “active”/“passive” users, etc.
    * You may use generators or open datasets (cite the sources).
  - Ensure reproducibility (0.3 p.):
    * Fix the version of the generator/scripts and the random seed.
    * Provide a short “how to reconstruct the data” section and a single command/script that regenerates all files end to end (a sketch of a simple hash-based check is given at the end of this page).
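Below is a minimal, illustrative sketch (Python, standard library only) of what such a parameterizable generator could look like. It is not a prescribed implementation: the file names (''users.csv'', ''objects.csv'', ''events.csv'', ''relationships.csv''), column layouts, action/relation categories, default values, and the particular Zipf and seasonality formulas are placeholder assumptions that mirror the data structure above; adapt everything to your own domain.

<code python>
"""Illustrative HW0 dataset generator sketch (standard library only).

All file names, columns, categories, and formulas below are placeholder
assumptions, not a prescribed implementation; adapt them to your domain.
"""
import argparse
import csv
import math
import random
from datetime import datetime, timedelta
from itertools import accumulate


def zipf_weights(n: int, s: float = 1.1) -> list[float]:
    # Heavy-tailed (Zipf-like) weights: the k-th item gets weight 1 / k^s.
    return [1.0 / k ** s for k in range(1, n + 1)]


def seasonality(ts: datetime) -> float:
    # Diurnal + weekly pattern: peak around midday, less traffic on weekends.
    hour_factor = 1.0 + 0.8 * math.sin(math.pi * ts.hour / 24.0)   # max 1.8
    weekday_factor = 0.6 if ts.weekday() >= 5 else 1.0
    return hour_factor * weekday_factor


def generate(seed, n_users, n_objects, n_events, n_relations, start, days, out_dir="."):
    rng = random.Random(seed)  # fixed seed => fully reproducible output

    users = [(f"u{i}", f"user_{i}", rng.choice(["CZ", "DE", "PL", "US"]),
              (start - timedelta(days=rng.randint(0, 365))).date().isoformat())
             for i in range(1, n_users + 1)]
    objects = [(f"o{i}", rng.choice(["course", "article", "quiz"]), f"object_{i}", "{}")
               for i in range(1, n_objects + 1)]

    # Cumulative Zipf weights: a few very active users / very popular objects.
    user_cw = list(accumulate(zipf_weights(n_users)))
    object_cw = list(accumulate(zipf_weights(n_objects)))

    events = []
    for i in range(1, n_events + 1):
        # Rejection sampling of timestamps to impose the diurnal/weekly pattern.
        while True:
            ts = start + timedelta(seconds=rng.uniform(0, days * 86400))
            if rng.random() < seasonality(ts) / 1.8:
                break
        u = rng.choices(users, cum_weights=user_cw, k=1)[0]
        o = rng.choices(objects, cum_weights=object_cw, k=1)[0]
        action = rng.choice(["view", "comment", "purchase", "complete"])
        events.append((f"e{i}", u[0], o[0], action, ts.isoformat(timespec="seconds"), "{}"))

    relations = set()
    while len(relations) < n_relations:  # the set deduplicates identical relation tuples
        a, b = rng.sample(users, 2)
        relations.add((a[0], b[0], rng.choice(["friend", "follows", "co_project"]),
                       (start + timedelta(days=rng.randint(0, days - 1))).date().isoformat()))

    tables = {
        "users.csv": (["user_id", "name", "country", "registration_date"], users),
        "objects.csv": (["object_id", "type", "title", "attributes"], objects),
        "events.csv": (["event_id", "user_id", "object_id", "action", "timestamp", "details"], events),
        "relationships.csv": (["user_id1", "user_id2", "relation_type", "date"], sorted(relations)),
    }
    for name, (header, rows) in tables.items():
        with open(f"{out_dir}/{name}", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="HW0 dataset generator (sketch)")
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--users", type=int, default=5_000)
    p.add_argument("--objects", type=int, default=10_000)
    p.add_argument("--events", type=int, default=100_000)
    p.add_argument("--relations", type=int, default=10_000)
    p.add_argument("--days", type=int, default=90)
    p.add_argument("--start", default="2025-07-01")
    a = p.parse_args()
    generate(a.seed, a.users, a.objects, a.events, a.relations,
             datetime.fromisoformat(a.start), a.days)
</code>

With such a script, the one-command regeneration required in the report could look like ''python generate.py --seed 42 --users 5000 --objects 10000 --events 100000 --relations 10000 --days 90 --start 2025-07-01'' (the script name and flags are placeholders matching the sketch above).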
Submit to the BRUTE system:
  * Hw0.docx file.
  * datasets folder (archive) with the generation scripts and the datasets themselves (e.g., _hw0_datasets.zip).
  * Deadline: **Sunday 12. 10. 2025** until 23:59.

**[[https://docs.google.com/document/d/1eIDT8An8FHA_okTBddTVZ5R89Hd05VMWaqwVIhvrqt4|EXAMPLE of the data structure and tasks description]]**
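One simple way to demonstrate the reproducibility requirement is to regenerate the data with the fixed seed and parameters and compare content hashes of the output files; two runs should produce identical digests. The short sketch below assumes the placeholder file names from the generator sketch above (''sha256sum *.csv'' on the command line serves the same purpose).

<code python>
"""Sketch of a reproducibility check: hash the generated files so that two
runs with the same seed/parameters can be compared byte for byte.
File names are the placeholders used in the generator sketch above."""
import hashlib
from pathlib import Path

FILES = ["users.csv", "objects.csv", "events.csv", "relationships.csv"]

for name in FILES:
    digest = hashlib.sha256(Path(name).read_bytes()).hexdigest()
    print(f"{digest}  {name}")  # same seed and parameters => same digests
</code>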