====== HW 0 – Topic description and list of analytical tasks ======

**(5 points)**

  - Choose the project topic (for example, an educational, commerce, or social online platform) (0.2 p.):
    * The topic does not have to be unique. The same subject domain across teams is allowed; however, each team's dataset and task set must be original and not identical to other teams'.
    * **Tip for choosing a topic:**
      * Your topic must naturally have **Users**, **Objects**, **Events** (time-stamped actions), and **Relationships**/graph (e.g., user↔user, user↔object).
      * You should be able to produce ≥5k users, ≥10k objects, ≥100k events, and ≥10k relationships, with ≥90 days of timestamps, diurnal/weekly seasonality, and Zipf/heavy-tail skew.
      * You should be able to define ≥10/15/20 original queries (for solo/pair/trio teams) covering time filters, aggregates, top-N, timelines, and graph queries, each with a goal, inputs/period, outputs, granularity, sorting/top-N, frequency, and an acceptance criterion.
  - Describe the domain and the participants (1 p.):
    * Describe the type of platform and the key user roles (for example, student/instructor; buyer/seller; reader/author).
    * Describe the types of objects (courses, products, posts/groups, etc.).
    * Describe the events/actions (view, purchase, comment, subscription, course completion, etc.).
    * Draw an ER diagram (or an equivalent UML Class/Relationship model) with the key entities and relationships.
  - Formulate in words the analytical tasks that may be relevant for your platform (2 p.):
    * Each team's task set must be original (not identical to other teams'). The same subject domain is permitted.
    * Describe each task using the following template:
      * Business goal,
      * Input (conditions/filters/period),
      * Output (fields/metrics/aggregates),
      * Granularity (by days/users/objects),
      * Sorting / top-N,
      * Execution frequency (one-off/daily),
      * Acceptance criterion (what “done/correct” means; include the expected order of magnitude of the result cardinality).
    * Examples of acceptable task formulations:
      * Find users from country X who performed action Y during the last calendar month; return user_id, name, and the number of actions, ordered by the number of actions in descending order.
      * Get the chronology of actions of user U for the last 7 days; return a timeline with action type and metadata.
      * Count daily views of object O for the last 30 days; return date and count.
      * Identify users related to object Z through a joint project/participation; return the list of users and the type of relationship.
      * Build the top-10 most active users of the week; return user_id, name, number of actions, and type of activity.
      * Find community clusters among active users; return the cluster lists and numerical centrality metrics.
    * There must be at least 10 / 15 / 20 tasks (for teams of 1 / 2 / 3 participants), covering time filtering, aggregates, top-N, action histories, and relationships/graph relations.
  - Define the data structure (without DBMS terms) (0.6 p.):
    * Users: user_id, name, country, registration_date, …
    * Objects: object_id, type, name/title, attributes, …
    * Events: event_id, user_id, object_id, action, timestamp, details.
    * Relationships: user_id1, user_id2, relation_type, date.
    * Specify identifiers and natural uniqueness rules; briefly state the key cardinalities and the typical value distributions.
  - Generate/collect the dataset (0.9 p.); an illustrative generator sketch covering these points is given after this list:
    * Prepare CSV/JSON files:
      * The generator must be parameterizable (seed, N_users, N_objects, N_events, date range, activity/popularity skew).
      * Base cardinalities: Users ≥ 5,000; Objects ≥ 10,000; Events ≥ 100,000; Relationships ≥ 10,000.
      * Stretch cardinalities: Users ≥ 20,000; Objects ≥ 50,000; Events ≥ 500,000; Relationships ≥ 50,000.
      * Time coverage ≥ 90 days with realistic seasonality; use skewed distributions (Zipf/power-law).
      * Reproducibility: provide a one-command HOWTO; fix the exact parameters and the seed in the report.
      * The same subject domain is allowed; datasets and task sets must be original (not identical to other teams').
    * Describe the generation rules/sources, the value ranges, the share of “active”/“passive” users, etc.
    * You may use generators or open datasets (cite the sources).
  - Ensure reproducibility (0.3 p.):
    * Fix the version of the generator/scripts and the random seed.
    * Provide a short “how to reconstruct the data” section and a single command/script that regenerates all files end to end (a sketch of a simple hash-based check is given at the end of this page).
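Below is a minimal, illustrative sketch (Python, standard library only) of what such a parameterizable generator could look like. It is not a prescribed implementation: the file names (''users.csv'', ''objects.csv'', ''events.csv'', ''relationships.csv''), column layouts, action/relation categories, default values, and the particular Zipf and seasonality formulas are placeholder assumptions that mirror the data structure above; adapt everything to your own domain.

<code python>
"""Illustrative HW0 dataset generator sketch (standard library only).

All file names, columns, categories, and formulas below are placeholder
assumptions, not a prescribed implementation; adapt them to your domain.
"""
import argparse
import csv
import math
import random
from datetime import datetime, timedelta
from itertools import accumulate


def zipf_weights(n: int, s: float = 1.1) -> list[float]:
    # Heavy-tailed (Zipf-like) weights: the k-th item gets weight 1 / k^s.
    return [1.0 / k ** s for k in range(1, n + 1)]


def seasonality(ts: datetime) -> float:
    # Diurnal + weekly pattern: peak around midday, less traffic on weekends.
    hour_factor = 1.0 + 0.8 * math.sin(math.pi * ts.hour / 24.0)   # max 1.8
    weekday_factor = 0.6 if ts.weekday() >= 5 else 1.0
    return hour_factor * weekday_factor


def generate(seed, n_users, n_objects, n_events, n_relations, start, days, out_dir="."):
    rng = random.Random(seed)  # fixed seed => fully reproducible output

    users = [(f"u{i}", f"user_{i}", rng.choice(["CZ", "DE", "PL", "US"]),
              (start - timedelta(days=rng.randint(0, 365))).date().isoformat())
             for i in range(1, n_users + 1)]
    objects = [(f"o{i}", rng.choice(["course", "article", "quiz"]), f"object_{i}", "{}")
               for i in range(1, n_objects + 1)]

    # Cumulative Zipf weights: a few very active users / very popular objects.
    user_cw = list(accumulate(zipf_weights(n_users)))
    object_cw = list(accumulate(zipf_weights(n_objects)))

    events = []
    for i in range(1, n_events + 1):
        # Rejection sampling of timestamps to impose the diurnal/weekly pattern.
        while True:
            ts = start + timedelta(seconds=rng.uniform(0, days * 86400))
            if rng.random() < seasonality(ts) / 1.8:
                break
        u = rng.choices(users, cum_weights=user_cw, k=1)[0]
        o = rng.choices(objects, cum_weights=object_cw, k=1)[0]
        action = rng.choice(["view", "comment", "purchase", "complete"])
        events.append((f"e{i}", u[0], o[0], action, ts.isoformat(timespec="seconds"), "{}"))

    relations = set()
    while len(relations) < n_relations:  # the set deduplicates identical relation tuples
        a, b = rng.sample(users, 2)
        relations.add((a[0], b[0], rng.choice(["friend", "follows", "co_project"]),
                       (start + timedelta(days=rng.randint(0, days - 1))).date().isoformat()))

    tables = {
        "users.csv": (["user_id", "name", "country", "registration_date"], users),
        "objects.csv": (["object_id", "type", "title", "attributes"], objects),
        "events.csv": (["event_id", "user_id", "object_id", "action", "timestamp", "details"], events),
        "relationships.csv": (["user_id1", "user_id2", "relation_type", "date"], sorted(relations)),
    }
    for name, (header, rows) in tables.items():
        with open(f"{out_dir}/{name}", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="HW0 dataset generator (sketch)")
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--users", type=int, default=5_000)
    p.add_argument("--objects", type=int, default=10_000)
    p.add_argument("--events", type=int, default=100_000)
    p.add_argument("--relations", type=int, default=10_000)
    p.add_argument("--days", type=int, default=90)
    p.add_argument("--start", default="2025-07-01")
    a = p.parse_args()
    generate(a.seed, a.users, a.objects, a.events, a.relations,
             datetime.fromisoformat(a.start), a.days)
</code>

With such a script, the one-command regeneration required in the report could look like ''python generate.py --seed 42 --users 5000 --objects 10000 --events 100000 --relations 10000 --days 90 --start 2025-07-01'' (the script name and flags are placeholders matching the sketch above).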
Submit to the BRUTE system:
  * Hw0.docx file.
  * datasets folder (archive) with the generation scripts and the datasets themselves (e.g., _hw0_datasets.zip).
  * Deadline: **Sunday 12. 10. 2025** until 23:59.

**[[https://docs.google.com/document/d/1eIDT8An8FHA_okTBddTVZ5R89Hd05VMWaqwVIhvrqt4|EXAMPLE of the data structure and tasks description]]**
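One simple way to demonstrate the reproducibility requirement is to regenerate the data with the fixed seed and parameters and compare content hashes of the output files; two runs should produce identical digests. The short sketch below assumes the placeholder file names from the generator sketch above (''sha256sum *.csv'' on the command line serves the same purpose).

<code python>
"""Sketch of a reproducibility check: hash the generated files so that two
runs with the same seed/parameters can be compared byte for byte.
File names are the placeholders used in the generator sketch above."""
import hashlib
from pathlib import Path

FILES = ["users.csv", "objects.csv", "events.csv", "relationships.csv"]

for name in FILES:
    digest = hashlib.sha256(Path(name).read_bytes()).hexdigest()
    print(f"{digest}  {name}")  # same seed and parameters => same digests
</code>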