HW 0 – Topic description and list of analytical tasks

(5 points)

  1. Choose the project topic (for example, an educational, commerce, or social online platform) (0.2 p.).
    • The topic does not have to be unique. The same subject domain across teams is allowed; however, each team’s dataset and task set must be original and not identical to others.
    • Tip for choosing a topic:
      • Your topic must naturally have Users, Objects, Events (time-stamped actions), and Relationships/graph (e.g., user↔user, user↔object).
      • Your dataset should support ≥5k users, ≥10k objects, ≥100k events, and ≥10k relationships, with ≥90 days of timestamps, diurnal/weekly seasonality, and Zipf/heavy-tail skew.
      • The topic should allow you to define ≥10/15/20 original queries (solo/pair/trio) covering time filters, aggregates, top-N, timelines, and graph traversals; each with a goal, inputs/period, outputs, granularity, sorting/top-N, frequency, and an acceptance criterion.
  2. Describe the domain and the participants (1 p.):
    • Describe the type of platform and the key user roles (for example, student/instructor; buyer/seller; reader/author).
    • Describe the types of objects (courses, products, posts/groups, etc.).
    • Describe the events/actions (view, purchase, comment, subscription, course completion, etc.).
    • Draw an ER diagram (or an equivalent UML Class/Relationship model) with the key entities and relationships.
  3. Formulate in words the analytical tasks that may be relevant for your platform (2 p.).
    • Each team’s task set must be original (not identical to any other team’s). The same subject domain is permitted.
    • Describe each task using the following template:
      • Business goal,
      • Input (conditions/filters/period),
      • Output (fields/metrics/aggregates),
      • Granularity (by days/users/objects),
      • Sorting / top-N,
      • Execution frequency (one-off/daily),
      • Acceptance criterion (what “done/correct” means; include the expected order of magnitude of result cardinality).
    • Examples of acceptable task formulations:
      • Find users from country X who performed action Y during the last calendar month; return user_id, name, number of actions, ordered by the number of actions in descending order.
      • Get the chronology of actions for user U for the last 7 days; return a timeline with action type and metadata.
      • Count daily views of object O for the last 30 days; return date and count.
      • Identify users related to object Z through a joint project/participation; return the list of users and the type of relationship.
      • Build the top-10 most active users for the week; return user_id, name, number of actions, type of activity.
      • Find community clusters among active users; return the cluster lists and numerical centrality metrics.
    • There must be at least 10 / 15 / 20 tasks (for teams of 1 / 2 / 3 participants), covering time filtering, aggregates, top-N, action histories, and relationships/graph relations.
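As a sanity check on a formulation like the top-10-active-users example above, a toy implementation in Python may help; the field layout (user_id, action, timestamp) follows section 4, while the sample data and function name are purely illustrative:

```python
from collections import Counter
from datetime import datetime, timedelta

# Toy events: (user_id, action, timestamp). Values are invented examples.
now = datetime(2025, 10, 1)
events = [
    ("u1", "view", now - timedelta(days=1)),
    ("u1", "comment", now - timedelta(days=2)),
    ("u2", "view", now - timedelta(days=3)),
    ("u1", "view", now - timedelta(days=20)),  # outside the 7-day window
]

def top_active_users(events, now, days=7, n=10):
    """Count events per user within the last `days` days; return top-n."""
    window_start = now - timedelta(days=days)
    counts = Counter(u for u, _, ts in events if ts >= window_start)
    return counts.most_common(n)

print(top_active_users(events, now))  # [('u1', 2), ('u2', 1)]
```

The acceptance criterion for such a task could then be phrased as "returns at most 10 rows, ordered by descending event count, counting only events inside the window."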
  4. Define the data structure (without DBMS terms) (0.6 p.):
    • Users: user_id, name, country, registration_date, …
    • Objects: object_id, type, name/title, attributes, …
    • Events: event_id, user_id, object_id, action, timestamp, details.
    • Relationships: user_id1, user_id2, relation_type, date.
    • Specify identifiers and natural uniqueness rules; briefly state key cardinalities and typical value distributions.
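A minimal sketch of two of these record types as Python dataclasses, including an identifier-uniqueness check; field names follow the lists above, while the sample values are invented:

```python
from dataclasses import dataclass
from datetime import date, datetime

# Record sketches mirroring the structure above; free-form fields such as
# object `attributes` and event `details` are kept as plain strings here.
@dataclass(frozen=True)
class User:
    user_id: int
    name: str
    country: str
    registration_date: date

@dataclass(frozen=True)
class Event:
    event_id: int
    user_id: int
    object_id: int
    action: str
    timestamp: datetime
    details: str

# Natural uniqueness rule: ids must be unique within each file.
users = [User(1, "Alice", "CZ", date(2025, 1, 5)),
         User(2, "Bob", "DE", date(2025, 2, 9))]
assert len({u.user_id for u in users}) == len(users)
```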
  5. Generate/collect the dataset (0.9 p.):
    • Prepare CSV/JSON files:
      • The generator must be parameterizable (seed, N_users, N_objects, N_events, date range, activity/popularity skew).
      • Base cardinalities: Users ≥ 5,000; Objects ≥ 10,000; Events ≥ 100,000; Relationships ≥ 10,000.
      • Stretch cardinalities: Users ≥ 20,000; Objects ≥ 50,000; Events ≥ 500,000; Relationships ≥ 50,000.
      • Time coverage ≥ 90 days with realistic seasonality; use skewed distributions (Zipf/power-law).
      • Reproducibility: provide a one-command HOWTO; fix the exact parameters and seed in the report.
      • Same domain allowed; datasets and task sets must be original (not identical).
      • Describe the generation rules/sources, value ranges, the share of “active” vs. “passive” users, etc.
      • You may use generators or open datasets (cite sources).
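A minimal sketch of such a parameterizable event generator, using only the Python standard library; the defaults and the exact seasonality/skew recipe below are assumptions, not a prescribed implementation:

```python
import csv
import random
from datetime import datetime, timedelta

def generate_events(path, seed=42, n_users=5000, n_objects=10000,
                    n_events=100000, start=datetime(2025, 7, 1), days=90,
                    zipf_a=1.2):
    """Write an events CSV with Zipf-skewed object popularity and a
    crude diurnal pattern. All parameters are illustrative defaults."""
    rng = random.Random(seed)  # fixed seed -> reproducible output
    # Zipf-like weights: object k gets weight 1/k**a (heavy tail).
    weights = [1.0 / (k ** zipf_a) for k in range(1, n_objects + 1)]
    objects = list(range(1, n_objects + 1))
    actions = ["view", "purchase", "comment"]
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["event_id", "user_id", "object_id", "action", "timestamp"])
        for eid in range(1, n_events + 1):
            day = rng.randrange(days)
            # Diurnal skew: evening hours (18-23) are drawn twice as often.
            hour = rng.choice(list(range(24)) + list(range(18, 24)))
            ts = start + timedelta(days=day, hours=hour,
                                   minutes=rng.randrange(60))
            w.writerow([eid,
                        rng.randrange(1, n_users + 1),
                        rng.choices(objects, weights)[0],
                        rng.choice(actions),
                        ts.isoformat()])
```

Calling the function twice with the same seed and parameters produces byte-identical files, which is exactly the reproducibility property item 6 asks for.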
  6. Ensure reproducibility (0.3 p.):
    • Fix the version of the generator/scripts and the random seed.
    • Provide a short “how to reconstruct the data” section and a single command/script to regenerate all files end-to-end.
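One way to provide that single entry point is a small driver script that fixes the seed before any generation runs; the flag names, defaults, and the `regenerate.py` file name below are hypothetical:

```python
import argparse
import random

def main():
    """Single reproducible entry point, e.g. `python regenerate.py --seed 42`.
    The actual generator calls are placeholders for your own code."""
    p = argparse.ArgumentParser(description="Rebuild all datasets end-to-end")
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--n-users", type=int, default=5000)
    p.add_argument("--n-events", type=int, default=100000)
    args = p.parse_args()
    random.seed(args.seed)  # fix the seed before any data is generated
    # ... call your generators here, writing users/objects/events/relations ...
    return args

if __name__ == "__main__":
    main()
```

Recording the exact command line (including the seed) in the report then makes the whole dataset reconstructible with one command.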

Submit to the BRUTE system:

  • The Hw0.docx file.
  • The datasets folder as an archive — generation scripts and the datasets themselves (e.g., <login>_hw0_datasets.zip).
  • Deadline: Sunday, 12.10.2025, 23:59.

EXAMPLE of the data structure and task descriptions

courses/be4m36ds2/homework/hw0.txt · Last modified: 2025/10/02 17:46 by prokoyul