HW 0 – Topic description and list of analytical tasks

(5 points)

  1. Choose the project topic (for example, an educational, commerce, or social online platform) (0.2 p.).
    • The topic does not have to be unique. The same subject domain across teams is allowed; however, each team’s dataset and task set must be original and not identical to others.
    • Tip for choosing a topic:
      • Your topic must naturally have Users, Objects, Events (time-stamped actions), and Relationships/graph (e.g., user↔user, user↔object).
      • Your dataset should support ≥5k users, ≥10k objects, ≥100k events, and ≥10k relationships, with ≥90 days of timestamps, diurnal/weekly seasonality, and Zipf/heavy-tail skew.
      • The topic should allow you to define ≥10/15/20 original queries (solo/pair/trio) covering time filters, aggregates, top-N, timelines, and graph traversals; each with a goal, inputs/period, outputs, granularity, sorting/top-N, frequency, and an acceptance criterion.
  2. Describe the domain and the participants (1 p.):
    • Describe the type of platform and the key user roles (for example, student/instructor; buyer/seller; reader/author).
    • Describe the types of objects (courses, products, posts/groups, etc.).
    • Describe the events/actions (view, purchase, comment, subscription, course completion, etc.).
    • Draw an ER diagram (or an equivalent UML Class/Relationship model) with the key entities and relationships.
  3. Formulate in words the analytical tasks that may be relevant for your platform (2 p.).
    • Each team’s task set must be original (not identical to any other team’s). The same subject domain is permitted.
    • Describe each task using the following template:
      • Business goal,
      • Input (conditions/filters/period),
      • Output (fields/metrics/aggregates),
      • Granularity (by days/users/objects),
      • Sorting / top-N,
      • Execution frequency (one-off/daily),
      • Acceptance criterion (what “done/correct” means; include the expected order of magnitude of result cardinality).
    • Examples of acceptable task formulations:
      • Find users from country X who performed action Y during the last calendar month; return user_id, name, number of actions, ordered by the number of actions in descending order.
      • Get the chronology of actions for user U for the last 7 days; return a timeline with action type and metadata.
      • Count daily views of object O for the last 30 days; return date and count.
      • Identify users related to object Z through a joint project/participation; return the list of users and the type of relationship.
      • Build the top-10 most active users for the week; return user_id, name, number of actions, type of activity.
      • Find community clusters among active users; return the cluster lists and numerical centrality metrics.
    • There must be at least 10 / 15 / 20 tasks (for teams of 1 / 2 / 3 participants), covering time filtering, aggregates, top-N, action histories, and relationships/graph relations.
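As a sanity check on a formulation like the top-10-active-users example above, a toy implementation in Python may help; the field layout (user_id, action, timestamp) follows section 4, while the sample data and function name are purely illustrative:

```python
from collections import Counter
from datetime import datetime, timedelta

# Toy events: (user_id, action, timestamp). Values are invented examples.
now = datetime(2025, 10, 1)
events = [
    ("u1", "view", now - timedelta(days=1)),
    ("u1", "comment", now - timedelta(days=2)),
    ("u2", "view", now - timedelta(days=3)),
    ("u1", "view", now - timedelta(days=20)),  # outside the 7-day window
]

def top_active_users(events, now, days=7, n=10):
    """Count events per user within the last `days` days; return top-n."""
    window_start = now - timedelta(days=days)
    counts = Counter(u for u, _, ts in events if ts >= window_start)
    return counts.most_common(n)

print(top_active_users(events, now))  # [('u1', 2), ('u2', 1)]
```

The acceptance criterion for such a task could then be phrased as "returns at most 10 rows, ordered by descending event count, counting only events inside the window."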
  4. Define the data structure (without DBMS terms) (0.6 p.):
    • Users: user_id, name, country, registration_date, …
    • Objects: object_id, type, name/title, attributes, …
    • Events: event_id, user_id, object_id, action, timestamp, details.
    • Relationships: user_id1, user_id2, relation_type, date.
    • Specify identifiers and natural uniqueness rules; briefly state key cardinalities and typical value distributions.
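A minimal sketch of two of these record types as Python dataclasses, including an identifier-uniqueness check; field names follow the lists above, while the sample values are invented:

```python
from dataclasses import dataclass
from datetime import date, datetime

# Record sketches mirroring the structure above; free-form fields such as
# object `attributes` and event `details` are kept as plain strings here.
@dataclass(frozen=True)
class User:
    user_id: int
    name: str
    country: str
    registration_date: date

@dataclass(frozen=True)
class Event:
    event_id: int
    user_id: int
    object_id: int
    action: str
    timestamp: datetime
    details: str

# Natural uniqueness rule: ids must be unique within each file.
users = [User(1, "Alice", "CZ", date(2025, 1, 5)),
         User(2, "Bob", "DE", date(2025, 2, 9))]
assert len({u.user_id for u in users}) == len(users)
```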
  5. Generate/collect the dataset (0.9 p.):
    • Prepare CSV/JSON files:
      • The generator must be parameterizable (seed, N_users, N_objects, N_events, date range, activity/popularity skew).
      • Base cardinalities: Users ≥ 5,000; Objects ≥ 10,000; Events ≥ 100,000; Relationships ≥ 10,000.
      • Stretch cardinalities: Users ≥ 20,000; Objects ≥ 50,000; Events ≥ 500,000; Relationships ≥ 50,000.
      • Time coverage ≥ 90 days with realistic seasonality; use skewed distributions (Zipf/power-law).
      • Reproducibility: provide a one-command HOWTO; fix the exact parameters and seed in the report.
      • Same domain allowed; datasets and task sets must be original (not identical).
      • Describe the generation rules/sources, value ranges, the share of “active” vs. “passive” users, etc.
      • You may use generators or open datasets (cite sources).
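A minimal sketch of such a parameterizable event generator, using only the Python standard library; the defaults and the exact seasonality/skew recipe below are assumptions, not a prescribed implementation:

```python
import csv
import random
from datetime import datetime, timedelta

def generate_events(path, seed=42, n_users=5000, n_objects=10000,
                    n_events=100000, start=datetime(2025, 7, 1), days=90,
                    zipf_a=1.2):
    """Write an events CSV with Zipf-skewed object popularity and a
    crude diurnal pattern. All parameters are illustrative defaults."""
    rng = random.Random(seed)  # fixed seed -> reproducible output
    # Zipf-like weights: object k gets weight 1/k**a (heavy tail).
    weights = [1.0 / (k ** zipf_a) for k in range(1, n_objects + 1)]
    objects = list(range(1, n_objects + 1))
    actions = ["view", "purchase", "comment"]
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["event_id", "user_id", "object_id", "action", "timestamp"])
        for eid in range(1, n_events + 1):
            day = rng.randrange(days)
            # Diurnal skew: evening hours (18-23) are drawn twice as often.
            hour = rng.choice(list(range(24)) + list(range(18, 24)))
            ts = start + timedelta(days=day, hours=hour,
                                   minutes=rng.randrange(60))
            w.writerow([eid,
                        rng.randrange(1, n_users + 1),
                        rng.choices(objects, weights)[0],
                        rng.choice(actions),
                        ts.isoformat()])
```

Calling the function twice with the same seed and parameters produces byte-identical files, which is exactly the reproducibility property item 6 asks for.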
  6. Ensure reproducibility (0.3 p.):
    • Fix the version of the generator/scripts and the random seed.
    • Provide a short “how to reconstruct the data” section and a single command/script to regenerate all files end-to-end.
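One way to provide that single entry point is a small driver script that fixes the seed before any generation runs; the flag names, defaults, and the `regenerate.py` file name below are hypothetical:

```python
import argparse
import random

def main():
    """Single reproducible entry point, e.g. `python regenerate.py --seed 42`.
    The actual generator calls are placeholders for your own code."""
    p = argparse.ArgumentParser(description="Rebuild all datasets end-to-end")
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--n-users", type=int, default=5000)
    p.add_argument("--n-events", type=int, default=100000)
    args = p.parse_args()
    random.seed(args.seed)  # fix the seed before any data is generated
    # ... call your generators here, writing users/objects/events/relations ...
    return args

if __name__ == "__main__":
    main()
```

Recording the exact command line (including the seed) in the report then makes the whole dataset reconstructible with one command.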

Submit to the BRUTE system:

  • The Hw0.docx file.
  • The datasets folder as an archive — generation scripts and the datasets themselves (e.g., <login>_hw0_datasets.zip).
  • Deadline: Sunday, 12.10.2025, 23:59.

EXAMPLE of the data structure and task descriptions

courses/be4m36ds2/homework/hw0.txt · Last modified: 2025/10/02 17:46 by prokoyul