HW 7 – Optimization and Comparative Analysis

(10 points)

  1. For each NoSQL DBMS (MongoDB, Cassandra, Neo4j), after completing the target analytical queries:
    • Analyze the performance of one of the queries (by execution time, volume of data scanned).
    • Implement one technique to improve performance (for example: creating an index, changing key structures/partitioning, optimizing the storage schema, reducing the volume of returned data, etc.).
    • Compare the result before and after optimization and explain.
    • Briefly describe in the report what you did, why this technique is applicable specifically for your DBMS and your task, and how it affected the result.
  2. Measurement protocol:
    • Dataset: use the Stretch dataset.
    • DBMS-specific instrumentation:
      • MongoDB: explain("executionStats") on find/aggregate.
      • Cassandra: TRACING ON (capture the query trace; note consistency level if relevant).
      • Neo4j: PROFILE (you may use EXPLAIN to inspect the plan without executing).
    • Report the before/after metrics: execution time and read volume / plan statistics (documents/keys scanned, partitions touched, DB hits). A sketch of one way to capture these metrics programmatically is given right below.
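
If you prefer to script the before/after measurements instead of reading them off the shells, the sketch below shows one possible way to pull comparable numbers from each DBMS with the official Python drivers (pymongo, cassandra-driver, neo4j). All connection strings, database/keyspace, collection/table, label and field names used here (hw7, events, events_by_user, User, DID, Action, user_id) are hypothetical placeholders; substitute your own.

  # Hedged sketch: capturing before/after metrics with the Python drivers.
  from pymongo import MongoClient
  from cassandra.cluster import Cluster
  from neo4j import GraphDatabase

  # --- MongoDB: the explain command with executionStats verbosity ---
  mongo_db = MongoClient("mongodb://localhost:27017")["hw7"]          # placeholder DB
  stats = mongo_db.command(
      "explain",
      {"find": "events", "filter": {"user_id": 42}},                  # query under test
      verbosity="executionStats",
  )["executionStats"]
  print("mongo ms:", stats["executionTimeMillis"],
        "docs examined:", stats["totalDocsExamined"],
        "keys examined:", stats["totalKeysExamined"])

  # --- Cassandra: per-query trace (driver equivalent of TRACING ON) ---
  session = Cluster(["127.0.0.1"]).connect("hw7")                     # placeholder keyspace
  rs = session.execute("SELECT * FROM events_by_user WHERE user_id = 42", trace=True)
  trace = rs.get_query_trace()
  print("cassandra duration:", trace.duration)                        # total server-side time
  for ev in trace.events:                                             # individual trace events
      print("  ", ev.source_elapsed, ev.description)

  # --- Neo4j: PROFILE and the dbHits in the returned plan tree ---
  def total_db_hits(plan):
      # sum dbHits over the whole profiled plan (key name hedged across versions)
      own = plan.get("dbHits", plan.get("args", {}).get("DbHits", 0))
      return own + sum(total_db_hits(c) for c in plan.get("children", []))

  driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
  with driver.session() as s:
      summary = s.run("PROFILE MATCH (u:User {id: 42})-[:DID]->(a:Action) "
                      "RETURN count(a)").consume()
      print("neo4j db hits:", total_db_hits(summary.profile),
            "available after (ms):", summary.result_available_after)

Record the same numbers with and without your optimization to fill in the before/after comparison.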

Examples (guidance; illustrative Python driver sketches follow the list below)

  • MongoDB
    • Add indexes on key fields (e.g., user_id, date, action type).
    • Compare execution time with vs. without the index (record results).
    • Assess impact of document nesting vs. flatter schema (multikey costs).
    • Prefer compound index order: equality → sort → range (the ESR rule); use covered queries where possible.
  • Cassandra
    • Review partitioning (how partition/clustering keys affect latency).
    • Compare latency across key designs (the partition key must drive queries).
    • Assess batch size on inserts (prefer batches to the same partition).
  • Neo4j
    • Evaluate how graph size and traversal depth impact response time.
    • Use indexes or uniqueness constraints where applicable.
    • Compare a simple query vs. MATCH + WHERE on an indexed property.
    • Capture DB Hits and avoid NodeByLabelScan/CartesianProduct where possible.
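
MongoDB sketch (pymongo), illustrating the compound-index and covered-query bullets above. The collection and field names (events, user_id, date) are hypothetical; adapt them to your data.

  from pymongo import MongoClient, ASCENDING, DESCENDING

  db = MongoClient("mongodb://localhost:27017")["hw7"]                # placeholder names
  events = db["events"]

  # Compound index following the ESR rule: equality (user_id) before sort (date).
  events.create_index([("user_id", ASCENDING), ("date", DESCENDING)],
                      name="user_date_idx")

  # Explain the query under test; the projection keeps it covered by the index,
  # so totalDocsExamined should ideally drop to 0.
  explain = db.command(
      "explain",
      {"find": "events",
       "filter": {"user_id": 42},
       "projection": {"_id": 0, "user_id": 1, "date": 1},
       "sort": {"date": -1}},
      verbosity="executionStats",
  )["executionStats"]
  print("ms:", explain["executionTimeMillis"],
        "docs:", explain["totalDocsExamined"],
        "keys:", explain["totalKeysExamined"])
  # For the "before" numbers, drop the index (events.drop_index("user_date_idx"))
  # and rerun the same explain.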
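
Cassandra sketch (DataStax Python driver), illustrating the partition-key and single-partition-batch bullets. The keyspace, table and columns (hw7, events_by_user, user_id, event_ts, action) are hypothetical.

  from datetime import datetime, timezone
  from cassandra.cluster import Cluster
  from cassandra.query import BatchStatement, BatchType

  session = Cluster(["127.0.0.1"]).connect("hw7")                     # placeholder keyspace

  # Query-driven table: the partition key (user_id) matches what the target
  # query filters on, so a read touches exactly one partition.
  session.execute("""
      CREATE TABLE IF NOT EXISTS events_by_user (
          user_id  int,
          event_ts timestamp,
          action   text,
          PRIMARY KEY ((user_id), event_ts)
      ) WITH CLUSTERING ORDER BY (event_ts DESC)
  """)

  # Single-partition batch: every statement shares the same partition key, so the
  # batch stays on one replica set (large multi-partition batches are an anti-pattern).
  insert = session.prepare(
      "INSERT INTO events_by_user (user_id, event_ts, action) VALUES (?, ?, ?)")
  batch = BatchStatement(batch_type=BatchType.UNLOGGED)
  for action in ("login", "search", "purchase"):
      batch.add(insert, (42, datetime.now(timezone.utc), action))
  session.execute(batch)

  # Read driven by the partition key, traced for the before/after comparison.
  rs = session.execute("SELECT * FROM events_by_user WHERE user_id = 42", trace=True)
  print("duration:", rs.get_query_trace().duration)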
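
Neo4j sketch (Python driver), illustrating the index/constraint and PROFILE-comparison bullets. The labels, relationship types and properties (User, id, DID, Action) are hypothetical placeholders.

  from neo4j import GraphDatabase

  driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
  QUERY = "PROFILE MATCH (u:User {id: $id})-[:DID]->(a:Action) RETURN count(a)"

  def total_db_hits(plan):
      # sum dbHits over the profiled plan tree (key name hedged across versions)
      own = plan.get("dbHits", plan.get("args", {}).get("DbHits", 0))
      return own + sum(total_db_hits(c) for c in plan.get("children", []))

  with driver.session() as s:
      # "Before": without an index the plan starts from NodeByLabelScan + Filter.
      before = s.run(QUERY, id=42).consume().profile

      # A uniqueness constraint also creates a backing index on :User(id).
      s.run("CREATE CONSTRAINT user_id_unique IF NOT EXISTS "
            "FOR (u:User) REQUIRE u.id IS UNIQUE").consume()
      s.run("CALL db.awaitIndexes()").consume()        # wait until the index is online

      # "After": the same query should now start from a NodeIndexSeek.
      after = s.run(QUERY, id=42).consume().profile

  print("db hits before:", total_db_hits(before), "after:", total_db_hits(after))
  driver.close()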

Comparative Analysis

  • Compare how easily, quickly, and conveniently your queries could be expressed and executed in each DBMS, and note any limitations or surprises you encountered.
  • Analyze the optimization techniques you applied and how effective they were.
  • For each task, explain why this query or type of analysis is not implemented (or is implemented poorly) in the other DBMSs. Example: relationship search in Neo4j, and why it is impossible (or inefficient) in Cassandra/Redis.
  • Create a mini-table (for the queries you actually implemented):

| DBMS | Query / task | Optimization technique | Median time before (ms) | Median time after (ms) | Read volume before | Read volume after |

Scaling, schema changes, and “what if it doesn’t fit?”

  • Briefly describe, for each implemented query or analytical task, what limitations, compromises, or potential problems may arise when scaling, changing the schema, or increasing the data volume. If a new, unexpected business query would be difficult or impossible with the chosen design, state what you would do (e.g., add a query-driven table, adjust partition keys, precompute rollups, or introduce a specialized store).

Submission Instructions

  • Submit to the BRUTE system:
    • HW7.docx file.

Deadline

  • Sunday, 28. 12. 2025, by 23:59