HW 7 – Optimization and Comparative Analysis

(10 points)

  1. For each NoSQL DBMS, after completing the target analytical queries:
    • Analyze the performance of one of the queries (by execution time, volume of data scanned/transferred, and server load).
      • “Server load” for this HW means CPU% and I/O wait; include network bytes if available, or, when precise counters are unavailable, briefly note observable symptoms (e.g., throttling, latency spikes).
    • Implement one technique to improve performance (for example: creating an index, changing key structures/partitioning, optimizing the storage schema, reducing the volume of returned data, etc.).
    • Compare the result before and after optimization and explain.
    • Briefly describe in the report what you did, why this technique is applicable specifically for your DBMS and your task, and how it affected the result.
  2. Measurement protocol:
    • Dataset: use the Stretch dataset.
    • Warm-up: perform 1 warm-up run (do not count it in statistics) and state whether the following runs are cold or warm.
      • Compare like with like (warm vs warm, cold vs cold).
    • Runs per measurement: run each measurement 10 times.
      • If the query is very fast (<50 ms), use 20 runs.
      • If the query is heavy (>5 s), 5 runs are acceptable; state this explicitly.
      • If time/resources are constrained, you may justify a different number of runs (≥5), but you must explain why.
    • Reported metrics: include median and p95 latency.
    • How to compute p95: sort the N timings in ascending order and take the element at 1-based index ceil(0.95 × N) (for N=10 this is the 10th value; for N=20, the 19th). A worked timing sketch follows this list.
    • DBMS-specific instrumentation:
      • MongoDB: explain("executionStats") on find/aggregate.
      • Cassandra: TRACING ON (capture the query trace; note consistency level if relevant).
      • Redis: measure per-operation time over a series of 100 identical operations; compute p95 over those 100 timings; use SLOWLOG GET 10 if needed.
      • Neo4j: PROFILE (you may use EXPLAIN to inspect the plan without executing).
    • Report “before/after”: time and read volume/plan hits (docs/keys scanned, partitions touched, DB hits).
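
For illustration, a minimal Python timing harness matching this protocol might look like the sketch below. The run_query callable and the run count are placeholders you adapt to your own client code; the p95 index follows the ceil rule above (integer arithmetic avoids float-rounding surprises).

    import statistics
    import time

    def measure(run_query, n_runs=10):
        """Time run_query n_runs times after one uncounted warm-up run."""
        run_query()  # warm-up run, excluded from statistics
        timings_ms = []
        for _ in range(n_runs):
            start = time.perf_counter()
            run_query()
            timings_ms.append((time.perf_counter() - start) * 1000.0)
        timings_ms.sort()
        # p95 = element at 1-based index ceil(0.95 * N):
        # N=10 -> 10th value, N=20 -> 19th value
        p95_index = (95 * n_runs + 99) // 100
        return statistics.median(timings_ms), timings_ms[p95_index - 1]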

Examples (guidance; illustrative code sketches for each DBMS follow this list)

  • MongoDB
    • Add indexes on key fields (e.g., user_id, date, action type).
    • Compare execution time with vs. without the index (record results).
    • Assess the impact of document nesting vs. a flatter schema (multikey index costs).
    • Prefer compound index order: equality → range → sort; use covered queries where possible.
  • Cassandra
    • Review partitioning (how partition/clustering keys affect latency).
    • Compare latency across key designs (the partition key must drive queries).
    • Assess the effect of batch size on inserts (prefer batches that target a single partition).
    • Choose compaction strategy in line with workload (STCS/LCS/TWCS). Use LWT sparingly for uniqueness/conditional writes.
  • Redis
    • Use fit-for-purpose structures (e.g., Sorted Sets for top-N; String/Hash for counters).
    • Compare aggregate times with different key/data-structure choices.
    • Show how TTL reduces memory for temporary keys.
    • When comparing pipelined vs. non-pipelined execution, keep the same set of 100 identical operations; compute p95 on that set.
  • Neo4j
    • Evaluate how graph size and traversal depth impact response time.
    • Use indexes or uniqueness constraints where applicable.
    • Compare a simple query vs. MATCH + WHERE on an indexed property.
    • Capture DB Hits and avoid NodeByLabelScan/CartesianProduct where possible.
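
MongoDB sketch: a minimal before/after comparison with pymongo, assuming a hypothetical events collection with user_id and ts fields; the URI, database, and query values are placeholders to adapt. It records executionTimeMillis and scan counts from explain("executionStats") before and after creating a compound index (equality field first, then the range field).

    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")   # adjust for your setup
    db = client["hw7"]                                  # hypothetical database
    query = {"user_id": 42, "ts": {"$gte": "2025-01-01"}}

    def exec_stats():
        """Run the query under explain("executionStats") and return the stats."""
        return db.command(
            "explain",
            {"find": "events", "filter": query},
            verbosity="executionStats",
        )["executionStats"]

    before = exec_stats()
    # Optimization: compound index ordered equality -> range
    db["events"].create_index([("user_id", ASCENDING), ("ts", ASCENDING)])
    after = exec_stats()

    for label, st in (("before", before), ("after", after)):
        print(label, st["executionTimeMillis"], "ms;",
              st["totalDocsExamined"], "docs /",
              st["totalKeysExamined"], "keys examined")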
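
Cassandra sketch: capturing a query trace with the DataStax Python driver, assuming a hypothetical query-driven table events_by_user with PRIMARY KEY (user_id, ts) so the partition key drives the WHERE clause; trace=True is the driver-side equivalent of TRACING ON.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])             # adjust contact points
    session = cluster.connect("hw7")             # hypothetical keyspace

    # Query-driven design: WHERE restricts the partition key, so the read
    # touches a single partition instead of scanning the cluster.
    rows = session.execute(
        "SELECT ts, action FROM events_by_user WHERE user_id = %s LIMIT 100",
        (42,),
        trace=True,                              # equivalent of TRACING ON
    )
    trace = rows.get_query_trace()
    print("coordinator:", trace.coordinator, "duration:", trace.duration)
    for event in trace.events:                   # per-step timings from the trace
        print(event.source_elapsed, event.description)
    cluster.shutdown()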
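
Redis sketch: per-operation timings for 100 identical operations, then the same 100 operations pipelined, using redis-py; the key names are hypothetical. p95 is computed over the non-pipelined per-op timings; the pipelined run is reported as one total, since individual round trips disappear inside the pipeline.

    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)     # adjust for your setup
    N = 100

    timings_ms = []
    for _ in range(N):
        start = time.perf_counter()
        r.zincrby("top_actions", 1, "login")         # one identical operation
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    print("p95 per op:", timings_ms[(95 * N + 99) // 100 - 1], "ms")

    # The same set of 100 identical operations, sent in one round trip
    start = time.perf_counter()
    pipe = r.pipeline(transaction=False)
    for _ in range(N):
        pipe.zincrby("top_actions", 1, "login")
    pipe.execute()
    print("pipelined total:", (time.perf_counter() - start) * 1000.0, "ms")

    # TTL keeps temporary keys from accumulating in memory
    r.set("session:42", "payload", ex=3600)          # expires after one hour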
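
Neo4j sketch: profiling a query with the official Python driver, assuming hypothetical User nodes with a user_id property; the credentials, label, property, and index names are placeholders. PROFILE executes the query and returns the plan in the result summary; the exact DB-hit key names can vary by server version, so the helper checks both common spellings.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))  # adjust credentials

    def total_db_hits(plan):
        """Sum DB hits over the profiled plan tree from the result summary."""
        hits = plan.get("dbHits", plan.get("args", {}).get("DbHits", 0))
        return hits + sum(total_db_hits(c) for c in plan.get("children", []))

    with driver.session() as session:
        # Optimization: index the filtered property (run once;
        # a newly created index may need a moment to populate)
        session.run("CREATE INDEX user_id_idx IF NOT EXISTS "
                    "FOR (u:User) ON (u.user_id)").consume()

        summary = session.run(
            "PROFILE MATCH (u:User) WHERE u.user_id = $uid RETURN u",
            uid=42,
        ).consume()
        print("total DB hits:", total_db_hits(summary.profile))

    driver.close()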

After executing and optimizing, note any limitations or complexities (e.g., how latency grows with data volume, bottlenecks as the number of users grows) and propose at least one scaling idea.

Comparative Analysis

  • Compare how easily/quickly/conveniently your queries were executed in each DBMS, and which limitations or surprises you encountered.
  • Analyze the optimization techniques applied.
  • For each task, explain why this query/type of analysis is not implemented (or is implemented poorly) in other DBMSs. Example: relationship search in Neo4j — why it is impossible (or inefficient) in Cassandra/Redis.
  • Create a mini-table (for the queries you actually implemented):

| Query | Selected DBMS | Why not others |

Scaling, schema changes, and “what if it doesn’t fit?”

  • Briefly describe, for each implemented query or analytics, what limitations, compromises, or potential problems may arise when scaling, changing the schema, or increasing data volume. If a new, unexpected business query becomes difficult or impossible with the chosen design, state what you would do (e.g., add a query-driven table/materialized view, adjust partition keys, precompute rollups, or introduce a specialized store).

Submission Instructions

  • Submit to the BRUTE system:
    • HW7.docx file.

Deadline

  • Sunday, 21. 12. 2025, by 23:59