HW 7 – Optimization and Comparative Analysis

(10 points)

  1. For each NoSQL DBMS, after completing the target analytical queries:
    • Analyze the performance of one of the queries (by execution time, volume of data scanned/transferred, and server load).
      • “Server load” for this HW means CPU% and I/O wait; include network bytes if available or briefly note observable symptoms (e.g., throttling, spikes) when precise counters are unavailable.
    • Implement one technique to improve performance (for example: creating an index, changing key structures/partitioning, optimizing the storage schema, reducing the volume of returned data, etc.); an illustrative index sketch appears after the measurement protocol below.
    • Compare the results before and after optimization and explain the difference.
    • Briefly describe in the report what you did, why this technique is applicable specifically for your DBMS and your task, and how it affected the result.
  2. Measurement protocol:
    • Dataset: use the Stretch dataset.
    • Warm-up: perform 1 warm-up run (do not count it in statistics) and state whether the following runs are cold or warm.
      • Compare like with like (warm vs warm, cold vs cold).
    • Runs per measurement: run each measurement 10 times.
      • If the query is very fast (<50 ms), use 20 runs.
      • If the query is heavy (>5 s), 5 runs are acceptable — state this explicitly.
      • If time/resources are constrained, you may justify a different number of runs (≥5), but you must explain why.
    • Reported metrics: include median and p95 latency.
    • How to compute p95: sort the N timings ascending and take the element at index ceil(0.95 × N) with 1-based indexing (for N=10 this is the 10th value; for N=20, the 19th). A worked sketch appears after this list.
    • DBMS-specific instrumentation (example sketches follow this list):
      • MongoDB: explain("executionStats") on find/aggregate.
      • Cassandra: TRACING ON (capture the query trace; note consistency level if relevant).
      • Redis: measure per-operation time over a series of 100 identical operations and compute p95 over those 100 timings; use SLOWLOG GET 10 if needed.
      • Neo4j: PROFILE (you may use EXPLAIN to inspect the plan without executing).
    • Report before/after figures: time plus read volume/plan metrics (docs/keys scanned, partitions touched, DB hits).
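
A minimal measurement sketch in Python, assuming a hypothetical run_query() wrapper around the query under test; it performs one warm-up run, times N runs, and computes the median and p95 exactly as defined above:

```python
import math
import statistics
import time

def run_query():
    # Placeholder: substitute the actual call to your DBMS here
    # (a pymongo find, a cassandra-driver execute, a Redis command, etc.).
    ...

def measure(n_runs: int = 10) -> tuple[float, float]:
    run_query()  # warm-up run, excluded from the statistics

    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000)  # milliseconds

    timings.sort()
    # p95 per the protocol above: 1-based index ceil(0.95 * N)
    p95 = timings[math.ceil(0.95 * n_runs) - 1]
    return statistics.median(timings), p95

median_ms, p95_ms = measure(10)
print(f"median = {median_ms:.2f} ms, p95 = {p95_ms:.2f} ms")
```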
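
For MongoDB, a sketch combining explain("executionStats") with one optimization technique (a secondary index); the database, collection, and filter are hypothetical, so substitute your own:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["hw7"]  # hypothetical database name

def docs_examined():
    # Run the explain command with executionStats verbosity.
    stats = db.command(
        "explain",
        {"find": "orders", "filter": {"status": "shipped"}},  # hypothetical query
        verbosity="executionStats",
    )
    return stats["executionStats"]["totalDocsExamined"]

before = docs_examined()  # expect a full collection scan (COLLSCAN)
db.orders.create_index([("status", ASCENDING)])
after = docs_examined()   # expect an index scan (IXSCAN)
print(f"docs examined: before={before}, after={after}")
```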
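
For Cassandra, the Python driver can capture the same trace that TRACING ON shows in cqlsh; the keyspace, table, and consistency level here are hypothetical:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("hw7")  # hypothetical keyspace

stmt = SimpleStatement(
    "SELECT * FROM orders WHERE user_id = %s",  # hypothetical query
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,  # note the CL in the report
)
rows = session.execute(stmt, (42,), trace=True)

trace = rows.get_query_trace()
print("total duration:", trace.duration)
for event in trace.events:
    print(event.source_elapsed, event.description)
```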
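
For Redis, a sketch timing 100 identical operations client-side (the key is hypothetical) and reading the slow log through the driver:

```python
import math
import time

import redis

r = redis.Redis(host="localhost", port=6379)

timings = []
for _ in range(100):
    start = time.perf_counter()
    r.get("hw7:user:42")  # hypothetical key; substitute your operation
    timings.append((time.perf_counter() - start) * 1000)  # milliseconds

timings.sort()
p95 = timings[math.ceil(0.95 * len(timings)) - 1]  # 95th of 100 values
print(f"p95 = {p95:.3f} ms")

# Server-side check for slow commands, if needed (SLOWLOG GET 10)
for entry in r.slowlog_get(10):
    print(entry)
```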
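
For Neo4j, a sketch running a query under PROFILE through the official Python driver; the URI, credentials, and Cypher pattern are hypothetical. Replace PROFILE with EXPLAIN and read summary.plan to inspect the plan without executing:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "password")  # hypothetical credentials
)

with driver.session() as session:
    result = session.run(
        "PROFILE MATCH (u:User)-[:PLACED]->(o:Order) RETURN count(o)"  # hypothetical
    )
    summary = result.consume()
    # summary.profile is a nested dict describing the executed plan,
    # including per-operator rows and DB hits.
    print(summary.profile)

driver.close()
```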

Examples (guidance)

After executing and optimizing each query, note any remaining limitations or complexities (e.g., latency growth with data volume, bottlenecks as the user base grows) and propose at least one scaling idea.

Comparative Analysis

| Query | Selected DBMS | Why not others |
|-------|---------------|----------------|

Scaling, schema changes, and “what if it doesn’t fit?”

For each implemented query or analytical task, describe what limitations, compromises, or potential problems may arise when scaling, changing the schema, or increasing the data volume. What should be done if the chosen storage design makes a new, unexpected business query difficult or impossible?

Submission Instructions

Deadline