(10 points)
For each NoSQL DBMS (MongoDB, Cassandra, Neo4j), after completing the target analytical queries:
Analyze the performance of one of the queries (execution time, volume of data scanned).
Implement one technique to improve performance (for example: creating an index, changing key structures/partitioning, optimizing the storage schema, reducing the volume of returned data, etc.).
Compare the results before and after optimization and explain the difference.
Briefly describe in the report what you did, why the technique is applicable specifically to your DBMS and task, and how it affected the result.
Measurement protocol:
Dataset: use the Stretch dataset.
DBMS-specific instrumentation:
MongoDB: explain("executionStats") on find/aggregate (a pymongo sketch follows this list).
Cassandra: TRACING ON (capture the query trace; note consistency level if relevant).
Neo4j: PROFILE (you may use EXPLAIN to inspect the plan without executing).
Report “before/after”: time and read volume/plan hits (docs/keys scanned, partitions touched, DB hits).
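A minimal pymongo sketch of capturing these numbers via the explain command; the `lab` database, `events` collection, and `user_id` field are placeholders, so adapt them to your setup:

```python
from pymongo import MongoClient

# Placeholder connection and namespace; adapt to your setup
db = MongoClient("mongodb://localhost:27017")["lab"]

explain = db.command({
    "explain": {"find": "events", "filter": {"user_id": 42}},
    "verbosity": "executionStats",
})
stats = explain["executionStats"]

# The numbers to record in the before/after table
print("time (ms):    ", stats["executionTimeMillis"])
print("keys scanned: ", stats["totalKeysExamined"])
print("docs scanned: ", stats["totalDocsExamined"])
```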
Examples (guidance)
MongoDB
Add indexes on key fields (e.g., user_id, date, action type).
Compare execution time with vs. without the index (record results).
Assess the impact of document nesting vs. a flatter schema (multikey index costs).
Prefer compound index order: equality → range → sort; use covered queries where possible.
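A hedged pymongo sketch of the before/after index comparison above; the `events` collection and the `action`/`ts` fields are assumptions, so substitute your own schema:

```python
from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["lab"]["events"]  # placeholders

def exec_stats(filter_, projection=None):
    """Run the query through the explain command; return executionStats."""
    find_cmd = {"find": coll.name, "filter": filter_}
    if projection:
        find_cmd["projection"] = projection
    out = coll.database.command({"explain": find_cmd,
                                 "verbosity": "executionStats"})
    return out["executionStats"]

query = {"action": "click", "ts": {"$gte": "2024-01-01"}}  # equality + range

before = exec_stats(query)  # full collection scan expected without an index

# Compound index in equality -> range (-> sort) order
coll.create_index([("action", ASCENDING), ("ts", ASCENDING)])
after = exec_stats(query)

for label, s in (("before", before), ("after", after)):
    print(label, s["executionTimeMillis"], "ms,",
          s["totalDocsExamined"], "docs,", s["totalKeysExamined"], "keys")

# Covered query: project only indexed fields and drop _id so the index
# alone answers the query (totalDocsExamined can reach 0)
covered = exec_stats(query, projection={"_id": 0, "action": 1, "ts": 1})
print("covered, docs examined:", covered["totalDocsExamined"])
```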
Cassandra
Review partitioning (how partition/clustering keys affect latency).
Compare latency across key designs (the partition key must drive queries).
Assess batch size on inserts (prefer batches to the same partition).
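A sketch using the DataStax Python driver (cassandra-driver); it assumes a keyspace `lab` and a table `events_by_user(user_id int, ts int, action text, PRIMARY KEY (user_id, ts))`, all illustrative names:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType, SimpleStatement

session = Cluster(["127.0.0.1"]).connect("lab")  # placeholder keyspace

# Query driven by the partition key (user_id), per the design goal above;
# record the consistency level alongside the trace
stmt = SimpleStatement(
    "SELECT ts, action FROM events_by_user WHERE user_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)
rows = session.execute(stmt, (42,), trace=True)

trace = rows.get_query_trace()
print("coordinator:", trace.coordinator, "duration:", trace.duration)
for event in trace.events:  # per-step server-side trace, like TRACING ON
    print(event.source_elapsed, event.description)

# Batch inserts to the SAME partition are cheap; a multi-partition batch
# forces the coordinator to fan out and usually adds latency
insert = session.prepare(
    "INSERT INTO events_by_user (user_id, ts, action) VALUES (?, ?, ?)")
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
for i in range(10):
    batch.add(insert, (42, i, "click"))  # all rows share partition key 42
session.execute(batch)
```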
Neo4j
Evaluate how graph size and traversal depth impact response time.
Use indexes or uniqueness constraints where applicable.
Compare a simple query vs. MATCH + WHERE on an indexed property.
Capture DB Hits and avoid NodeByLabelScan/CartesianProduct where possible.
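A sketch with the official neo4j Python driver that creates an index and reads DB hits from the PROFILE plan; the `User` label and `user_id` property are assumptions:

```python
from neo4j import GraphDatabase

# Placeholder URI and credentials
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def total_db_hits(plan):
    """Sum db hits over the profiled plan tree. The key name varies by
    server/driver version: 'dbHits' at the node level or 'DbHits' in args."""
    hits = plan.get("dbHits") or plan.get("args", {}).get("DbHits", 0)
    return hits + sum(total_db_hits(c) for c in plan.get("children", []))

with driver.session() as session:
    # Index (or uniqueness constraint) on the property used in WHERE
    session.run(
        "CREATE INDEX user_id_idx IF NOT EXISTS FOR (u:User) ON (u.user_id)"
    ).consume()

    # PROFILE executes the query and attaches per-operator metrics
    summary = session.run(
        "PROFILE MATCH (u:User) WHERE u.user_id = $uid RETURN u", uid=42
    ).consume()

    plan = summary.profile  # nested dict: operatorType, args, children, ...
    print("root operator:", plan["operatorType"])
    print("total DB hits:", total_db_hits(plan))

driver.close()
```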
If your query is already optimal, prove it with DBMS tools (MongoDB explain("executionStats"), Neo4j PROFILE, Cassandra TRACING ON) and show the plan/read metrics. Then apply a small measurable optimization (e.g., a projection/covered query, a refined index or constraint, fewer touched partitions/columns) and compare before vs. after (time plus read volume/DB hits). If the improvement is tiny, that is fine: explain why the original was already near-optimal and what changed.
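Single runs are noisy, so record the median of several runs for the Median before/after columns of the comparison table below; a DBMS-agnostic sketch using only the standard library:

```python
import statistics
import time

def median_ms(run_query, runs=7, warmup=2):
    """Median wall-clock latency in ms; warm-up runs populate caches first."""
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Usage: pass a zero-argument callable that executes the target query, e.g.
# before = median_ms(lambda: list(coll.find(query)))
# ... apply the optimization ...
# after = median_ms(lambda: list(coll.find(query)))
```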
Comparative Analysis
Compare how easily/quickly/conveniently your queries were executed in each DBMS, and note any limitations or surprises you encountered.
Analyze the optimization techniques applied.
Create a mini-table (for the queries you actually implemented):
| DBMS | Query / task | Optimization technique | Median before (ms) | Median after (ms) | Read volume before | Read volume after |
|------|--------------|------------------------|--------------------|-------------------|--------------------|--------------------|
Scaling, schema changes, and “what if it doesn’t fit?”
Briefly describe, for each implemented query or analytics task, what limitations, compromises, or potential problems may arise when scaling, changing the schema, or increasing data volume. If a new, unexpected business query becomes difficult or impossible with the chosen design, state what you would do (e.g., add a query-driven table; adjust partition keys; precompute rollups; introduce a specialized store).
Submission Instructions
Deadline