====== B4M36DS2, BE4M36DS2: Database Systems 2 ======


===== Basic Information =====

  * Annotations: [[https://www.fel.cvut.cz/cz/education/bk/predmety/47/02/p4702006.html|B4M36DS2]], [[https://www.fel.cvut.cz/en/education/bk/predmety/48/78/p4878406.html|BE4M36DS2]] (English)
  * Lecturer and tutor: **Yuliia Prokop <prokoyul@fel.cvut.cz>**
  * Schedule: [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B251/public/html/predmety/47/02/p4702006.html|B4M36DS2]], [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B251/public/html/predmety/48/78/p4878406.html|BE4M36DS2]]
  * Upload system: [[https://cw.felk.cvut.cz/upload/|BRUTE]]
  * VPN: [[https://svti.fel.cvut.cz/cz/services/vpn.html|FEL VPN]], [[https://ist.cvut.cz/nase-sluzby/pripojeni-z-domova-vpn|CTU VPN]]

===== Course Schedule and Materials =====

^ # ^ date ^ Lecture ^ Practical class ^ HW deadline ^
| 1  | 22.09.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-01-introduction2025.pdf |Introduction: Big Data, NoSQL Databases}} | {{ :courses:b4m36ds2:b4m36ds2-lab-01-multimodel2025.pdf |Multi-Model Data & JSONB in PostgreSQL}} |  |
| 2  | 29.09.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-02-sharding2025.pdf |Basic principles: Scaling, Sharding, Replication}}  {{ :courses:b4m36ds2:b4m36ds2-lecture-02-data_formats.pdf |Data Formats: XML, JSON, BSON, RDF}} | {{ :courses:b4m36ds2:b4m36ds2-lab-02-dataformats2025_.pdf|Data Formats: XML, JSON, BSON, Parquet}}  {{ :courses:b4m36ds2:ex2_csv_json_xml.txt| Exercise2 solutions}}| HW0 12.10.2025 |
| 3  | 06.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-03-cap2025.pdf |CAP Theorem, Consistency}} | {{ :courses:b4m36ds2:b4m36ds2-lab-03-principles2025_solution.pdf |Basic principles: Scaling, Sharding, Replication}} [[https://drive.google.com/file/d/1Ua1XtFTA7L-XlHxHCccVwhA43wWyihtY/view | Dataset ]] | BA#1 21.12.2025 |
| 4  | 13.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture4-mapreduce2025.pdf |Apache Hadoop: MapReduce, HDFS}} | {{ :courses:b4m36ds2:b4m36ds2-lab-04-pyspark2025.pdf |Apache Spark}} {{ :courses:b4m36ds2:dataset4.zip |Dataset}} {{ :courses:b4m36ds2:pyspark_-_colab.pdf |PySpark}} {{ :courses:b4m36ds2:b4m36ds2-lab-05-mapreduce.pdf |MapReduce, HDFS}} |  |
| 5  | 20.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-05-types_of_nosql2025.pdf |Types of NoSQL databases}} | {{ :courses:b4m36ds2:b4m36ds2-lab-05-types2025.pdf |Types of NoSQL databases}} | HW1 26.10.2025 |
| 6  | 27.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-06_-_redis2025.pdf |Key-value stores: Redis}} | {{ :courses:b4m36ds2:b4m36ds2-lab-05_new-redis2025.pdf |Redis}} | HW2 9.11.2025, HW4 23.11.2025 |
| 7  | 03.11.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-7-2025-mongodb.pdf |Document Databases: MongoDB. CRUD operations}}  | {{ :courses:b4m36ds2:b4m36ds2-lab-07-mongodb2025.pdf |MongoDB. CRUD operations}} {{ :courses:b4m36ds2:data_solution.zip |Dataset+Solution}}|  |
| 8  | 10.11.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-8-mongodb2025.pdf |Document Databases: MongoDB. Aggregation}} | {{ :courses:b4m36ds2:b4m36ds2-lab-08-mongodb2025.pdf |MongoDB. Aggregation}} {{ :courses:b4m36ds2:dataset8_solutions.zip |Dataset+Solution}} | HW3, HW4 <del>23.11.2025</del> 30.11.2025 |
| 9  | 17.11.2025 | State holiday |  |  |
| 10 | 24.11.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-09-cassandra2025.pdf |Wide Column Stores: Cassandra}} | {{ :courses:be4m36ds2:b4m36ds2-lab-09-cassandra.pdf |Cassandra}} {{ :courses:be4m36ds2:solutions-9.txt |Solutions}}| HW5 7.12.2025 |
| 11 | 01.12.2025 | {{:courses:b4m36ds2:b4m36ds2-lecture-10_1-neo4j2025.pdf|Graph Databases: Traversal Framework}} {{:courses:b4m36ds2:b4m36ds2-lecture-10_2-cypher2025.pdf|Graph Databases: Neo4j.Cypher}} | **<fc #ff0000>Test (Redis, MongoDB)</fc>** {{:courses:b4m36ds2:b4m36ds2-lab-10-neo4j2025.pdf|Neo4j.Cypher}} {{ :courses:be4m36ds2:data_solution10.zip |Data + Solutions}}| HW6 14.12.2025 |
| 12 | 08.12.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-11-advanced2025.pdf |Advanced Aspects: Graph Databases, Performance Tuning}} | {{ :courses:b4m36ds2:b4m36ds2-lab-11-python-neo4j.pdf |Neo4j+Python}} {{ :courses:be4m36ds2:b4m36ds2-lab-12-neo4j-new.pdf |Neo4j+Java}} {{ :courses:b4m36ds2:examples-11.zip |Examples}}|  |
| 13 | 15.12.2025 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-optimization.pdf |Transactions. Query optimization }} | {{ :courses:b4m36ds2:b4m36ds2-lab-12-optimization.pdf |Query optimization}} {{ :courses:b4m36ds2:data_solution12.zip |Data & Solution}}|  HW7 28.12.2025, HW8 4.1.2026 |
| 14 | 05.01.2025 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-sparql.pdf |RDF Stores. SPARQL}} | Demonstration of the project |  |

The course uses [[https://www.ksi.mff.cuni.cz/~svoboda/courses/211-B4M36DS2/| materials of Martin Svoboda]], the former lecturer of this course.
===== Tools for Practical Classes and Homework =====

JSON: [[https://codebeautify.org/jsonvalidator|JSON Validator]]

Redis online editor: [[https://onecompiler.com/redis |Redis OneCompiler]]

MongoDB: [[https://mongoplayground.net|Mongo Playground]] [[https://onecompiler.com/mongodb |MongoDB OneCompiler]]

Neo4j: [[http://sandbox.neo4j.com/|Neo4j Sandbox]] [[https://neo4j.com/product/auradb/ | Neo4j AuraDB]]

SPARQL: [[https://nosql.opendata.cz/sparql|SPARQL Query Editor]]

PySpark: [[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html|PySpark functions]]

===== Formal Requirements =====

  * **Attendance** during lectures and practical classes is **recommended** but not compulsory

  * **Homework**
    * Altogether, **8 mandatory individual homework assignments** will be given during the semester (HW0—HW7). They are parts of the same project
    * There are also two **optional homework assignments** that can be submitted for extra points  
    * You can organize into **teams** of up to 3 students to work on the project
    * If a team carries out the project, the number of queries must be increased (see description of the project)
    * Solutions must be submitted via a script executed on the NoSQL server. The text report must be submitted to **BRUTE**
    * Solutions can be submitted repeatedly (before the deadline); only **the latest version is assessed**
    * Once the lecturer has assessed a given assignment, it cannot be resubmitted
    * A **delay** of one whole day **is penalized** by 0.4 points; shorter delays are penalized proportionally. A delay of two days is penalized by 0.8 points, etc., up to 2 points
    * All homework assignments must be submitted before the intended exam date to be considered

  * **Exam**
    * Only students who have already acquired course credit can sign up for the final exam
    * The **final exam** consists of a compulsory written test and an optional oral examination
    * Based on the result, everyone can voluntarily undergo an **oral examination**. The only condition is to have at least 50 total points
    * In such a case, the final score is further adjusted by **up to minus 10 to plus 5 points**
    * The examiner can also request an oral examination in case of uncertainties in the test
===== Assessment =====
The grade is calculated as the total points for homework, exam, and bonus points. 

^ Assignment       ^ Maximum ^ Minimum ^
| Homework          |  42  |  25  |
| Optional HW       |  18  |  0  |
| Tests             |  10  |  5  |
| Practice          |  2   |  0  |
| Exam              |  30  |  20  |
^ Total            ^  102  ^  50  ^

Final grade: **90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E**
===== Exam Details =====
The exam will be conducted on computers in the classroom (KN E-328) using Moodle (https://moodle.fel.cvut.cz/course/view.php?id=9765)

The pre-term exam (8. 1. 2026) will be held in the KN E-311 classroom.

During the exam: Internet access will be disabled on the exam computers. Students will have access only to:
  * Moodle (where all answers must be submitted), and
  * a server environment where they can test query execution.

Time limit is 90 minutes.

The exam evaluates:
  * understanding of the theoretical concepts covered in the course (NoSQL principles, distribution/consistency, data formats, modeling trade-offs, basic performance intuition);
  * ability to choose an appropriate NoSQL system for a given workload and justify the choice;
  * ability to design/outline a data model and reason about access patterns;
  * ability to analyze given queries/solutions (find mistakes, predict results, propose fixes).

The exam consists of three task types:
  - **Theory quizzes**
    * Single-choice, multiple-choice, and short-answer questions.
  - **Theoretical case tasks** (modeling / system selection)
    * Given requirements and constraints, students propose a database type and outline a data model (structures, keys, main entities/relations).
    * Full syntactic accuracy is not required; queries may be written schematically.
    * Some tasks may ask to sketch one structure (e.g., table/collection) and provide one example insert.
  - **Practical task** (MongoDB)
    * A MongoDB scenario requiring a full workflow: design the data structure, insert the provided data, and write queries against the created database.

Allowed materials:
  * Students may bring printed documentation with command syntax.

Additional notes:
  * Students may add handwritten sketches/comments if helpful
  * Exam results will be posted on both Moodle and BRUTE within 1 day

After receiving results, students may:
  * Request an oral examination (which can adjust their written exam score by ±5 points)
  * Register to retake the exam
===== Exam Requirements =====

==== NoSQL Introduction ====

  * **Big Data and NoSQL** terms, **V characteristics** (volume, variety, velocity, veracity, value), **current trends** and challenges (Big Data, Big Users, processing paradigms, ...), principles of **relational databases** (functional dependencies, normal forms, transactions, ACID properties); **types of NoSQL systems** (key-value, wide column, document, graph, ...), their data models, features and use cases; **common features** of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, tunable consistency, ...); “query-driven” / access-pattern-driven data modeling; system selection by workload requirements.

==== Data Formats ====

  * **XML**: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML
  * **JSON**: constructs (object, array, value), types of values (strings, numbers, ...); missing vs null fields and their impact on queries; **BSON**: document structure (elements, type selectors, property names and values)
  * **RDF**: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); **N-Triples notation** (RDF file, statements, triple components, literals, IRI references); **Turtle notation** (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals)
  * **CSV**: constructs (document, header, record, field)

==== MapReduce ====

  * **Programming models**, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism)
  * **MapReduce**: programming model (data parallelism, map and reduce functions), **cluster architecture** (master, workers, message passing, data distribution), **map and reduce functions** (input arguments, emission and reduction of intermediate key-value pairs, final output), **data flow phases** (mapping, shuffling, reducing), input parsing (input file, split, record), **execution steps** (parsing, mapping, partitioning, combining, merging, reducing), **combine function** (commutativity, associativity), additional functions (input reader, partition, compare, output writer), **implementation details** (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...)
  * Apache **Hadoop**: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); **HDFS**: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); **MapReduce**: execution in YARN, job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema

==== NoSQL Principles ====

  * **Scaling**: scalability definition; **vertical scaling** (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); **horizontal scaling** (scaling out/in), pros and cons, **network fallacies** (reliability, latency, bandwidth, security, ...), **cluster** architecture; design questions (scalability, availability, consistency, latency, durability, resilience)
  * **Distribution** models: **sharding**: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, hotspots and rebalancing, ...); **replication**: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of **sharding and replication**
  * **CAP** theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, **ACID properties** (atomicity, consistency, isolation, durability), **BASE properties** (basically available, soft state, eventual consistency)
  * **Consistency**: strong vs. eventual consistency; **write consistency** (write-write conflict, context, pessimistic and optimistic strategies), **read consistency** (read-write conflict, context, inconsistency window, session consistency), **read and write quora** (formulae, motivation, workload balancing); **idempotency** for retries (avoiding duplicates)

==== Key-Value Stores ====

  * Data model (key-value pairs), **key management** (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, TTL, ...
  * **Redis**: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), **string** commands (SET, GET, INCR, DEL, ...), **list** commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, ...), **set** commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), **sorted set** commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ZUNIONSTORE, ...), **hash** commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), **general** commands (EXISTS, DEL, RENAME, ...), **time-to-live** commands (EXPIRE, TTL, PERSIST); performance/optimization: avoiding full scans (e.g., keyspace-wide operations), choosing structure by query pattern

==== Wide Column Stores ====

  * Data model (column families, rows, columns), query patterns, use cases, representatives, query-driven modeling: partition key + clustering columns chosen from query patterns
  * **Cassandra**: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp; tombstones as a performance concern (conceptual)); **CQL** language: DDL statements: **CREATE KEYSPACE** (replication options), DROP KEYSPACE, USE keyspace, **CREATE TABLE** (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: **SELECT** statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation, anti-patterns: ALLOW FILTERING, unbounded partitions), **INSERT** statements (update parameters: TTL, TIMESTAMP), **UPDATE** statements (assignments; modification of collections: additions, removals), **DELETE** statements (deletion of rows, removal of columns, removal of items from collections)

==== Document Stores ====

  * Data model (documents), query patterns, use cases, representatives, aggregates; embedding vs referencing decided by query patterns and update cost
  * **MongoDB**: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); **insert** operation (management of identifiers); **update** operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, ...; array: push, pop, ...); **save** operation (insert vs. replace mode); **remove** operation (justOne option); **find** operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, ...; element: exists; evaluation: regex, ...; logical: and, or, not; array: all, elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), **modifiers** (sort, skip, limit); Aggregation framework (pipeline stages; typical optimization: $match early); primary and secondary **index structures** (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL; compound index field order for filter+sort; using explain plans (conceptual))

==== Graph Databases ====

  * Data model (property graphs), use cases, representatives, multi-hop traversals; relationship-rich domains; when graph is a better fit than joins/embedding
  * **Neo4j**: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; **traversal framework**: traversal description, order, expanders, uniqueness, evaluators; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method; traverser; **Cypher** language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); **path patterns**, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); **MATCH** clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); **RETURN** clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); **WITH** clause (motivation, subclauses); **write clauses**: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions); query optimization (EXPLAIN/PROFILE; start from selective nodes via indexes/constraints; avoid unconstrained variable-length expansions)

==== Advanced Aspects ====

  * **Graph databases**: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data **structures** (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph **traversals**, **data locality** (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), **graph partitioning** (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing
  * **Performance tuning**: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model; reading query plans (EXPLAIN/PROFILE) as a practical optimization tool
  * **Polyglot persistence** (selecting the right system for a task; trade-offs; combining systems)


==== RDF Stores ====

  * **Linked Data**: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud
  * **SPARQL**: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), **graph patterns** (basic, group, optional, alternative, graph, minus); **prologue declarations** (BASE, PREFIX clauses), **SELECT** queries (SELECT, FROM, and WHERE clauses), query **dataset** (default graph, named graphs), **variable assignments** (BIND), **FILTER** constraints (comparisons, logical connectives, accessors, tests, ...), **solution modifiers** (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), **query forms** (SELECT, ASK, DESCRIBE, CONSTRUCT)

===== Recommended Literature =====

  * Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: **Big Data a NoSQL databáze**.\\
ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).\\
Grada Publishing, a.s., 2015.
  * Sadalage, Pramod J. - Fowler, Martin: [[http://martinfowler.com/books/nosql.html|NoSQL Distilled]].\\
ISBN: 978-0-321-82662-6.\\
Pearson Education, Inc., 2013.
  * Wiese, Lena: *[[https://www.degruyterbrill.com/document/doi/10.1515/9783110441413/html|Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases]].\\
ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).\\
DOI: [[http://doi.org/10.1515/9783110441413|10.1515/9783110441413]].\\
Walter de Gruyter GmbH, 2015.
  * Zomaya, Albert Y. - Sakr, Sherif: [[http://www.springer.com/gp/book/9783319493398|Handbook of Big Data Technologies]].\\
ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).\\
DOI: [[http://doi.org/10.1007/978-3-319-49340-4|10.1007/978-3-319-49340-4]].\\
Springer International Publishing AG, 2017.