====== B4M36DS2, BE4M36DS2: Database Systems 2 ====== ===== Basic Information ===== * Annotations: [[https://www.fel.cvut.cz/cz/education/bk/predmety/47/02/p4702006.html|B4M36DS2]], [[https://www.fel.cvut.cz/en/education/bk/predmety/48/78/p4878406.html|BE4M36DS2]] (English) * Lecturer and tutor: **Yuliia Prokop ** * Schedule: [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B251/public/html/predmety/47/02/p4702006.html|B4M36DS2]], [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B251/public/html/predmety/48/78/p4878406.html|BE4M36DS2]] * Upload system: [[https://cw.felk.cvut.cz/upload/|BRUTE]] * VPN: [[https://svti.fel.cvut.cz/cz/services/vpn.html|FEL VPN]], [[https://ist.cvut.cz/nase-sluzby/pripojeni-z-domova-vpn|CTU VPN]] ===== Course Schedule and Materials ===== ^ # ^ date ^ Lecture ^ Practical class ^ HW deadline ^ | 1 | 22.09.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-01-introduction2025.pdf |Introduction: Big Data, NoSQL Databases}} | {{ :courses:b4m36ds2:b4m36ds2-lab-01-multimodel2025.pdf |Multi-Model Data & JSONB in PostgreSQL}} | | | 2 | 29.09.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-02-sharding2025.pdf |Basic principles: Scaling, Sharding, Replication}} {{ :courses:b4m36ds2:b4m36ds2-lecture-02-data_formats.pdf |Data Formats: XML, JSON, BSON, RDF}} | {{ :courses:b4m36ds2:b4m36ds2-lab-02-dataformats2025.pdf|Data Formats: XML, JSON, BSON, Parquet}} {{ :courses:b4m36ds2:ex2_csv_json_xml.txt| Exercise2 solutions}}| HW0 12.10.2025 | | 3 | 06.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-03-cap2025.pdf |CAP Theorem, Consistency}} | {{ :courses:b4m36ds2:b4m36ds2-lab-03-principles2025_solution.pdf |Basic principles: Scaling, Sharding, Replication}} [[https://drive.google.com/file/d/1Ua1XtFTA7L-XlHxHCccVwhA43wWyihtY/view | Dataset ]] | BA#1 21.12.2025 | | 4 | 13.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture4-mapreduce2025.pdf |Apache Hadoop: MapReduce, HDFS}} | {{ :courses:b4m36ds2:b4m36ds2-lab-04-pyspark2025.pdf |Apache Spark}} {{ :courses:b4m36ds2:dataset4.zip |Dataset}}| | | 5 | 20.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-05-types_of_nosql2025.pdf |Types of NoSQL databases}} | {{ :courses:b4m36ds2: |Types of NoSQL databases}} | HW1 26.10.2025 | | 6 | 27.10.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-06_-_redis2025.pdf |Key-value stores: Redis}} | {{ :courses:b4m36ds2:b4m36ds2-lab-05-redis2025.pdf |Redis}} | HW2 9.11.2025, HW4 23.11.2025 | | 7 | 03.11.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-7-mongodb2025.pdf |Document Databases: MongoDB. CRUD operations}} | {{ :courses:b4m36ds2:b4m36ds2-lab-07-mongodb2025.pdf |MongoDB. CRUD operations}} {{ :courses:b4m36ds2:dataset7.zip |Dataset}}| | | 8 | 10.11.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-8-mongodb2025.pdf |Document Databases: MongoDB. Aggregation}} | {{ :courses:b4m36ds2:b4m36ds2-lab-08-mongodb2025.pdf |MongoDB. Aggregation}} | HW3, HW4 23.11.2025 | | 9 | 17.11.2025 | State holiday | | | | 10 | 24.11.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-09-cassandra2025.pdf |Wide Column Stores: Cassandra}} | {{ :courses:be4m36ds2:b4m36ds2-lab-07-cassandra.pdf |Cassandra}} {{ :courses:be4m36ds2:solutions-9.txt |Solutions}}| HW5 7.12.2025 | | 11 | 01.12.2025 | {{ :courses:b4m36ds2:b4m36ds2-lecture-10_1-neo4j2025.pdf |Graph Databases: Traversal Framework}} {{ :courses:b4m36ds2:b4m36ds2-lecture-10_1-neo4j2025.pdf |Graph Databases: Neo4j: Cypher}}| {{ :courses:b4m36ds2:b4m36ds2-lecture-10_2-cypher2025.pdf |Neo4j. Cypher}} {{ :courses:b4m36ds2:data_solution.zip |Data and Solutions}}| HW6 14.12.2025 | | 12 | 08.12.2025 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-advanced.pdf |Advanced Aspects: Graph Databases, Performance Tuning}} | {{ :courses:b4m36ds2:b4m36ds2-lab-11-python-neo4j.pdf |Neo4j+Python}} {{ :courses:be4m36ds2:b4m36ds2-lab-12-neo4j-new.pdf |Neo4j+Java}} {{ :courses:b4m36ds2:examples-11.zip |Examples}}| | | 13 | 15.12.2025 | {{ :courses:be4m36ds2: |Transactions. Query optimization }} | {{ :courses:be4m36ds2: |Query optimization}} | HW7 21.12.2025, HW8 4.1.2026 | | 14 | 05.01.2025 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-sparql.pdf |RDF Stores. SPARQL}} | Demonstration of the project | | The course uses [[https://www.ksi.mff.cuni.cz/~svoboda/courses/211-B4M36DS2/| materials of Martin Svoboda]], the former lecturer of this course. ===== Tools for Practical Classes and Homework ===== JSON: [[https://codebeautify.org/jsonvalidator|JSON Validator]] Redis online editor: [[https://onecompiler.com/redis]] MongoDB: [[https://mongoplayground.net|Mongo Playground]] Neo4j: [[https://console.neo4j.org|Neo4j Console]] SPARQL: [[https://nosql.opendata.cz/sparql|SPARQL Query Editor]] ===== Formal Requirements ===== * **Attendance** during lectures and practical classes is **recommended** but not compulsory * **Homework** * Altogether, **8 mandatory individual homework assignments** will be given during the semester (HW0—HW7). They are parts of the same project * There are also two **optional homework assignments** that can be submitted for extra points * You can organize into **teams** of up to 3 students to work on the project * If a team carries out the project, the number of queries must be increased (see description of the project) * Solutions must be submitted via a script executed on the NoSQL server. The text report must be submitted to **BRUTE** * Solutions can be submitted repeatedly (before the deadline); only **the latest version is assessed** * Once the lecturer has assessed a given assignment, it cannot be resubmitted * A **delay** of one whole day **is penalized** by 0.4 points; shorter delays are penalized proportionally. A delay of two days is penalized by 0.8 points, etc., up to 2 points * All homework assignments must be submitted before the intended exam date to be considered * **Exam** * Only students who have already acquired course credit can sign up for the final exam * The **final exam** consists of a compulsory written test and an optional oral examination * Based on the result, everyone can voluntarily undergo an **oral examination**. The only condition is to have at least 50 total points * In such a case, the final score is further adjusted by **up to minus 10 to plus 5 points** * The examiner can also request an oral examination in case of uncertainties in the test ===== Assessment ===== The grade is calculated as the total points for homework, exam, and bonus points. ^ Assignment ^ Maximum ^ Minimum ^ | Homework | 42 | 25 | | Optional HW | 18 | 0 | | Tests | 10 | 5 | | Practice | 2 | 0 | | Exam | 30 | 20 | ^ Total ^ 102 ^ 50 ^ Final grade: **90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E** ===== Exam Details ===== The exam will be conducted on computers in the classroom (KN E-328) using Moodle (https://moodle.fel.cvut.cz/course/view.php?id=8987) : * all theoretical concepts studied throughout the course; * work with four types of NoSQL systems: Redis, MongoDB, Cassandra, and Neo4j; * time limit is 60 minutes. The exam includes: * Questions with either a single correct answer (single-choice) or multiple correct answers (multiple-choice); * Open-ended questions. Additional notes: * Students may supplement their answers with handwritten illustrations or comments * Exam results will be posted on both Moodle and BRUTE within 1 day After receiving results, students may: * Request an oral examination (which can adjust their written exam score by ±5 points) * Register to retake the exam ===== Exam Requirements ===== ==== NoSQL Introduction ==== * **Big Data and NoSQL** terms, **V characteristics** (volume, variety, velocity, veracity, value), **current trends** and challenges (Big Data, Big Users, processing paradigms, ...), principles of **relational databases** (functional dependencies, normal forms, transactions, ACID properties); **types of NoSQL systems** (key-value, wide column, document, graph, ...), their data models, features and use cases; **common features** of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, ...) ==== Data Formats ==== * **XML**: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML * **JSON**: constructs (object, array, value), types of values (strings, numbers, ...); **BSON**: document structure (elements, type selectors, property names and values) * **RDF**: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); **N-Triples notation** (RDF file, statements, triple components, literals, IRI references); **Turtle notation** (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals) * **CSV**: constructs (document, header, record, field) ==== MapReduce ==== * **Programming models**, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism) * **MapReduce**: programming model (data parallelism, map and reduce functions), **cluster architecture** (master, workers, message passing, data distribution), **map and reduce functions** (input arguments, emission and reduction of intermediate key-value pairs, final output), **data flow phases** (mapping, shuffling, reducing), input parsing (input file, split, record), **execution steps** (parsing, mapping, partitioning, combining, merging, reducing), **combine function** (commutativity, associativity), additional functions (input reader, partition, compare, output writer), **implementation details** (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...) * Apache **Hadoop**: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); **HDFS**: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); **MapReduce**: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema ==== NoSQL Principles ==== * **Scaling**: scalability definition; **vertical scaling** (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); **horizontal scaling** (scaling out/in), pros and cons, **network fallacies** (reliability, latency, bandwidth, security, ...), **cluster** architecture; design questions (scalability, availability, consistency, latency, durability, resilience) * **Distribution** models: **sharding**: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, ...); **replication**: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of **sharding and replication** * **CAP** theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, **ACID properties** (atomicity, consistency, isolation, durability), **BASE properties** (basically available, soft state, eventual consistency) * **Consistency**: strong vs. eventual consistency; **write consistency** (write-write conflict, context, pessimistic and optimistic strategies), **read consistency** (read-write conflict, context, inconsistency window, session consistency), **read and write quora** (formulae, motivation, workload balancing) ==== Key-Value Stores ==== * Data model (key-value pairs), **key management** (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, ...) * **Redis**: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), **string** commands (SET, GET, APPEND, SETRANGE, INCR, DEL, ...), **list** commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, ...), **set** commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), **sorted set** commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ...), **hash** commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), **general** commands (EXISTS, KEYS, DEL, RENAME, ...), **time-to-live** commands (EXPIRE, TTL, PERSIST) ==== Wide Column Stores ==== * Data model (column families, rows, columns), query patterns, use cases, representatives * **Cassandra**: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); **CQL** language: DDL statements: **CREATE KEYSPACE** (replication options), DROP KEYSPACE, USE keyspace, **CREATE TABLE** (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: **SELECT** statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), **INSERT** statements (update parameters: TTL, TIMESTAMP), **UPDATE** statements (assignments; modification of collections: additions, removals), **DELETE** statements (deletion of rows, removal of columns, removal of items from collections) ==== Document Stores ==== * Data model (documents), query patterns, use cases, representatives * **MongoDB**: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); **insert** operation (management of identifiers); **update** operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, ...; array: push, pop, ...); **save** operation (insert vs. replace mode); **remove** operation (justOne option); **find** operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, ...; element: exists; evaluation: regex, ...; logical: and, or, not; array: all, elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), **modifiers** (sort, skip, limit); **MapReduce** (map function, reduce function, options: query, sort, limit, out); primary and secondary **index structures** (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL) ==== Graph Databases ==== * Data model (property graphs), use cases, representatives * **Neo4j**: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; **traversal framework**: traversal description, **order** (breadth-first, depth-first, branch ordering policies), **expanders** (relationship types, directions), **uniqueness** (NODE_GLOBAL, RELATIONSHIP_GLOBAL, ...), **evaluators** (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); **Cypher** language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); **path patterns**, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); **MATCH** clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); **RETURN** clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); **WITH** clause (motivation, subclauses); **write clauses**: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions) ==== Advanced Aspects ==== * **Graph databases**: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data **structures** (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph **traversals**, **data locality** (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), **graph partitioning** (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing * **Performance tuning**: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model * **Polyglot persistence** ==== RDF Stores ==== * **Linked Data**: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud * **SPARQL**: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), **graph patterns** (basic, group, optional, alternative, graph, minus); **prologue declarations** (BASE, PREFIX clauses), **SELECT** queries (SELECT, FROM, and WHERE clauses), query **dataset** (default graph, named graphs), **variable assignments** (BIND), **FILTER** constraints (comparisons, logical connectives, accessors, tests, ...), **solution modifiers** (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), **query forms** (SELECT, ASK, DESCRIBE, CONSTRUCT) ===== Recommended Literature ===== * Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: **Big Data a NoSQL databáze**.\\ ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).\\ Grada Publishing, a.s., 2015. * Sadalage, Pramod J. - Fowler, Martin: [[http://martinfowler.com/books/nosql.html|NoSQL Distilled]].\\ ISBN: 978-0-321-82662-6.\\ Pearson Education, Inc., 2013. * Wiese, Lena: *[[https://www.degruyterbrill.com/document/doi/10.1515/9783110441413/html|Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases]].\\ ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).\\ DOI: [[http://doi.org/10.1515/9783110441413|10.1515/9783110441413]].\\ Walter de Gruyter GmbH, 2015. * Zomaya, Albert Y. - Sakr, Sherif: [[http://www.springer.com/gp/book/9783319493398|Handbook of Big Data Technologies]].\\ ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).\\ DOI: [[http://doi.org/10.1007/978-3-319-49340-4|10.1007/978-3-319-49340-4]].\\ Springer International Publishing AG, 2017.