====== B4M36DS2, BE4M36DS2: Database Systems 2 ====== ===== Basic Information ===== * Annotations: [[https://www.fel.cvut.cz/cz/education/bk/predmety/47/02/p4702006.html|B4M36DS2]], [[https://www.fel.cvut.cz/en/education/bk/predmety/48/78/p4878406.html|BE4M36DS2]] (English) * Lecturer and tutor: **Yuliia Prokop ** * Schedule: [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B241/public/html/predmety/47/02/p4702006.html|B4M36DS2]], [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B241/public/html/predmety/48/78/p4878406.html|BE4M36DS2]] * Upload system: [[https://cw.felk.cvut.cz/upload/|BRUTE]] * VPN: [[https://svti.fel.cvut.cz/cz/services/vpn.html|FEL VPN]], [[https://ist.cvut.cz/nase-sluzby/pripojeni-z-domova-vpn|CTU VPN]] * **Exams:** * Tuesday, 14. 1. 2025, 10:00-13:00, KN:E-328 * Wednesday, 15. 1. 2025, 10:00-13:00, KN:E-328 * Tuesday, 21. 1. 2025, 10:00-13:00, KN:E-328 * Wednesday, 22. 1. 2025, 10:00-13:00, KN:E-328 * Monday, 27. 1. 2025, 10:00-13:00, KN:E-328 * Wednesday, 29. 1. 2025, 10:00-13:00, KN:E-328 * Monday, 10. 2. 2025, 10:00-13:00, KN:E-328 ===== Course Schedule and Materials ===== ^ # ^ date ^ Lecture ^ Practical class ^ HW deadline ^ | 1 | 23.09.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-01-introduction.pdf |Introduction: Big Data, NoSQL Databases}} | {{ :courses:be4m36ds2:b4m36ds2-lab-01-introduction.pdf |Relational DBS vs NoSQL DBS}} | | | 2 | 30.09.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-02-types_of_nosql.pdf |Types of NoSQL data stores. Polyglot persistence}} {{ :courses:b4m36ds2:b4m36ds2-lecture-02-data_formats.pdf |Data Formats: XML, JSON, BSON, RDF}} | {{ :courses:b4m36ds2:b4m36ds2-lab-01-formats.pdf|Types of NoSQL data stores. Polyglot persistence}} | HW0 6.10.2024 | | 3 | 07.10.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-03-principles.pdf |Basic Principles: Scaling, Sharding, Replication, CAP Theorem, Consistency}} | {{ :courses:b4m36ds2:b4m36ds2-lab-02-formats.pdf |Data Formats}} {{ :courses:be4m36ds2:lab-01-formats-solutions.zip | Solutions}} {{ :courses:be4m36ds2:lab2-example.zip | Example}} | HW1 20.10.2024 | | 4 | 14.10.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-04_-_redis.pdf |Key-value stores: Redis. Data types}} | {{ :courses:b4m36ds2:b4m36ds2-lab-04-redis.pdf |Key-value stores: Redis. Data types}} {{ :courses:be4m36ds2:solutions-4.txt |Solutions}}| HW2 3.11.2024 | | 5 | 21.10.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-05-redis2.pdf |Key-value stores: Redis}} | {{ :courses:b4m36ds2:b4m36ds2-lab-05-redis.pdf |Key-value stores: Redis}} {{ :courses:b4m36ds2:solutions-5.txt |Solutions}}| | | 6 | 28.10.2024 | State holiday | | | | 7 | 04.11.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-6-mongodb.pdf |Document Databases: MongoDB}} | {{ :courses:b4m36ds2:b4m36ds2-lab-06-mongodb.pdf |MongoDB}} | HW3 24.11.2024 | | 8 | 11.11.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-7-mongodb.pdf |Document Databases: MongoDB. Aggregation}} | {{ :courses:b4m36ds2:b4m36ds2-lab-7-mongodb.pdf |MongoDB}} {{ :courses:b4m36ds2:data_solutins-7.zip |Data and solutions}}| | | 9 | 18.11.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-08-cassandra.pdf |Wide Column Stores: Cassandra}} | {{ :courses:be4m36ds2:b4m36ds2-lab-11-mongodb_.pdf |MongoDB. Aggregation}} {{ :courses:be4m36ds2:data_solution10.zip |Data and Solutions}}| HW4 1.12.2024 | | 10 | 25.11.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-10-neo4j.pdf |Graph Databases: Neo4j. Traversal Framework}} | {{ :courses:be4m36ds2:b4m36ds2-lab-07-cassandra.pdf |Cassandra}} {{ :courses:be4m36ds2:solutions-9.txt |Solutions}}| | | 11 | 02.12.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-10-cypher.pdf |Graph Databases: Neo4j. Cypher}} | {{ :courses:b4m36ds2:b4m36ds2-lab-10-neo4j.pdf |Neo4j. Cypher}} {{ :courses:b4m36ds2:data_solution.zip |Data and Solutions}}| HW5 15.12.2024 | | 12 | 09.12.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-advanced.pdf |Advanced Aspects: Graph Databases, Performance Tuning}} | {{ :courses:b4m36ds2:b4m36ds2-lab-11-python-neo4j.pdf |Neo4j+Python}} {{ :courses:be4m36ds2:b4m36ds2-lab-12-neo4j-new.pdf |Neo4j+Java}} {{ :courses:b4m36ds2:examples-11.zip |Examples}}| | | 13 | 16.12.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-mapreduce.pdf |Apache Hadoop: MapReduce, HDFS}} | {{ :courses:be4m36ds2:b4m36ds2-lab-03-lecture-mapreduce.pdf |MapReduce, HDFS}} {{ :courses:be4m36ds2:libraries_java_files.zip |Libraries & Java files}} | HW6 22.12.2024 | | 14 | 06.01.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-sparql.pdf |RDF Stores. SPARQL}} | Demonstration of the project | | The course uses [[https://www.ksi.mff.cuni.cz/~svoboda/courses/211-B4M36DS2/| materials of Martin Svoboda]], the former lecturer of this course. ===== Tools for Practical Classes and Homework ===== JSON: [[https://codebeautify.org/jsonvalidator|JSON Validator]] Redis online editor: [[https://onecompiler.com/redis]] MongoDB: [[https://mongoplayground.net|Mongo Playground]] Neo4j: [[https://console.neo4j.org|Neo4j Console]] SPARQL: [[https://nosql.opendata.cz/sparql|SPARQL Query Editor]] ===== Formal Requirements ===== * **Attendance** during lectures and practical classes is **recommended** but not compulsory * **Homework** * Altogether **7 mandatory individual homework assignments** will be given during the semester (HW0 - HW6). They are parts of the same project * Everyone must **choose** their **distinct topic**, not later than 6. 10. 2024. The lecturer must accept the topic. * You can organize into **teams** of up to 3 students to work on the project * If a team performs the project, all quantity standards (data, queries, etc.) must be multiplied by the number of students * Solutions must be submitted via a script executed on the NoSQL server. The same script with a screenshot must be submitted to **BRUTE** * Solutions can be submitted even repeatedly (before the deadline); only **the latest version is assessed** * Once the lecturer assesses a given assignment, it cannot be resubmitted once again * **Delay** of one whole day **is penalized** by 1 point, shorter delays are penalized proportionally. Delay of two days or longer is penalized by 2 points * All the homework assignments must be submitted before the intended exam date to be considered * In this course, you can get up to 62 points: * Basic homework (mandatory) - up to 20 points * Project - up to 26 points * Bonus and extra assignments - up to 16 points * Since the maximum number of points that can be counted towards the final grade is 42, completing all the assignments is optional. You can choose the ones you are most interested in. * **Exam** * Only students with a course credit already acquired can sign up for the final exam * The **final exam** consists of a compulsory written test and an optional oral examination. * The **written test** consists of a theoretical part (open and multiple choice questions) and a practical part (exercises) * Having **less than 12 points** from theoretical and **18 points** from practical part **prevents from passing the exam successfully** * Based on the result, everyone can voluntarily undergo an **oral examination**. The only condition is to have at least 50 points * In such a case, the final score is further adjusted by **up to minus 10 to plus 5 points** * The examiner can also request the oral examination in case of uncertainties in the test ===== Assessment ===== The grade is calculated as the total points for homework, exam, and bonus points. ^ Assignment ^ Maximum ^ Minimum ^ | Homework | 32 | 20 | | Exam | 70 | 30 | | Bonus | 10 | 0 | ^ Total ^ 112 ^ 50 ^ Final grade: **90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E** ===== Exam Details ===== The exam will be conducted on computers in the classroom (KN E-328) using Moodle (https://moodle.fel.cvut.cz/course/view.php?id=8987) and consists of two parts: * Theoretical section (60 minutes): covers all theoretical concepts studied throughout the course * Practical section (90 minutes): focuses on working with four types of NoSQL systems: Redis, MongoDB, Cassandra, and Neo4j The exam includes: * Questions with either a single correct answer (single-choice) or multiple correct answers (multiple-choice) * Open-ended questions where students will primarily write queries for the provided dataset * Access to the NoSQL server for testing queries Additional notes: * Students may supplement their answers with handwritten illustrations or comments * Exam results will be posted on both Moodle and BRUTE within 2 days After receiving results, students may: * Request an oral examination (which can adjust their written exam score by ±5 points) * Register to retake the exam ===== Exam Requirements ===== ==== NoSQL Introduction ==== * **Big Data and NoSQL** terms, **V characteristics** (volume, variety, velocity, veracity, value, validity, volatility), **current trends** and challenges (Big Data, Big Users, processing paradigms, ...), principles of **relational databases** (functional dependencies, normal forms, transactions, ACID properties); **types of NoSQL systems** (key-value, wide column, document, graph, ...), their data models, features and use cases; **common features** of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, ...) ==== Data Formats ==== * **XML**: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML * **JSON**: constructs (object, array, value), types of values (strings, numbers, ...); **BSON**: document structure (elements, type selectors, property names and values) * **RDF**: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); **N-Triples notation** (RDF file, statements, triple components, literals, IRI references); **Turtle notation** (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals) * **CSV**: constructs (document, header, record, field) ==== MapReduce ==== * **Programming models**, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism) * **MapReduce**: programming model (data parallelism, map and reduce functions), **cluster architecture** (master, workers, message passing, data distribution), **map and reduce functions** (input arguments, emission and reduction of intermediate key-value pairs, final output), **data flow phases** (mapping, shuffling, reducing), input parsing (input file, split, record), **execution steps** (parsing, mapping, partitioning, combining, merging, reducing), **combine function** (commutativity, associativity), additional functions (input reader, partition, compare, output writer), **implementation details** (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...) * Apache **Hadoop**: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); **HDFS**: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); **MapReduce**: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema ==== NoSQL Principles ==== * **Scaling**: scalability definition; **vertical scaling** (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); **horizontal scaling** (scaling out/in), pros and cons, **network fallacies** (reliability, latency, bandwidth, security, ...), **cluster** architecture; design questions (scalability, availability, consistency, latency, durability, resilience) * **Distribution** models: **sharding**: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, ...); **replication**: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of **sharding and replication** * **CAP** theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, **ACID properties** (atomicity, consistency, isolation, durability), **BASE properties** (basically available, soft state, eventual consistency) * **Consistency**: strong vs. eventual consistency; **write consistency** (write-write conflict, context, pessimistic and optimistic strategies), **read consistency** (read-write conflict, context, inconsistency window, session consistency), **read and write quora** (formulae, motivation, workload balancing) ==== Key-Value Stores ==== * Data model (key-value pairs), **key management** (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, ...) * **Redis**: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), **string** commands (SET, GET, APPEND, SETRANGE, INCR, DEL, ...), **list** commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, ...), **set** commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), **sorted set** commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ...), **hash** commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), **general** commands (EXISTS, KEYS, DEL, RENAME, ...), **time-to-live** commands (EXPIRE, TTL, PERSIST) ==== Wide Column Stores ==== * Data model (column families, rows, columns), query patterns, use cases, representatives * **Cassandra**: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); **CQL** language: DDL statements: **CREATE KEYSPACE** (replication options), DROP KEYSPACE, USE keyspace, **CREATE TABLE** (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: **SELECT** statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), **INSERT** statements (update parameters: TTL, TIMESTAMP), **UPDATE** statements (assignments; modification of collections: additions, removals), **DELETE** statements (deletion of rows, removal of columns, removal of items from collections) ==== Document Stores ==== * Data model (documents), query patterns, use cases, representatives * **MongoDB**: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); **insert** operation (management of identifiers); **update** operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, ...; array: push, pop, ...); **save** operation (insert vs. replace mode); **remove** operation (justOne option); **find** operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, ...; element: exists; evaluation: regex, ...; logical: and, or, not; array: all, elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), **modifiers** (sort, skip, limit); **MapReduce** (map function, reduce function, options: query, sort, limit, out); primary and secondary **index structures** (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL) ==== Graph Databases ==== * Data model (property graphs), use cases, representatives * **Neo4j**: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; **traversal framework**: traversal description, **order** (breadth-first, depth-first, branch ordering policies), **expanders** (relationship types, directions), **uniqueness** (NODE_GLOBAL, RELATIONSHIP_GLOBAL, ...), **evaluators** (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); **Cypher** language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); **path patterns**, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); **MATCH** clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); **RETURN** clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); **WITH** clause (motivation, subclauses); **write clauses**: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions) ==== Advanced Aspects ==== * **Graph databases**: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data **structures** (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph **traversals**, **data locality** (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), **graph partitioning** (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing * **Performance tuning**: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model * **Polyglot persistence** ==== RDF Stores ==== * **Linked Data**: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud * **SPARQL**: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), **graph patterns** (basic, group, optional, alternative, graph, minus); **prologue declarations** (BASE, PREFIX clauses), **SELECT** queries (SELECT, FROM, and WHERE clauses), query **dataset** (default graph, named graphs), **variable assignments** (BIND), **FILTER** constraints (comparisons, logical connectives, accessors, tests, ...), **solution modifiers** (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), **query forms** (SELECT, ASK, DESCRIBE, CONSTRUCT) ===== Recommended Literature ===== * Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: [[http://www.ksi.mff.cuni.cz/bigdata/|Big Data a NoSQL databáze]].\\ ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).\\ Grada Publishing, a.s., 2015. * Sadalage, Pramod J. - Fowler, Martin: [[http://martinfowler.com/books/nosql.html|NoSQL Distilled]].\\ ISBN: 978-0-321-82662-6.\\ Pearson Education, Inc., 2013. * Wiese, Lena: [[https://www.degruyter.com/viewbooktoc/product/460529|Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases]].\\ ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).\\ DOI: [[http://doi.org/10.1515/9783110441413|10.1515/9783110441413]].\\ Walter de Gruyter GmbH, 2015. * Zomaya, Albert Y. - Sakr, Sherif: [[http://www.springer.com/gp/book/9783319493398|Handbook of Big Data Technologies]].\\ ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).\\ DOI: [[http://doi.org/10.1007/978-3-319-49340-4|10.1007/978-3-319-49340-4]].\\ Springer International Publishing AG, 2017.