====== B4M36DS2, BE4M36DS2: Database Systems 2 ====== ===== Basic Information ===== * Annotations: [[https://www.fel.cvut.cz/cz/education/bk/predmety/47/02/p4702006.html|B4M36DS2]], [[https://www.fel.cvut.cz/en/education/bk/predmety/48/78/p4878406.html|BE4M36DS2]] (English) * Lecturer and tutor: **Yuliia Prokop ** * Schedule: [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B231/public/html/predmety/47/02/p4702006.html|B4M36DS2]], [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B231/public/html/predmety/48/78/p4878406.html|BE4M36DS2]] ===== Course Schedule and Materials ===== ^ # ^ date ^ Lecture ^ Practical class ^ HW deadline ^ | 1 | 25.09.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-01-introduction.pdf |Introduction: Big Data, NoSQL Databases}} | {{ :courses:be4m36ds2:b4m36ds2-lab-01-introduction.pdf |Relational DBS vs NoSQL DBS}} | | | 2 | 02.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-02-formats.pdf |Data Formats: XML, JSON, BSON, RDF}} | {{ :courses:be4m36ds2:b4m36ds2-lab-02-formats.pdf |Data Formats}} {{ :courses:be4m36ds2:lab-01-formats-solutions.zip | Solutions}} {{ :courses:be4m36ds2:lab2-example.zip | Example}} | 8.10.2023 | | 3 | 09.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-03-mapreduce.pdf |Apache Hadoop: MapReduce, HDFS}} | {{ :courses:be4m36ds2:b4m36ds2-lab-03-lecture-mapreduce.pdf |MapReduce, HDFS}} {{ :courses:be4m36ds2:libraries_java_files.zip |Libraries & Java files}} | 22.10.2023 | | 4 | 16.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-04-principlescorrected.pdf |Basic Principles: Scaling, Sharding, Replication, CAP Theorem, Consistency}} | {{ :courses:be4m36ds2:b4m36ds2-lab-04-xpath_xquery.pdf |XML Databases}} {{ :courses:be4m36ds2:data_solutions-4.zip |data and solutions}} {{ :courses:be4m36ds2:b4m36ds2-lecture-03-xpath.pdf |XPath lecture 2022}} {{ :courses:be4m36ds2:b4m36ds2-lecture-04-xquery.pdf |XQuery lecture 2022}}| }} | 5 | 23.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-05-redis.pdf |Key-value stores: Redis. Data types}} | {{ :courses:be4m36ds2:b4m36ds2-lab-05-redis.pdf |Key-value stores: Redis. Data types}} {{ :courses:be4m36ds2:solutions-5.txt |Solutions}}| 5.11.2023 | | 6 | 30.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-06-redis-.pdf |Key-value stores: Redis}} | {{ :courses:be4m36ds2:b4m36ds2-lab-06-redis.pdf |Key-value stores: Redis}} {{ :courses:be4m36ds2:solutions-6.txt |Solutions}}| | | 7 | 06.11.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-07-cassandra.pdf |Wide Column Stores: Cassandra}} | {{ :courses:be4m36ds2:b4m36ds2-lab-07-lecture-cassandra.pdf |Cassandra}} {{ :courses:be4m36ds2:solutions-7.txt |Solutions}}| 19.11.2023 | | 8 | 13.11.2023 | {{ :courses:b4m36ds2:b4m36ds2-lecture-8-mongodb.pdf |Document Databases: MongoDB}} | {{ :courses:be4m36ds2:b4m36ds2-lab-08-mongodb.pdf |MongoDB}} {{ :courses:be4m36ds2:solutions-8.txt |Solutions}}| 10.12.2023 | | 9 | 20.11.2023 | Rector's day | Rector's day | | | 10 | 27.11.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-9-mongodb_.pdf |Document Databases: MongoDB. Aggregation}} | {{ :courses:be4m36ds2:b4m36ds2-lab-9-mongodb.pdf |MongoDB}} {{ :courses:be4m36ds2:data_solutins-9.zip |Data and solutions}}| | | 11 | 04.12.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-10-neo4j.pdf |Graph Databases: Neo4j. Traversal Framework}} | {{ :courses:be4m36ds2:b4m36ds2-lab-11-mongodb_.pdf |MongoDB. Aggregation}} {{ :courses:be4m36ds2:data_solution10.zip |Data and Solutions}}| | | 12 | 11.12.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-11-cypher.pdf |Graph Databases: Neo4j. Cypher}} | {{ :courses:be4m36ds2:b4m36ds2-lab-11-neo4j.pdf |Neo4j. Cypher}} {{ :courses:be4m36ds2:data_solution-11.zip |Data and Solutions}}| 31.12.2023 | | 13 | 18.12.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-advanced.pdf |Advanced Aspects: Graph Databases, Performance Tuning}} | {{ :courses:be4m36ds2:b4m36ds2-lab-12-neo4j-new.pdf |Neo4j}} | | | 14 | 08.01.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-sparql.pdf |RDF Stores. SPARQL}} | {{ :courses:be4m36ds2:b4m36ds2-lab-12-sparql.pdf |RDF Stores. SPARQL}} | | The course uses [[https://www.ksi.mff.cuni.cz/~svoboda/courses/211-B4M36DS2/| materials of Martin Svoboda]], the former lecturer of this course. ===== Tools for Practical Classes and Homework ===== XML: [[https://codebeautify.org/xmlvalidator|XML Validator]], [[https://videlibri.sourceforge.net/cgi-bin/xidelcgi|XPath and XQuery Processor]] JSON: [[https://codebeautify.org/jsonvalidator|JSON Validator]] MongoDB: : [[https://mongoplayground.net|Mongo Playground]] SPARQL: [[https://nosql.opendata.cz/sparql|SPARQL Query Editor]] ===== Formal Requirements ===== * **Attendance** during lectures and practical classes is **recommended** but not compulsory * Altogether **6 mandatory individual homework assignments** will be given during the semester * Everyone must **choose** their **distinct topic**, not later than 1. 10. 2023 * This topic must be reported to and explicitly accepted by the lecturer in advance * Possible topics could be: library, cinema, cookbook, university, flights, etc. (see the list on the Homework Assignments page for additional suitable topics, feel free to choose your own topic) * Your homework solutions must be **within the topic, original, realistic, and non-trivial** * Solutions can only be submitted via a script executed on the corresponding server * Report in .pdf must be submitted to **BRUTE** * **At most 130 points** in total can be gained **for all the homework assignments** * In case of any shortcomings, fewer points will be awarded appropriately * Solutions can be submitted even repeatedly (before the deadline), only **the latest version is assessed** * Once a given assignment is assessed by the lecturer, it cannot be resubmitted once again * **Delay** of one whole day **is penalized** by 5 points, shorter delays are penalized proportionally * Should the delay be even longer, the penalty stays the same and does not further increase * All the homework assignments must be submitted before the intended exam date in order to be considered * During some of the practical classes, **extra activity points** can be acquired, too * **At least 80 points** is required for the **course credit** to be granted * Half of all the points above 100 is transferred as **bonus points** to the exam * Only students with a course credit already acquired can sign up for the final exam * The **final exam** consists of a compulsory written test and an optional oral examination * **At most 100 points** can be acquired from the actual final **written test** * This test consists of a theoretical part (open and multiple choice questions) and a practical part (exercises) * Having **less than 30% points** from any of the two parts **prevents from passing the exam successfully** * The **final score** corresponds to the **sum of the written test and bonus points**, if any * Based on the result, everyone can voluntarily choose to undergo an **oral examination** * The only condition is to have at least 50 points from the test and bonus points * In such a case, the final score is further adjusted by **up to minus 10 to plus 5 points** * The oral examination can also be requested by the examiner in case of uncertainties in the test * Final grade: **90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E** ===== Exam Requirements ===== ==== NoSQL Introduction ==== * **Big Data and NoSQL** terms, **V characteristics** (volume, variety, velocity, veracity, value, validity, volatility), **current trends** and challenges (Big Data, Big Users, processing paradigms, ...), principles of **relational databases** (functional dependencies, normal forms, transactions, ACID properties); **types of NoSQL systems** (key-value, wide column, document, graph, ...), their data models, features and use cases; **common features** of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, ...) ==== Data Formats ==== * **XML**: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML * **JSON**: constructs (object, array, value), types of values (strings, numbers, ...); **BSON**: document structure (elements, type selectors, property names and values) * **RDF**: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); **N-Triples notation** (RDF file, statements, triple components, literals, IRI references); **Turtle notation** (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals) * **CSV**: constructs (document, header, record, field) ==== MapReduce ==== * **Programming models**, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism) * **MapReduce**: programming model (data parallelism, map and reduce functions), **cluster architecture** (master, workers, message passing, data distribution), **map and reduce functions** (input arguments, emission and reduction of intermediate key-value pairs, final output), **data flow phases** (mapping, shuffling, reducing), input parsing (input file, split, record), **execution steps** (parsing, mapping, partitioning, combining, merging, reducing), **combine function** (commutativity, associativity), additional functions (input reader, partition, compare, output writer), **implementation details** (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...) * Apache **Hadoop**: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); **HDFS**: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); **MapReduce**: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema ==== NoSQL Principles ==== * **Scaling**: scalability definition; **vertical scaling** (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); **horizontal scaling** (scaling out/in), pros and cons, **network fallacies** (reliability, latency, bandwidth, security, ...), **cluster** architecture; design questions (scalability, availability, consistency, latency, durability, resilience) * **Distribution** models: **sharding**: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, ...); **replication**: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of **sharding and replication** * **CAP** theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, **ACID properties** (atomicity, consistency, isolation, durability), **BASE properties** (basically available, soft state, eventual consistency) * **Consistency**: strong vs. eventual consistency; **write consistency** (write-write conflict, context, pessimistic and optimistic strategies), **read consistency** (read-write conflict, context, inconsistency window, session consistency), **read and write quora** (formulae, motivation, workload balancing) ==== Key-Value Stores ==== * Data model (key-value pairs), **key management** (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, ...) * **Redis**: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), **string** commands (SET, GET, APPEND, SETRANGE, INCR, DEL, ...), **list** commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, ...), **set** commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), **sorted set** commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ...), **hash** commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), **general** commands (EXISTS, KEYS, DEL, RENAME, ...), **time-to-live** commands (EXPIRE, TTL, PERSIST) ==== Wide Column Stores ==== * Data model (column families, rows, columns), query patterns, use cases, representatives * **Cassandra**: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); **CQL** language: DDL statements: **CREATE KEYSPACE** (replication options), DROP KEYSPACE, USE keyspace, **CREATE TABLE** (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: **SELECT** statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), **INSERT** statements (update parameters: TTL, TIMESTAMP), **UPDATE** statements (assignments; modification of collections: additions, removals), **DELETE** statements (deletion of rows, removal of columns, removal of items from collections) ==== Document Stores ==== * Data model (documents), query patterns, use cases, representatives * **MongoDB**: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); **insert** operation (management of identifiers); **update** operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, ...; array: push, pop, ...); **save** operation (insert vs. replace mode); **remove** operation (justOne option); **find** operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, ...; element: exists; evaluation: regex, ...; logical: and, or, not; array: all, elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), **modifiers** (sort, skip, limit); **MapReduce** (map function, reduce function, options: query, sort, limit, out); primary and secondary **index structures** (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL) ==== Graph Databases ==== * Data model (property graphs), use cases, representatives * **Neo4j**: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; **traversal framework**: traversal description, **order** (breadth-first, depth-first, branch ordering policies), **expanders** (relationship types, directions), **uniqueness** (NODE_GLOBAL, RELATIONSHIP_GLOBAL, ...), **evaluators** (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); **Cypher** language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); **path patterns**, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); **MATCH** clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); **RETURN** clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); **WITH** clause (motivation, subclauses); **write clauses**: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions) ==== Advanced Aspects ==== * **Graph databases**: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data **structures** (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph **traversals**, **data locality** (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), **graph partitioning** (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing * **Performance tuning**: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model * **Polyglot persistence** ==== XML Databases ==== * Native XML databases vs. XML-enabled relational databases; data model (**XDM**): tree (nodes for document, elements, attributes, texts, ...), document order, reverse document order, sequences, atomic values, singleton sequences * **XPath** language: **path** expressions (relative vs. absolute, evaluation algorithm), path step (axis, node test, predicates), **axes** (forward: child, descendant, following, ...; reverse: parent, ancestor, preceding, ...; attribute), **node tests**, **predicates** (path conditions, position testing, ...), abbreviations * **XQuery** language: path expressions, **direct constructors** (elements, attributes, nested queries, well-formedness), **computed constructors** (dynamic names), **FLWOR** expressions (for, let, where, order by, and return clauses), typical FLWOR use cases (joining, grouping, aggregation, integration, ...), **conditional** expressions (if, then, else), **switch** expressions (case, default, return), universal and existential **quantified** expressions (some, every, satisfies), **comparisons** (value, general, node; errors), atomization of values (elements, attributes) ==== RDF Stores ==== * **Linked Data**: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud * **SPARQL**: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), **graph patterns** (basic, group, optional, alternative, graph, minus); **prologue declarations** (BASE, PREFIX clauses), **SELECT** queries (SELECT, FROM, and WHERE clauses), query **dataset** (default graph, named graphs), **variable assignments** (BIND), **FILTER** constraints (comparisons, logical connectives, accessors, tests, ...), **solution modifiers** (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), **query forms** (SELECT, ASK, DESCRIBE, CONSTRUCT) ===== Recommended Literature ===== * Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: [[http://www.ksi.mff.cuni.cz/bigdata/|Big Data a NoSQL databáze]].\\ ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).\\ Grada Publishing, a.s., 2015. * Sadalage, Pramod J. - Fowler, Martin: [[http://martinfowler.com/books/nosql.html|NoSQL Distilled]].\\ ISBN: 978-0-321-82662-6.\\ Pearson Education, Inc., 2013. * Wiese, Lena: [[https://www.degruyter.com/viewbooktoc/product/460529|Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases]].\\ ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).\\ DOI: [[http://doi.org/10.1515/9783110441413|10.1515/9783110441413]].\\ Walter de Gruyter GmbH, 2015. * Zomaya, Albert Y. - Sakr, Sherif: [[http://www.springer.com/gp/book/9783319493398|Handbook of Big Data Technologies]].\\ ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).\\ DOI: [[http://doi.org/10.1007/978-3-319-49340-4|10.1007/978-3-319-49340-4]].\\ Springer International Publishing AG, 2017.