====== B4M36DS2, BE4M36DS2: Database Systems 2 ======
===== Basic Information =====

  * Annotations: [[https://www.fel.cvut.cz/cz/education/bk/predmety/47/02/p4702006.html|B4M36DS2]], [[https://www.fel.cvut.cz/en/education/bk/predmety/48/78/p4878406.html|BE4M36DS2]] (English)
  * Lecturer and tutor: **Yuliia Prokop <prokoyul@fel.cvut.cz>**
  * Schedule: [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B231/public/html/predmety/47/02/p4702006.html|B4M36DS2]], [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B231/public/html/predmety/48/78/p4878406.html|BE4M36DS2]]


===== Course Schedule and Materials =====

^ # ^ date ^ Lecture ^ Practical class ^ HW deadline ^
| 1  | 25.09.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-01-introduction.pdf |Introduction: Big Data, NoSQL Databases}} | {{ :courses:be4m36ds2:b4m36ds2-lab-01-introduction.pdf |Relational DBS vs NoSQL DBS}} |  |
| 2  | 02.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-02-formats.pdf |Data Formats: XML, JSON, BSON, RDF}} | {{ :courses:be4m36ds2:b4m36ds2-lab-02-formats.pdf |Data Formats}} {{ :courses:be4m36ds2:lab-01-formats-solutions.zip | Solutions}} {{ :courses:be4m36ds2:lab2-example.zip | Example}} | 8.10.2023 |
| 3  | 09.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-03-mapreduce.pdf |Apache Hadoop: MapReduce, HDFS}} | {{ :courses:be4m36ds2:b4m36ds2-lab-03-lecture-mapreduce.pdf |MapReduce, HDFS}} {{ :courses:be4m36ds2:libraries_java_files.zip |Libraries & Java files}} | 22.10.2023 |
| 4  | 16.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-04-principlescorrected.pdf |Basic Principles: Scaling, Sharding, Replication, CAP Theorem, Consistency}} | {{ :courses:be4m36ds2:b4m36ds2-lab-04-xpath_xquery.pdf |XML Databases}} {{ :courses:be4m36ds2:data_solutions-4.zip |data and solutions}} {{ :courses:be4m36ds2:b4m36ds2-lecture-03-xpath.pdf |XPath lecture 2022}} {{ :courses:be4m36ds2:b4m36ds2-lecture-04-xquery.pdf |XQuery lecture 2022}}| }}
| 5  | 23.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-05-redis.pdf |Key-value stores: Redis. Data types}} | {{ :courses:be4m36ds2:b4m36ds2-lab-05-redis.pdf |Key-value stores: Redis. Data types}} {{ :courses:be4m36ds2:solutions-5.txt |Solutions}}| 5.11.2023 |
| 6  | 30.10.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-06-redis-.pdf |Key-value stores: Redis}} | {{ :courses:be4m36ds2:b4m36ds2-lab-06-redis.pdf |Key-value stores: Redis}} {{ :courses:be4m36ds2:solutions-6.txt |Solutions}}|  |
| 7  | 06.11.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-07-cassandra.pdf |Wide Column Stores: Cassandra}} | {{ :courses:be4m36ds2:b4m36ds2-lab-07-lecture-cassandra.pdf |Cassandra}} {{ :courses:be4m36ds2:solutions-7.txt |Solutions}}| 19.11.2023 |
| 8  | 13.11.2023 | {{ :courses:b4m36ds2:b4m36ds2-lecture-8-mongodb.pdf |Document Databases: MongoDB}} | {{ :courses:be4m36ds2:b4m36ds2-lab-08-mongodb.pdf |MongoDB}}  {{ :courses:be4m36ds2:solutions-8.txt |Solutions}}| 10.12.2023 |
| 9  | 20.11.2023 | Rector's day | Rector's day |  |
| 10 | 27.11.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-9-mongodb_.pdf |Document Databases: MongoDB. Aggregation}}  | {{ :courses:be4m36ds2:b4m36ds2-lab-9-mongodb.pdf |MongoDB}} {{ :courses:be4m36ds2:data_solutins-9.zip |Data and solutions}}|  |
| 11 | 04.12.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-10-neo4j.pdf |Graph Databases: Neo4j. Traversal Framework}} | {{ :courses:be4m36ds2:b4m36ds2-lab-11-mongodb_.pdf |MongoDB. Aggregation}} {{ :courses:be4m36ds2:data_solution10.zip |Data and Solutions}}|  |
| 12 | 11.12.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-11-cypher.pdf |Graph Databases: Neo4j. Cypher}} | {{ :courses:be4m36ds2:b4m36ds2-lab-11-neo4j.pdf |Neo4j. Cypher}} {{ :courses:be4m36ds2:data_solution-11.zip |Data and Solutions}}| 31.12.2023 |
| 13 | 18.12.2023 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-advanced.pdf |Advanced Aspects: Graph Databases, Performance Tuning}} | {{ :courses:be4m36ds2:b4m36ds2-lab-12-neo4j-new.pdf |Neo4j}} |  |
| 14 | 08.01.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-sparql.pdf |RDF Stores. SPARQL}} | {{ :courses:be4m36ds2:b4m36ds2-lab-12-sparql.pdf |RDF Stores. SPARQL}} |  |

The course uses [[https://www.ksi.mff.cuni.cz/~svoboda/courses/211-B4M36DS2/| materials of Martin Svoboda]], the former lecturer of this course.
===== Tools for Practical Classes and Homework =====

XML: [[https://codebeautify.org/xmlvalidator|XML Validator]], [[https://videlibri.sourceforge.net/cgi-bin/xidelcgi|XPath and XQuery Processor]]

JSON: [[https://codebeautify.org/jsonvalidator|JSON Validator]]

MongoDB: : [[https://mongoplayground.net|Mongo Playground]]

SPARQL: [[https://nosql.opendata.cz/sparql|SPARQL Query Editor]]

===== Formal Requirements =====

  * **Attendance** during lectures and practical classes is **recommended** but not compulsory
  * Altogether **6 mandatory individual homework assignments** will be given during the semester
  * Everyone must **choose** their **distinct topic**, not later than 1. 10. 2023
  * This topic must be reported to and explicitly accepted by the lecturer in advance
  * Possible topics could be: library, cinema, cookbook, university, flights, etc. (see the list on the Homework Assignments page for additional suitable topics, feel free to choose your own topic)
  * Your homework solutions must be **within the topic, original, realistic, and non-trivial**
  * Solutions can only be submitted via a script executed on the corresponding server
  * Report in .pdf must be submitted to **BRUTE**
  * **At most 130 points** in total can be gained **for all the homework assignments**
  * In case of any shortcomings, fewer points will be awarded appropriately
  * Solutions can be submitted even repeatedly (before the deadline), only **the latest version is assessed**
  * Once a given assignment is assessed by the lecturer, it cannot be resubmitted once again
  * **Delay** of one whole day **is penalized** by 5 points, shorter delays are penalized proportionally
  * Should the delay be even longer, the penalty stays the same and does not further increase
  * All the homework assignments must be submitted before the intended exam date in order to be considered
  * During some of the practical classes, **extra activity points** can be acquired, too
  * **At least 80 points** is required for the **course credit** to be granted
  * Half of all the points above 100 is transferred as **bonus points** to the exam
  * Only students with a course credit already acquired can sign up for the final exam
  * The **final exam** consists of a compulsory written test and an optional oral examination
  * **At most 100 points** can be acquired from the actual final **written test**
  * This test consists of a theoretical part (open and multiple choice questions) and a practical part (exercises)
  * Having **less than 30% points** from any of the two parts **prevents from passing the exam successfully**
  * The **final score** corresponds to the **sum of the written test and bonus points**, if any
  * Based on the result, everyone can voluntarily choose to undergo an **oral examination**
  * The only condition is to have at least 50 points from the test and bonus points
  * In such a case, the final score is further adjusted by **up to minus 10 to plus 5 points**
  * The oral examination can also be requested by the examiner in case of uncertainties in the test
  * Final grade: **90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E**

===== Exam Requirements =====

==== NoSQL Introduction ====

  * **Big Data and NoSQL** terms, **V characteristics** (volume, variety, velocity, veracity, value, validity, volatility), **current trends** and challenges (Big Data, Big Users, processing paradigms, ...), principles of **relational databases** (functional dependencies, normal forms, transactions, ACID properties); **types of NoSQL systems** (key-value, wide column, document, graph, ...), their data models, features and use cases; **common features** of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, ...)

==== Data Formats ====

  * **XML**: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML
  * **JSON**: constructs (object, array, value), types of values (strings, numbers, ...); **BSON**: document structure (elements, type selectors, property names and values)
  * **RDF**: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); **N-Triples notation** (RDF file, statements, triple components, literals, IRI references); **Turtle notation** (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals)
  * **CSV**: constructs (document, header, record, field)

==== MapReduce ====

  * **Programming models**, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism)
  * **MapReduce**: programming model (data parallelism, map and reduce functions), **cluster architecture** (master, workers, message passing, data distribution), **map and reduce functions** (input arguments, emission and reduction of intermediate key-value pairs, final output), **data flow phases** (mapping, shuffling, reducing), input parsing (input file, split, record), **execution steps** (parsing, mapping, partitioning, combining, merging, reducing), **combine function** (commutativity, associativity), additional functions (input reader, partition, compare, output writer), **implementation details** (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...)
  * Apache **Hadoop**: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); **HDFS**: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); **MapReduce**: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema

==== NoSQL Principles ====

  * **Scaling**: scalability definition; **vertical scaling** (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); **horizontal scaling** (scaling out/in), pros and cons, **network fallacies** (reliability, latency, bandwidth, security, ...), **cluster** architecture; design questions (scalability, availability, consistency, latency, durability, resilience)
  * **Distribution** models: **sharding**: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, ...); **replication**: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of **sharding and replication**
  * **CAP** theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, **ACID properties** (atomicity, consistency, isolation, durability), **BASE properties** (basically available, soft state, eventual consistency)
  * **Consistency**: strong vs. eventual consistency; **write consistency** (write-write conflict, context, pessimistic and optimistic strategies), **read consistency** (read-write conflict, context, inconsistency window, session consistency), **read and write quora** (formulae, motivation, workload balancing)

==== Key-Value Stores ====

  * Data model (key-value pairs), **key management** (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, ...)
    * **Redis**: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), **string** commands (SET, GET, APPEND, SETRANGE, INCR, DEL, ...), **list** commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, ...), **set** commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), **sorted set** commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ...), **hash** commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), **general** commands (EXISTS, KEYS, DEL, RENAME, ...), **time-to-live** commands (EXPIRE, TTL, PERSIST)

==== Wide Column Stores ====

  * Data model (column families, rows, columns), query patterns, use cases, representatives
  * **Cassandra**: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); **CQL** language: DDL statements: **CREATE KEYSPACE** (replication options), DROP KEYSPACE, USE keyspace, **CREATE TABLE** (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: **SELECT** statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), **INSERT** statements (update parameters: TTL, TIMESTAMP), **UPDATE** statements (assignments; modification of collections: additions, removals), **DELETE** statements (deletion of rows, removal of columns, removal of items from collections)

==== Document Stores ====

  * Data model (documents), query patterns, use cases, representatives
  * **MongoDB**: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); **insert** operation (management of identifiers); **update** operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, ...; array: push, pop, ...); **save** operation (insert vs. replace mode); **remove** operation (justOne option); **find** operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, ...; element: exists; evaluation: regex, ...; logical: and, or, not; array: all, elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), **modifiers** (sort, skip, limit); **MapReduce** (map function, reduce function, options: query, sort, limit, out); primary and secondary **index structures** (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL)

==== Graph Databases ====

  * Data model (property graphs), use cases, representatives
  * **Neo4j**: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; **traversal framework**: traversal description, **order** (breadth-first, depth-first, branch ordering policies), **expanders** (relationship types, directions), **uniqueness** (NODE_GLOBAL, RELATIONSHIP_GLOBAL, ...), **evaluators** (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); **Cypher** language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); **path patterns**, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); **MATCH** clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); **RETURN** clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); **WITH** clause (motivation, subclauses); **write clauses**: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions)

==== Advanced Aspects ====

  * **Graph databases**: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data **structures** (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph **traversals**, **data locality** (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), **graph partitioning** (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing
  * **Performance tuning**: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model
  * **Polyglot persistence**

==== XML Databases ====

  * Native XML databases vs. XML-enabled relational databases; data model (**XDM**): tree (nodes for document, elements, attributes, texts, ...), document order, reverse document order, sequences, atomic values, singleton sequences
  * **XPath** language: **path** expressions (relative vs. absolute, evaluation algorithm), path step (axis, node test, predicates), **axes** (forward: child, descendant, following, ...; reverse: parent, ancestor, preceding, ...; attribute), **node tests**, **predicates** (path conditions, position testing, ...), abbreviations
  * **XQuery** language: path expressions, **direct constructors** (elements, attributes, nested queries, well-formedness), **computed constructors** (dynamic names), **FLWOR** expressions (for, let, where, order by, and return clauses), typical FLWOR use cases (joining, grouping, aggregation, integration, ...), **conditional** expressions (if, then, else), **switch** expressions (case, default, return), universal and existential **quantified** expressions (some, every, satisfies), **comparisons** (value, general, node; errors), atomization of values (elements, attributes)

==== RDF Stores ====

  * **Linked Data**: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud
  * **SPARQL**: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), **graph patterns** (basic, group, optional, alternative, graph, minus); **prologue declarations** (BASE, PREFIX clauses), **SELECT** queries (SELECT, FROM, and WHERE clauses), query **dataset** (default graph, named graphs), **variable assignments** (BIND), **FILTER** constraints (comparisons, logical connectives, accessors, tests, ...), **solution modifiers** (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), **query forms** (SELECT, ASK, DESCRIBE, CONSTRUCT)

===== Recommended Literature =====

  * Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: [[http://www.ksi.mff.cuni.cz/bigdata/|Big Data a NoSQL databáze]].\\
ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).\\
Grada Publishing, a.s., 2015.
  * Sadalage, Pramod J. - Fowler, Martin: [[http://martinfowler.com/books/nosql.html|NoSQL Distilled]].\\
ISBN: 978-0-321-82662-6.\\
Pearson Education, Inc., 2013.
  * Wiese, Lena: [[https://www.degruyter.com/viewbooktoc/product/460529|Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases]].\\
ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).\\
DOI: [[http://doi.org/10.1515/9783110441413|10.1515/9783110441413]].\\
Walter de Gruyter GmbH, 2015.
  * Zomaya, Albert Y. - Sakr, Sherif: [[http://www.springer.com/gp/book/9783319493398|Handbook of Big Data Technologies]].\\
ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).\\
DOI: [[http://doi.org/10.1007/978-3-319-49340-4|10.1007/978-3-319-49340-4]].\\
Springer International Publishing AG, 2017.