====== B4M36DS2, BE4M36DS2: Database Systems 2 ====== ===== Basic Information ===== * Annotations: [[https://www.fel.cvut.cz/cz/education/bk/predmety/47/02/p4702006.html|B4M36DS2]], [[https://www.fel.cvut.cz/en/education/bk/predmety/48/78/p4878406.html|BE4M36DS2]] (English) * Lecturer and tutor: **Yuliia Prokop ** * Schedule: [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B241/public/html/predmety/47/02/p4702006.html|B4M36DS2]], [[https://intranet.fel.cvut.cz/cz/education/rozvrhy-ng.B241/public/html/predmety/48/78/p4878406.html|BE4M36DS2]] * Upload system: [[https://cw.felk.cvut.cz/upload/|BRUTE]] * VPN: [[https://svti.fel.cvut.cz/cz/services/vpn.html|FEL VPN]], [[https://ist.cvut.cz/nase-sluzby/pripojeni-z-domova-vpn|CTU VPN]] ===== Course Schedule and Materials ===== ^ # ^ date ^ Lecture ^ Practical class ^ HW deadline ^ | 1 | 23.09.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-01-introduction.pdf |Introduction: Big Data, NoSQL Databases}} | {{ :courses:be4m36ds2:b4m36ds2-lab-01-introduction.pdf |Relational DBS vs NoSQL DBS}} | | | 2 | 30.09.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-02-types_of_nosql.pdf |Types of NoSQL data stores. Polyglot persistence}} {{ :courses:b4m36ds2:b4m36ds2-lecture-02-data_formats.pdf |Data Formats: XML, JSON, BSON, RDF}} | {{ :courses:b4m36ds2:b4m36ds2-lab-01-formats.pdf|Types of NoSQL data stores. Polyglot persistence}} | HW0 6.10.2024 | | 3 | 07.10.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-03-principles.pdf |Basic Principles: Scaling, Sharding, Replication, CAP Theorem, Consistency}} | {{ :courses:b4m36ds2:b4m36ds2-lab-02-formats.pdf |Data Formats}} {{ :courses:be4m36ds2:lab-01-formats-solutions.zip | Solutions}} {{ :courses:be4m36ds2:lab2-example.zip | Example}} | HW1 20.10.2024 | | 4 | 14.10.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-04_-_redis.pdf |Key-value stores: Redis. Data types}} | {{ :courses:b4m36ds2:b4m36ds2-lab-04-redis.pdf |Key-value stores: Redis. Data types}} {{ :courses:be4m36ds2:solutions-4.txt |Solutions}}| HW2 3.11.2024 | | 5 | 21.10.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-06-redis-.pdf |Key-value stores: Redis}} | {{ :courses:be4m36ds2:b4m36ds2-lab-06-redis.pdf |Key-value stores: Redis}} {{ :courses:be4m36ds2:solutions-6.txt |Solutions}}| | | 6 | 28.10.2024 | State holiday | | | | 7 | 11.11.2024 | {{ :courses:b4m36ds2:b4m36ds2-lecture-8-mongodb.pdf |Document Databases: MongoDB}} | {{ :courses:be4m36ds2:b4m36ds2-lab-08-mongodb.pdf |MongoDB}} {{ :courses:be4m36ds2:solutions-8.txt |Solutions}}| HW3 24.11.2024 | | 8 | 18.11.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-9-mongodb_.pdf |Document Databases: MongoDB. Aggregation}} | {{ :courses:be4m36ds2:b4m36ds2-lab-9-mongodb.pdf |MongoDB}} {{ :courses:be4m36ds2:data_solutins-9.zip |Data and solutions}}| | | 9 | 04.11.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-07-cassandra.pdf |Wide Column Stores: Cassandra}} | {{ :courses:be4m36ds2:b4m36ds2-lab-11-mongodb_.pdf |MongoDB. Aggregation}} {{ :courses:be4m36ds2:data_solution10.zip |Data and Solutions}}| HW4 1.12.2024 | | 10 | 25.11.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-10-neo4j.pdf |Graph Databases: Neo4j. Traversal Framework}} | {{ :courses:be4m36ds2:b4m36ds2-lab-07-lecture-cassandra.pdf |Cassandra}} {{ :courses:be4m36ds2:solutions-7.txt |Solutions}}| | | 11 | 02.12.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-11-cypher.pdf |Graph Databases: Neo4j. Cypher}} | {{ :courses:be4m36ds2:b4m36ds2-lab-11-neo4j.pdf |Neo4j. Cypher}} {{ :courses:be4m36ds2:data_solution-11.zip |Data and Solutions}}| HW5 15.12.2024 | | 12 | 09.12.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-advanced.pdf |Advanced Aspects: Graph Databases, Performance Tuning}} | {{ :courses:be4m36ds2:b4m36ds2-lab-12-neo4j-new.pdf |Neo4j}} | | | 13 | 16.12.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-03-mapreduce.pdf |Apache Hadoop: MapReduce, HDFS}} | {{ :courses:be4m36ds2:b4m36ds2-lab-03-lecture-mapreduce.pdf |MapReduce, HDFS}} {{ :courses:be4m36ds2:libraries_java_files.zip |Libraries & Java files}} | HW6 22.12.2024 | | 14 | 06.01.2024 | {{ :courses:be4m36ds2:b4m36ds2-lecture-12-sparql.pdf |RDF Stores. SPARQL}} | Demonstration of the project | | The course uses [[https://www.ksi.mff.cuni.cz/~svoboda/courses/211-B4M36DS2/| materials of Martin Svoboda]], the former lecturer of this course. ===== Tools for Practical Classes and Homework ===== JSON: [[https://codebeautify.org/jsonvalidator|JSON Validator]] MongoDB: [[https://mongoplayground.net|Mongo Playground]] Neo4j: [[https://console.neo4j.org|Neo4j Console]] SPARQL: [[https://nosql.opendata.cz/sparql|SPARQL Query Editor]] ===== Formal Requirements ===== * **Attendance** during lectures and practical classes is **recommended** but not compulsory * **Homework** * Altogether **7 mandatory individual homework assignments** will be given during the semester (HW0 - HW6). They are parts of the same project * Everyone must **choose** their **distinct topic**, not later than 6. 10. 2024. The lecturer must accept the topic. * You can organize into **teams** of up to 3 students to work on the project * If a team performs the project, all quantity standards (data, queries, etc.) must be multiplied by the number of students * Solutions must be submitted via a script executed on the NoSQL server. The same script with a screenshot must be submitted to **BRUTE** * Solutions can be submitted even repeatedly (before the deadline), only **the latest version is assessed** * Once the lecturer assesses a given assignment, it cannot be resubmitted once again * **Delay** of one whole day **is penalized** by 5 points, shorter delays are penalized proportionally. Should the delay be even longer, the penalty stays the same and does not further increase * All the homework assignments must be submitted before the intended exam date to be considered * During some of the practical classes, **extra activity points** can be acquired * Half of all the points above 100 is transferred as **bonus points** to the exam * **Exam** * Only students with a course credit already acquired can sign up for the final exam * The **final exam** consists of a compulsory written test and an optional oral examination. * The **written test** consists of a theoretical part (open and multiple choice questions) and a practical part (exercises) * Having **less than 30% points** from any of the two parts **prevents from passing the exam successfully** * Based on the result, everyone can voluntarily undergo an **oral examination**. The only condition is to have at least 50 points * In such a case, the final score is further adjusted by **up to minus 10 to plus 5 points** * The examiner can also request the oral examination in case of uncertainties in the test ===== Assessment ===== The grade is calculated as the total points for homework, exam, and bonus points. ^ Assignment ^ Maximum ^ Minimum ^ | Homework | 32 | 20 | | Exam | 70 | 30 | | Bonus | 10 | 0 | ^ Total ^ 112 ^ 50 ^ Final grade: **90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E** ===== Exam Requirements ===== ==== NoSQL Introduction ==== * **Big Data and NoSQL** terms, **V characteristics** (volume, variety, velocity, veracity, value, validity, volatility), **current trends** and challenges (Big Data, Big Users, processing paradigms, ...), principles of **relational databases** (functional dependencies, normal forms, transactions, ACID properties); **types of NoSQL systems** (key-value, wide column, document, graph, ...), their data models, features and use cases; **common features** of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, ...) ==== Data Formats ==== * **XML**: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML * **JSON**: constructs (object, array, value), types of values (strings, numbers, ...); **BSON**: document structure (elements, type selectors, property names and values) * **RDF**: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); **N-Triples notation** (RDF file, statements, triple components, literals, IRI references); **Turtle notation** (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals) * **CSV**: constructs (document, header, record, field) ==== MapReduce ==== * **Programming models**, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism) * **MapReduce**: programming model (data parallelism, map and reduce functions), **cluster architecture** (master, workers, message passing, data distribution), **map and reduce functions** (input arguments, emission and reduction of intermediate key-value pairs, final output), **data flow phases** (mapping, shuffling, reducing), input parsing (input file, split, record), **execution steps** (parsing, mapping, partitioning, combining, merging, reducing), **combine function** (commutativity, associativity), additional functions (input reader, partition, compare, output writer), **implementation details** (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...) * Apache **Hadoop**: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); **HDFS**: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); **MapReduce**: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema ==== NoSQL Principles ==== * **Scaling**: scalability definition; **vertical scaling** (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); **horizontal scaling** (scaling out/in), pros and cons, **network fallacies** (reliability, latency, bandwidth, security, ...), **cluster** architecture; design questions (scalability, availability, consistency, latency, durability, resilience) * **Distribution** models: **sharding**: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, ...); **replication**: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of **sharding and replication** * **CAP** theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, **ACID properties** (atomicity, consistency, isolation, durability), **BASE properties** (basically available, soft state, eventual consistency) * **Consistency**: strong vs. eventual consistency; **write consistency** (write-write conflict, context, pessimistic and optimistic strategies), **read consistency** (read-write conflict, context, inconsistency window, session consistency), **read and write quora** (formulae, motivation, workload balancing) ==== Key-Value Stores ==== * Data model (key-value pairs), **key management** (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, ...) * **Redis**: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), **string** commands (SET, GET, APPEND, SETRANGE, INCR, DEL, ...), **list** commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, ...), **set** commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), **sorted set** commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ...), **hash** commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), **general** commands (EXISTS, KEYS, DEL, RENAME, ...), **time-to-live** commands (EXPIRE, TTL, PERSIST) ==== Wide Column Stores ==== * Data model (column families, rows, columns), query patterns, use cases, representatives * **Cassandra**: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); **CQL** language: DDL statements: **CREATE KEYSPACE** (replication options), DROP KEYSPACE, USE keyspace, **CREATE TABLE** (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: **SELECT** statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), **INSERT** statements (update parameters: TTL, TIMESTAMP), **UPDATE** statements (assignments; modification of collections: additions, removals), **DELETE** statements (deletion of rows, removal of columns, removal of items from collections) ==== Document Stores ==== * Data model (documents), query patterns, use cases, representatives * **MongoDB**: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); **insert** operation (management of identifiers); **update** operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, ...; array: push, pop, ...); **save** operation (insert vs. replace mode); **remove** operation (justOne option); **find** operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, ...; element: exists; evaluation: regex, ...; logical: and, or, not; array: all, elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), **modifiers** (sort, skip, limit); **MapReduce** (map function, reduce function, options: query, sort, limit, out); primary and secondary **index structures** (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL) ==== Graph Databases ==== * Data model (property graphs), use cases, representatives * **Neo4j**: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; **traversal framework**: traversal description, **order** (breadth-first, depth-first, branch ordering policies), **expanders** (relationship types, directions), **uniqueness** (NODE_GLOBAL, RELATIONSHIP_GLOBAL, ...), **evaluators** (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); **Cypher** language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); **path patterns**, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); **MATCH** clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); **RETURN** clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); **WITH** clause (motivation, subclauses); **write clauses**: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions) ==== Advanced Aspects ==== * **Graph databases**: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data **structures** (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph **traversals**, **data locality** (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), **graph partitioning** (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing * **Performance tuning**: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model * **Polyglot persistence** ==== RDF Stores ==== * **Linked Data**: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud * **SPARQL**: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), **graph patterns** (basic, group, optional, alternative, graph, minus); **prologue declarations** (BASE, PREFIX clauses), **SELECT** queries (SELECT, FROM, and WHERE clauses), query **dataset** (default graph, named graphs), **variable assignments** (BIND), **FILTER** constraints (comparisons, logical connectives, accessors, tests, ...), **solution modifiers** (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), **query forms** (SELECT, ASK, DESCRIBE, CONSTRUCT) ===== Recommended Literature ===== * Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: [[http://www.ksi.mff.cuni.cz/bigdata/|Big Data a NoSQL databáze]].\\ ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).\\ Grada Publishing, a.s., 2015. * Sadalage, Pramod J. - Fowler, Martin: [[http://martinfowler.com/books/nosql.html|NoSQL Distilled]].\\ ISBN: 978-0-321-82662-6.\\ Pearson Education, Inc., 2013. * Wiese, Lena: [[https://www.degruyter.com/viewbooktoc/product/460529|Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases]].\\ ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).\\ DOI: [[http://doi.org/10.1515/9783110441413|10.1515/9783110441413]].\\ Walter de Gruyter GmbH, 2015. * Zomaya, Albert Y. - Sakr, Sherif: [[http://www.springer.com/gp/book/9783319493398|Handbook of Big Data Technologies]].\\ ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).\\ DOI: [[http://doi.org/10.1007/978-3-319-49340-4|10.1007/978-3-319-49340-4]].\\ Springer International Publishing AG, 2017.