B4M36DS2, BE4M36DS2: Database Systems 2

Basic Information

Annotations: B4M36DS2, BE4M36DS2 (English)
Lecturer and tutor: Yuliia Prokop prokoyul@fel.cvut.cz
Schedule: B4M36DS2, BE4M36DS2
Upload system: BRUTE
VPN: FEL VPN, CTU VPN

Course Schedule and Materials

#	date	Lecture	Practical class	HW deadline
1	22.09.2025	Introduction: Big Data, NoSQL Databases	Multi-Model Data & JSONB in PostgreSQL
2	29.09.2025	Basic principles: Scaling, Sharding, Replication Data Formats: XML, JSON, BSON, RDF	Data Formats: XML, JSON, BSON, Parquet Exercise2 solutions	HW0 12.10.2025
3	06.10.2025	CAP Theorem, Consistency	Basic principles: Scaling, Sharding, Replication Dataset	BA#1 21.12.2025
4	13.10.2025	Apache Hadoop: MapReduce, HDFS	Apache Spark Dataset PySpark MapReduce, HDFS
5	20.10.2025	Types of NoSQL databases	Types of NoSQL databases	HW1 26.10.2025
6	27.10.2025	Key-value stores: Redis	Redis	HW2 9.11.2025, HW4 23.11.2025
7	03.11.2025	Document Databases: MongoDB. CRUD operations	MongoDB. CRUD operations Dataset+Solution
8	10.11.2025	Document Databases: MongoDB. Aggregation	MongoDB. Aggregation Dataset+Solution	HW3, HW4 ~~23.11.2025~~ 30.11.2025
9	17.11.2025	State holiday
10	24.11.2025	Wide Column Stores: Cassandra	Cassandra Solutions	HW5 7.12.2025
11	01.12.2025	Graph Databases: Traversal Framework Graph Databases: Neo4j.Cypher	Test (Redis, MongoDB) Neo4j.Cypher Data and Solutions	HW6 14.12.2025
12	08.12.2025	Advanced Aspects: Graph Databases, Performance Tuning	Neo4j+Python Neo4j+Java Examples
13	15.12.2025	Transactions. Query optimization	Query optimization	HW7 21.12.2025, HW8 4.1.2026
14	05.01.2025	RDF Stores. SPARQL	Demonstration of the project

The course uses materials of Martin Svoboda, the former lecturer of this course.

Tools for Practical Classes and Homework

JSON: JSON Validator

Redis online editor: Redis OneCompiler

MongoDB: Mongo Playground MongoDB OneCompiler

Neo4j: Neo4j Console

SPARQL: SPARQL Query Editor

PySpark: PySpark functions

Formal Requirements

Attendance during lectures and practical classes is recommended but not compulsory

Homework
- Altogether, 8 mandatory individual homework assignments will be given during the semester (HW0—HW7). They are parts of the same project
- There are also two optional homework assignments that can be submitted for extra points
- You can organize into teams of up to 3 students to work on the project
- If a team carries out the project, the number of queries must be increased (see description of the project)
- Solutions must be submitted via a script executed on the NoSQL server. The text report must be submitted to BRUTE
- Solutions can be submitted repeatedly (before the deadline); only the latest version is assessed
- Once the lecturer has assessed a given assignment, it cannot be resubmitted
- A delay of one whole day is penalized by 0.4 points; shorter delays are penalized proportionally. A delay of two days is penalized by 0.8 points, etc., up to 2 points
- All homework assignments must be submitted before the intended exam date to be considered

Exam
- Only students who have already acquired course credit can sign up for the final exam
- The final exam consists of a compulsory written test and an optional oral examination
- Based on the result, everyone can voluntarily undergo an oral examination. The only condition is to have at least 50 total points
- In such a case, the final score is further adjusted by up to minus 10 to plus 5 points
- The examiner can also request an oral examination in case of uncertainties in the test

Assessment

The grade is calculated as the total points for homework, exam, and bonus points.

Assignment	Maximum	Minimum
Homework	42	25
Optional HW	18	0
Tests	10	5
Practice	2	0
Exam	30	20
Total	102	50

Final grade: 90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E

Exam Details

The exam will be conducted on computers in the classroom (KN E-328) using Moodle (https://moodle.fel.cvut.cz/course/view.php?id=9765) :

all theoretical concepts studied throughout the course;
work with four types of NoSQL systems: Redis, MongoDB, Cassandra, and Neo4j;
time limit is 60 minutes.

The exam includes:

Questions with either a single correct answer (single-choice) or multiple correct answers (multiple-choice);
Open-ended questions.

Additional notes:

Students may supplement their answers with handwritten illustrations or comments
Exam results will be posted on both Moodle and BRUTE within 1 day

After receiving results, students may:

Request an oral examination (which can adjust their written exam score by ±5 points)
Register to retake the exam

Exam Requirements

NoSQL Introduction

Big Data and NoSQL terms, V characteristics (volume, variety, velocity, veracity, value), current trends and challenges (Big Data, Big Users, processing paradigms, …), principles of relational databases (functional dependencies, normal forms, transactions, ACID properties); types of NoSQL systems (key-value, wide column, document, graph, …), their data models, features and use cases; common features of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, …)

Data Formats

XML: constructs (element, attribute, text, …), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML
JSON: constructs (object, array, value), types of values (strings, numbers, …); BSON: document structure (elements, type selectors, property names and values)
RDF: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); N-Triples notation (RDF file, statements, triple components, literals, IRI references); Turtle notation (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals)
CSV: constructs (document, header, record, field)

MapReduce

Programming models, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism)
MapReduce: programming model (data parallelism, map and reduce functions), cluster architecture (master, workers, message passing, data distribution), map and reduce functions (input arguments, emission and reduction of intermediate key-value pairs, final output), data flow phases (mapping, shuffling, reducing), input parsing (input file, split, record), execution steps (parsing, mapping, partitioning, combining, merging, reducing), combine function (commutativity, associativity), additional functions (input reader, partition, compare, output writer), implementation details (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, …)
Apache Hadoop: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, …); HDFS: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, …); MapReduce: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema

NoSQL Principles

Scaling: scalability definition; vertical scaling (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, …); horizontal scaling (scaling out/in), pros and cons, network fallacies (reliability, latency, bandwidth, security, …), cluster architecture; design questions (scalability, availability, consistency, latency, durability, resilience)
Distribution models: sharding: idea, motivation, objectives (balanced distribution, workload, …), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, …); replication: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of sharding and replication
CAP theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, ACID properties (atomicity, consistency, isolation, durability), BASE properties (basically available, soft state, eventual consistency)
Consistency: strong vs. eventual consistency; write consistency (write-write conflict, context, pessimistic and optimistic strategies), read consistency (read-write conflict, context, inconsistency window, session consistency), read and write quora (formulae, motivation, workload balancing)

Key-Value Stores

Data model (key-value pairs), key management (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, …)
- Redis: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), string commands (SET, GET, APPEND, SETRANGE, INCR, DEL, …), list commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, …), set commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, …), sorted set commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, …), hash commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, …), general commands (EXISTS, KEYS, DEL, RENAME, …), time-to-live commands (EXPIRE, TTL, PERSIST)

Wide Column Stores

Data model (column families, rows, columns), query patterns, use cases, representatives
Cassandra: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); CQL language: DDL statements: CREATE KEYSPACE (replication options), DROP KEYSPACE, USE keyspace, CREATE TABLE (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, …); literals (atomic, collections, …); DML statements: SELECT statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), INSERT statements (update parameters: TTL, TIMESTAMP), UPDATE statements (assignments; modification of collections: additions, removals), DELETE statements (deletion of rows, removal of columns, removal of items from collections)

Document Stores

Data model (documents), query patterns, use cases, representatives
MongoDB: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); insert operation (management of identifiers); update operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, …; array: push, pop, …); save operation (insert vs. replace mode); remove operation (justOne option); find operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, …; element: exists; evaluation: regex, …; logical: and, or, not; array: all, elemMatch, …), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), modifiers (sort, skip, limit); MapReduce (map function, reduce function, options: query, sort, limit, out); primary and secondary index structures (index types: value, hashed, …; forms; properties: unique, partial, sparse, TTL)

Graph Databases

Data model (property graphs), use cases, representatives
Neo4j: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; traversal framework: traversal description, order (breadth-first, depth-first, branch ordering policies), expanders (relationship types, directions), uniqueness (NODE_GLOBAL, RELATIONSHIP_GLOBAL, …), evaluators (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, …; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); Cypher language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); path patterns, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); MATCH clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); RETURN clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); WITH clause (motivation, subclauses); write clauses: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions)

Advanced Aspects

Graph databases: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data structures (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph traversals, data locality (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), graph partitioning (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing
Performance tuning: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model
Polyglot persistence

RDF Stores

Linked Data: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud
SPARQL: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), graph patterns (basic, group, optional, alternative, graph, minus); prologue declarations (BASE, PREFIX clauses), SELECT queries (SELECT, FROM, and WHERE clauses), query dataset (default graph, named graphs), variable assignments (BIND), FILTER constraints (comparisons, logical connectives, accessors, tests, …), solution modifiers (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), query forms (SELECT, ASK, DESCRIBE, CONSTRUCT)

Recommended Literature

Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: Big Data a NoSQL databáze.

ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).
Grada Publishing, a.s., 2015.

Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled.

ISBN: 978-0-321-82662-6.
Pearson Education, Inc., 2013.

Wiese, Lena: *Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases.

ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).
DOI: 10.1515/9783110441413.
Walter de Gruyter GmbH, 2015.

Zomaya, Albert Y. - Sakr, Sherif: Handbook of Big Data Technologies.

ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).
DOI: 10.1007/978-3-319-49340-4.
Springer International Publishing AG, 2017.

Table of Contents