B4M36DS2, BE4M36DS2: Database Systems 2

Basic Information

Annotations: B4M36DS2, BE4M36DS2 (English)
Lecturer and tutor: Yuliia Prokop
Schedule: B4M36DS2, BE4M36DS2
- Lectures: Monday 9:15 - 10:45 (KN:E-301) (English)
- Practical classes (group 101): Monday 12:45 - 14:15 (KN:E-328) (Czech)
- Practical classes (group 102): Monday 14:30 - 16:00 (KN:E-328) (English)
- Practical classes (group 103): Monday 16:15 - 17:45 (KN:E-328) (Czech)
Table with points from practical classes, homework assignments and exam tests UPDATED

Exam Dates

Thursday 12. 1. 2023: 14:00 - 15:30 (online) Results 12/1/2023 UPDATED
Questions and (optional) oral examination - Monday 23. 01. 2023 : 9:15 - 12:00 (KN:E-301) or Wednesday 1. 2. 2023 : 9:15 - 12:00 (KN:E-328)

Monday 16. 1. 2023: 9:15 - 11:45 (KN:E-301) Results 16/1/2023
Questions and (optional) oral examination - Monday 23. 01. 2023 : 9:15 - 12:00 (KN:E-301) or Wednesday 1. 2. 2023 : 9:15 - 12:00 (KN:E-328)

Monday 23. 1. 2023: 9:15 - 11:45 (KN:E-301) Results 23/1/2023
Questions and (optional) oral examination - Wednesday 1. 2. 2023 : 9:15 - 12:00 (KN:E-328) or Wednesday 15. 2. 2023: 9:15 - 11:45 (KN:E-328)

Wednesday 1. 2. 2023: 9:15 - 11:45 (KN:E-328) Results 1/2/2023
Questions and (optional) oral examination - Wednesday 15. 2. 2023 : 9:15 - 12:00 (KN:E-328)

Wednesday 15. 2. 2023: 9:15 - 11:45 (KN:E-328) Results 15-16/2/2023

Homework Deadlines

00 - Topic selection: Monday 4. 10. 2022 until 23:59
01 - XPath: Monday 10. 10. 2022 until 23:59
02 - XQuery: Monday 17. 10. 2022 until 23:59
03 - SPARQL: Monday 24. 10. 2022 until 23:59
04 - MapReduce: Monday 7. 11. 2022 until 23:59
05 - Redis: Monday 7. 11. 2022 until 23:59
06 - Cassandra: Monday 14. 11. 2022 until 23:59
07 - MongoDB: Monday 28. 11. 2022 until 23:59
08 - MongoDB-2: Monday 5. 12. 2022 until 23:59
09 - Neo4j: Monday 12. 12. 2022 until 23:59

Lectures

19. 09. 2022: 01 - Introduction: Big Data, NoSQL Databases
26. 09. 2022: 02 - Data Formats: XML, JSON, BSON, RDF
03. 10. 2022: 03 - XML Databases: XPath
10. 10. 2022: 04 - XML Databases: XQuery
17. 10. 2022: 05 - RDF Stores: SPARQL
24. 10. 2022: 06 - Apache Hadoop: MapReduce, HDFS
31. 10. 2022: 07 - Basic Principles: Scaling, Sharding, Replication, CAP Theorem, Consistency
07. 11. 2022: 08 - Wide Column Stores: Cassandra: CQL
14. 11. 2022: 09 - Document Databases: MongoDB
21. 11. 2022: 10 - Document Databases: MongoDB: Aggregation
28. 11. 2022: 11 - Graph Databases: Neo4j: Traversal Framework
05. 12. 2022: 12 - Graph Databases: Neo4j: Cypher
12. 12. 2022: 13 - Advanced Aspects: Graph Databases, Performance Tuning
09. 01. 2023: Cancelled

Practical Classes

19. 09. 2022: 00 - Organization
26. 09. 2022: 01 - Formats
- Tools: XML Editor, JSON Editor, RDF Editor
- Solutions: Solutions
03. 10. 2022: 02 - XPath
- Data files: data.xml
- Tools: XPath and XQuery Processor
- Solutions: Solutions
10. 10. 2022: 03 - XQuery
- Data files: data.xml
- Tools: XPath and XQuery Processor
- Solutions: Solutions
17. 10. 2022: 04 - SPARQL
- Data files: data.ttl
- Solutions: Solutions
- SPARQL endpoint: https://nosql.opendata.cz/sparql
24. 10. 2022: 05 - MapReduce
- Source files: WordCount.java, InvertedIndex.java
- See /home/DS2/mapreduce/ directory for input data and Hadoop libraries
31. 10. 2022: 06 - Redis
07. 11. 2022: 07 - Cassandra
14. 11. 2022: 08 - MongoDB
21. 11. 2022: 09 - MongoDB
- Data file: data.js
- Solutions: queries.js
28. 11. 2022: 10 - MongoDB
- Data files: users.js, checkin.js
- Solutions: queries.js
05. 12. 2022: 11 - Neo4j
- Data files: data.cypher
- Solutions: queries.cypher
12. 12. 2022: 12 - Neo4j
- Solutions: MyNeo4jApp.java
09. 01. 2023: Cancelled

Formal Requirements

Attendance during lectures and practical classes is recommended but not compulsory
Altogether 9 individual homework assignments will be given during the semester
Everyone must choose their distinct topic, not later than during the XPath practical class
This topic must be reported to and explicitly accepted by the lecturer in advance
Possible topics could be: library, cinema, cookbook, university, flights, etc.
See the list below for additional suitable topics, feel free to choose your own topic
Your homework solutions must be within the topic, original, realistic, and non-trivial
Solutions can only be submitted via a script executed on the corresponding server
At most 150 points in total can be gained for all the homework assignments
Solutions are awarded up to 20, 15 or 10 points, depending on the assignment
In case of any shortcomings, fewer points will be awarded appropriately
Solutions can be submitted even repeatedly, only the latest version is assessed
Once a given assignment has been assessed by the lecturer, it cannot be resubmitted
Delay of one whole day is penalized by 5 points, shorter delays are penalized proportionally
Should the delay be even longer, the penalty stays the same and does not further increase
All the homework assignments must be submitted before the intended exam date in order to be considered
None of the homework assignments is compulsory, yet you are encouraged to submit all of them
During some of the practical classes, extra activity points can be acquired, too
At least 130 points are required for the course credit to be granted
Half of all the points above this boundary is transferred as bonus points to the exam
Only students with a course credit already acquired can sign up for the final exam
The final exam consists of a compulsory written test and an optional oral examination
At most 100 points can be acquired from the actual final written test
This test consists of a theoretical part (open and multiple choice questions) and a practical part (exercises)
Scoring less than 30% of the points in either of the two parts means failing the exam
The final score corresponds to the sum of the written test and bonus points, if any
Based on the result, everyone can voluntarily choose to undergo an oral examination
The only condition is to have at least 50 points from the test and bonus points
In such a case, the final score is further adjusted by up to minus 10 to plus 5 points
The oral examination can also be requested by the examiner in case of uncertainties in the test
Final grade: 90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E
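
The point arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not an official calculator: the function names are hypothetical, the grade "F" for a score below 50 is an assumption (the rules above only list A to E), and no rounding of the bonus is applied because none is specified. The 30% minimum for each test part is not modelled here.

```python
CREDIT_THRESHOLD = 130  # minimum homework points for the course credit


def late_penalty(delay_hours: float) -> float:
    """5 points for a whole day, proportional for shorter delays, capped at 5."""
    if delay_hours <= 0:
        return 0.0
    return min(5.0, 5.0 * delay_hours / 24.0)


def exam_bonus(homework_points: float) -> float:
    """Half of the points above the credit threshold (0 below the threshold)."""
    return max(0.0, homework_points - CREDIT_THRESHOLD) / 2


def final_score(test_points: float, homework_points: float) -> float:
    """Written test plus bonus points, before any oral examination adjustment."""
    return test_points + exam_bonus(homework_points)


def final_grade(score: float) -> str:
    """Map a final score to a grade; 'F' below 50 is an assumption."""
    for limit, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D"), (50, "E")):
        if score >= limit:
            return grade
    return "F"


if __name__ == "__main__":
    score = final_score(test_points=72, homework_points=146)
    print(score, final_grade(score))
```

For example, 146 homework points give 8 bonus points, so a written test worth 72 points leads to a final score of 80.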

Homework Assignments

Preliminaries:
- NoSQL server: nosql.felk.cvut.cz
- Login and password: sent by e-mail
Tools:
- PuTTY 0.70
- WinSCP 5.13
Submissions:
- Use sftp or WinSCP to upload your submission files to the NoSQL server
- Put these files into a directory ~/assignments/name/, where name is a name of a given homework
- I.e. xpath, xquery, sparql, mapreduce, riak, redis, cassandra, mongodb, neo4j (case sensitive)
- Use ssh or PuTTY to open a remote shell connection to the NoSQL server
- Based on the instructions provided for a given homework assignment, verify that everything is working as expected
- Go to the ~/assignments/ directory and execute sudo submit_execute name, where name is once again the name of the homework
- Wait for the confirmation of success, otherwise your homework is not considered to be submitted
- Should any complications appear, send your solution by e-mail to prokoyul@fel.cvut.cz
- Just for your convenience, you can check the submitted files in the ~/submissions/ directory
- Once the homework is assessed, you will find comments in this directory, too
Requirements:
- Respect the prescribed names of individual files to be submitted (case sensitive)
- Place all the files in the root directory of your submission
- Do not include shared libraries or files that are not requested
- I.e. do not submit files that were not explicitly requested
- Do not redirect or suppress either the standard output or the error output in your shell scripts
- All your files must be syntactically correct and executable without errors

1: XPath

Points: 15
Assignment:
- Create an XML document with sample data from the domain of your individual topic
  - Work with mutually interlinked entities of at least 3 different types (e.g. lines, flights and tickets)
  - Insert data about at least 15 particular entities (e.g. 3 lines, 4 flights, 8 tickets)
- Create expressions for exactly 5 different XPath queries (i.e. not more, not less)
- Use each of the following constructs at least once
  - Axes: descendant or descendant-or-self or // abbreviation
  - Axes: ancestor(-or-self) or preceding(-sibling) or following(-sibling)
  - Predicates (all of the following): path expression (existence test), position testing, value comparison, general comparison
Requirements:
- Both the XML document and queries must be well-formed (i.e. syntactically correct)
- Put each XPath expression into a standalone file (e.g. xpath1.xp)
- Always add a comment describing the intended query meaning in natural language via (: comment :)
- Each query expression must be evaluated to a non-empty sequence
Submission:
- data.xml: XML document with your data to be queried
- xpath1.xp, …, xpath5.xp: files with XPath expressions
Execution:
- Execute the following shell command to evaluate each individual XPath query expression
  - saxonb-xquery -s $DataFile $QueryFile
  - $DataFile is the input XML document to be queried, i.e. data.xml
  - $QueryFile is a file with query expression to be evaluated, e.g. xpath1.xp
Tools:
- VideLibri XPath and XQuery Processor
- Code Beautify XML Validator
References:
- XML: Extensible Markup Language (XML) 1.1 (Second Edition) - W3C Recommendation (16 August 2006)
- XPath: XML Path Language (XPath) 3.1 - W3C Recommendation (21 March 2017)
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 9. 10. 2022 until 23:59
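
To get a feeling for the required predicate kinds before writing real queries, the following Python sketch evaluates a few path expressions with the standard xml.etree.ElementTree module. It is an assumption-laden illustration only: ElementTree implements a small subset of XPath (no ancestor, preceding or following axes and no general comparisons), the element and attribute names (airline, line, flight, ticket, seat, passenger) are hypothetical, and the homework itself is evaluated by saxonb-xquery as described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data with three interlinked entity types.
XML = """
<airline>
  <line id="L1"><from>PRG</from><to>DXB</to></line>
  <flight id="F1" line="L1">
    <date>2022-10-03</date>
    <ticket id="T1"><passenger>Peter</passenger><seat>42A</seat></ticket>
    <ticket id="T2"><passenger>John</passenger></ticket>
  </flight>
  <flight id="F2" line="L1"><date>2022-10-04</date></flight>
</airline>
"""

root = ET.fromstring(XML)

# Descendant step (.//) with an existence test: tickets that have a seat.
with_seat = [t.get("id") for t in root.findall(".//ticket[seat]")]

# Position test: the first ticket of each flight.
first_tickets = [t.get("id") for t in root.findall(".//flight/ticket[1]")]

# Value comparison on an attribute: flights operated on line L1.
on_l1 = [f.get("id") for f in root.findall(".//flight[@line='L1']")]

# Value comparison on child text: tickets issued to Peter.
peter = [t.get("id") for t in root.findall(".//ticket[passenger='Peter']")]

if __name__ == "__main__":
    print(with_seat, first_tickets, on_l1, peter)
```

The reverse and sibling axes required by the assignment have no equivalent in this subset, so they have to be tried in a full XPath processor such as the tools listed above.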

2: XQuery

Points: 15
Assignment:
- Create an XML document with sample data from the domain of your individual topic
  - Work with mutually interlinked entities of at least 3 different types (e.g. lines, flights and tickets)
  - Insert data about at least 15 particular entities (e.g. 3 lines, 4 flights, 8 tickets)
- This document may (or may not) be identical to the one from the previous assignment on XPath
- Create expressions for exactly 5 different XQuery queries (that cannot be expressed solely using XPath)
- Use each of the following constructs at least once
  - Direct or computed constructor
  - FLWOR expression (with at least one for, let, where and order by clauses)
  - Aggregate function (min, max, avg or sum)
  - Conditional expression
  - Existential or universal quantifier
Requirements:
- Both the XML document and queries must be well-formed (i.e. syntactically correct)
- Put each XQuery expression into a standalone file (e.g. xquery1.xq)
- Always add a comment describing the intended query meaning in natural language via (: comment :)
- Each query expression must be evaluated to a non-empty sequence
Submission:
- data.xml: XML document with your data to be queried
- xquery1.xq, …, xquery5.xq: files with XQuery expressions
Execution:
- Execute the following shell command to evaluate each individual XQuery query expression
  - saxonb-xquery -s $DataFile $QueryFile
  - $DataFile is the input XML document to be queried, i.e. data.xml
  - $QueryFile is a file with query expression to be evaluated, e.g. xquery1.xq
Tools:
- VideLibri XPath and XQuery Processor
- Code Beautify XML Validator
References:
- XML: Extensible Markup Language (XML) 1.1 (Second Edition) - W3C Recommendation (16 August 2006)
- XQuery: XQuery 3.1: An XML Query Language - W3C Recommendation (21 March 2017)
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 16. 10. 2022 until 23:59

3: SPARQL

Points: 20
Assignment:
- Create a TTL document with sample RDF triples within your individual topic
  - Use the Turtle notation in particular
  - Work with mutually interlinked resources of at least 3 different types (e.g. lines, flights and tickets)
  - Insert data about at least 15 particular resources (e.g. 3 lines, 4 flights, 8 tickets)
- Use each of the following constructs at least once
  - Object list or predicate-object list
  - Blank nodes (either using _ prefix or brackets [])
- Create expressions for exactly 5 different SPARQL queries (SELECT query form in particular)
- Use each of the following constructs at least once
  - Basic graph pattern
  - Group graph pattern
  - Optional graph pattern (OPTIONAL)
  - Alternative graph pattern (UNION)
  - Difference graph pattern (MINUS)
  - FILTER constraint
  - Aggregation (GROUP BY with or without HAVING clause)
  - Sorting (ORDER BY clause)
Requirements:
- Both TTL document and queries must be well-formed (i.e. syntactically correct)
- Put each SPARQL query expression into a standalone file (e.g. query1.sparql)
- Always add a comment describing the intended query meaning in natural language via # comment
- Each query expression must be evaluated to a non-empty solution sequence
- Both the data file and query files must contain declarations of all prefixes used, including rdf: and similar
  - Use @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . in your data file
  - Use PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> in your query file
- Do not use FROM clauses in your queries, the input data file will automatically be accessible as the default graph
Submission:
- data.ttl: TTL document with your RDF data to be queried
- query1.sparql, …, query5.sparql: files with SPARQL query expressions
Execution:
- Execute the following shell command to evaluate each individual SPARQL query expression
  - sparql --data $DataFile --query $QueryFile
  - $DataFile is the input RDF document to be queried, i.e. data.ttl
  - $QueryFile is a file with query expression to be evaluated, e.g. query1.sparql
Tools:
- IDLab Turtle Validator
References:
- RDF: RDF 1.1 Concepts and Abstract Syntax - W3C Recommendation (25 February 2014)
- TTL: RDF 1.1 Turtle: Terse RDF Triple Language - W3C Recommendation (25 February 2014)
- SPARQL: SPARQL 1.1 Query Language - W3C Recommendation (21 March 2013)
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 23. 10. 2022 until 23:59

4: MapReduce

Points: 20
Assignment:
- Create an input text file with sample data from the domain of your individual topic
  - Insert realistic and non-trivial data about at least 10 entities of one type
  - Put each of these entities on a separate line, i.e. assume that each line of the input file yields one input record
  - Organize the actual entity attributes in whatever way you are able to easily parse
  - E.g. Medvídek 2007 53 100 Trojan Macháček Vilhelmová corresponding to a pattern Movie Year Rating Length Actors…
- Implement a non-trivial MapReduce job
  - Choose from aggregation, grouping, filtering or any other general MapReduce usage pattern
  - Use WordCount.java source file as a basis for your own implementation
  - Both the Map and Reduce functions should be non-trivial, each about 10 lines of code
  - It is not necessary to implement the Combine function
- Comment the source file and also provide a description of the problem you are solving
- You may also create a shell script that allows for the execution of your entire MapReduce job
  - I.e. compile source files, deploy input file, execute the actual job, retrieve its result, …
  - However, this script is not supposed to be submitted and serves just for your own convenience
  - Even if you do so, it will not be used for the purpose of homework assessment in any way
Requirements:
- You may split your MapReduce job implementation into multiple Java source files
  - They all must be located in the submission root directory
  - At least MapReduce.java source file with its public MapReduce class is required
  - This class is expected to represent the main class of the entire MapReduce job
- Do not change the way command line arguments are processed
  - I.e. the only two arguments represent the input and output HDFS locations respectively
- Do not use packages in order to organize your Java source files
- Assume that only hadoop-common-3.1.1.jar and hadoop-mapreduce-client-core-3.1.1.jar libraries will be linked with your project
- Do not submit your Netbeans (or any other) project directory, do not submit Hadoop (or any other) libraries
- Use Java Standard Edition version 7 or newer
- You are free to use your /user/f221_login/ HDFS home directory for debugging
  - Homework assessment will take place in a different dedicated HDFS directory
Submission:
- readme.txt: description of the input data structure and objective of the MapReduce job
- input.txt: text file with your sample input data (i.e. only one input file is permitted)
- MapReduce.java and possibly additional *.java: Java source files with your MapReduce implementation
- output.txt: expected output of your MapReduce job (i.e. submit the result of the execution you performed by yourself)
Tools:
- Apache Hadoop 3.1.1 (installed on the NoSQL server)
References:
- HDFS: Hadoop File System Shell commands
- MapReduce: MapReduce Tutorial
- MapReduce: MapReduce Commands Guide
- Hadoop: Hadoop JavaDoc API Documentation
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Monday 7. 11. 2022 until 23:59
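
Before writing the Java job, it can help to prototype the data flow locally. The sketch below simulates the map, shuffle and reduce phases in plain Python for the line pattern Movie Year Rating Length Actors… from the assignment. It is not a Hadoop program and cannot be submitted; the second input line is made-up sample data, and the chosen task (average movie rating per actor) is just one hypothetical grouping job.

```python
from collections import defaultdict

# First line is the example from the assignment; the second is made-up sample data.
LINES = [
    "Medvídek 2007 53 100 Trojan Macháček Vilhelmová",
    "MovieB 2000 84 103 Macháček Trojan",
]


def map_line(line):
    """Map: emit one (actor, rating) pair per actor of the movie."""
    title, year, rating, length, *actors = line.split()
    for actor in actors:
        yield actor, int(rating)


def reduce_actor(actor, ratings):
    """Reduce: average rating of all movies the actor played in."""
    return actor, sum(ratings) / len(ratings)


def run_job(lines):
    """Simulate the shuffle phase by grouping map outputs by key."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_actor(k, v) for k, v in sorted(groups.items()))


if __name__ == "__main__":
    for actor, avg in run_job(LINES).items():
        print(actor, avg)
```

In the real job, map_line corresponds to the Mapper class, reduce_actor to the Reducer class, and the grouping in run_job is done by the Hadoop framework itself.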

5: Redis

Points: 10
Assignment:
- Create a script (ordinary text file) with a sequence of commands working with Redis
- Illustrate you can work with all data types (strings, lists, sets, sorted sets and hashes)
- In particular, perform all the following operations:
  - Strings: 5 insertions (SET), 1 read (GET), 1 update (APPEND, SETRANGE, INCR, …), 1 removal (DEL).
  - Lists: 5 insertions (LPUSH, RPUSH, …), 2 different reads (LPOP, RPOP, LINDEX, LRANGE), 1 removal (LREM).
  - Sets: 5 insertions (SADD), 2 different reads (SISMEMBER, SUNION, SINTER, SDIFF), 1 removal (SREM).
  - Sorted sets: 5 insertions (ZADD), 1 read (ZRANGE, ZRANGEBYSCORE), 1 update (ZINCRBY), 1 removal (ZREM).
  - Hashes: 5 insertions (HSET, HMSET), 2 different reads (HGET, HMGET, HKEYS, HVALS, …), 1 removal (HDEL).
- Your database (i.e. keys and values) as well as commands must be realistic and within your individual topic
  - E.g. use a hash to store a mapping from seats to passengers for each flight
  - HMSET seat-map-EK140-20171121 42A Peter 65F John
  - Key seat-map-EK140-20171121 is composed from a fixed prefix (seat-map), flight number (EK140) and date of departure (20171121)
  - The actual mapping contains pairs of seat numbers and passenger names, e.g. 42A Peter
- Add comments to your script using the ECHO command
  - Describe at least the intended structure of your keys and values in natural language
Requirements:
- Only use the database you are supposed to use when working on the assignment
  - Your database number is in the gray column in the table with points
- Do not switch to your database when you are inside your script
  - I.e. do not use a SELECT command to change the active database from within the script
  - Specify the intended database number outside your script using command line options (see below)
- Note that a different dedicated database will be used when assessing your homework
  - You can assume that this database will be completely empty at the beginning
Submission:
- script.txt: text file with Redis database commands
Execution:
- Execute the following shell command to evaluate the whole Redis script
  - cat $ScriptFile | redis-cli -n $DatabaseNumber
  - $ScriptFile is a file with Redis commands to be executed, i.e. script.txt
  - $DatabaseNumber is a number of database to be used, e.g. 5
Tools:
- Redis 3.2.4 (installed on the NoSQL server)
References:
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Monday 7. 11. 2022 until 23:59
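
The key naming scheme from the seat map example above can be generated programmatically when preparing a larger script. The following Python sketch only builds command text for script.txt and does not talk to Redis; the helper names seat_map_key and hmset_command are hypothetical.

```python
def seat_map_key(flight_number: str, departure: str) -> str:
    """Compose a key from the fixed prefix, flight number and departure date."""
    return f"seat-map-{flight_number}-{departure}"


def hmset_command(key: str, mapping: dict) -> str:
    """Render one HMSET line with seat/passenger pairs in insertion order."""
    fields = " ".join(f"{seat} {name}" for seat, name in mapping.items())
    return f"HMSET {key} {fields}"


if __name__ == "__main__":
    key = seat_map_key("EK140", "20171121")
    print(hmset_command(key, {"42A": "Peter", "65F": "John"}))
    # prints: HMSET seat-map-EK140-20171121 42A Peter 65F John
```

Keeping the prefix, flight number and date in a fixed order makes it easy to describe the key structure in the ECHO comments required by the assignment.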

6: Cassandra

Points: 15
Assignment:
- Create a script (ordinary text file) with a sequence of CQL statements working with Cassandra database
- Define a schema for 2 tables for entities of different types
  - Define at least one column for each of the following data types: tuple, list, set and map
- Insert about 5 rows into each of your tables
- Express at least 3 update statements
  - You must perform replace, add and remove primitive operations (all of them) on columns of all collection types (all of them)
  - I.e. you must involve at least altogether 9 different primitive operations on such columns
- Express 3 select statements
  - Use WHERE and ORDER BY clauses at least once (both of them)
  - Use ALLOW FILTERING in a query that cannot be evaluated without this instruction
- Create at least 1 secondary index
Requirements:
- Only use your own keyspace when working on the assignment
  - Name of this keyspace must be identical to your login name (f221_login)
  - Do not create this keyspace in your script (assume it already exists)
- Do not switch to your keyspace when you are inside your script
  - I.e. do not execute a USE command to change the active keyspace from within the script
  - Specify the intended keyspace outside your script using command line options (see below)
- Note that a different dedicated keyspace will be used when assessing your homework
  - You can assume that this keyspace will be completely empty at the beginning
Comments:
- The following error messages can be ignored:
  - Error from server: code=1300 [Replica(s) failed to execute read]…
Submission:
- script.cql: text file with CQL statements
Execution:
- Execute the following shell command to evaluate the whole CQL script
  - cqlsh -k $KeyspaceName -f $ScriptFile
  - $KeyspaceName is a name of keyspace that should be used (must already exist), e.g. f221_login
  - $ScriptFile is a file with CQL queries to be executed, i.e. script.cql
Tools:
- Apache Cassandra 3.11.1 (installed on the NoSQL server)
References:
- The Cassandra Query Language (CQL)
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Monday 14. 11. 2022 until 23:59

7: MongoDB

Points: 20
Assignment:
- Create a JavaScript script with a sequence of commands working with MongoDB database
- Explicitly create 2 collections for entities of different types
  - I.e., create them using createCollection method
- Insert about 5 documents into each one of them
  - These documents must be realistic, non-trivial, and with both embedded objects and arrays
  - Interlink the documents using references
  - Use insert operation at least once
- Express 3 update operations (do not use save operation for this purpose)
  - One without update operators
  - One with at least 2 different update operators
  - One based on the upsert mode
- Express 5 find queries (with non-trivial selections)
  - Use at least one logical operator ($and, $or, $not)
  - Use $elemMatch operator on array fields at least once
  - Use both positive and negative projection (each at least once)
  - Use sort modifier
  - Describe the real-world meaning of all your queries in comments
- Express 1 MapReduce query (non-trivial, i.e. not easily expressed using ordinary find operation)
  - Describe its meaning, contents of intermediate key-value pairs and the final output
  - Note that reduce function must be associative, commutative, and idempotent
Requirements:
- Call export LC_ALL=C in case you have difficulties in launching the mongo shell
- Only use your own database when working on the assignment
  - Name of this database must be identical to your login name (f221_login)
- Do not switch to your database when you are inside your script
  - I.e. do not execute USE database or db.getSiblingDB('database') commands
  - Specify the intended database outside your script using command line options (see below)
- Note that a different dedicated database will be used when assessing your homework
  - You can assume that this database will be completely empty at the beginning
- Print the output of your queries (find operations)
  - Use db.collection.find().forEach(printjson); approach for this purpose
- Print the output of your MapReduce job using out: { inline: 1 } option
  - I.e. do not redirect the output into a standalone collection
Submission:
- script.js: JavaScript script with MongoDB database commands
Execution:
- Execute the following shell command to evaluate the whole MongoDB script
  - mongosh "mongodb://nosql.felk.cvut.cz:42222/$database" -u $username -p $password --authenticationDatabase admin < $file
  - $username is your username, e.g. f221_login
  - $database is the database to connect to (same as your login)
  - $password is your password
  - $file is a file with MongoDB queries to be executed, i.e. script.js
Tools:
- MongoDB 6.0.1 (installed on the NoSQL server)
References:
- https://docs.mongodb.com
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Monday 28. 11. 2022 until 23:59
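
The requirement that the reduce function be associative, commutative and idempotent exists because the function may be called repeatedly on its own partial results. The Python sketch below is only a conceptual check, not MongoDB code: it compares a direct reduction with a re-reduced, a reordered and a repeated one. The function names are hypothetical.

```python
def reduce_sum(key, values):
    """A sum can safely be applied again to its own partial results."""
    return sum(values)


def reduce_avg(key, values):
    """A naive average breaks when applied to partial averages."""
    return sum(values) / len(values)


def is_rereducible(reduce_fn, key, values):
    """Check the three properties on one concrete list of values."""
    head, tail = values[:2], values[2:]
    direct = reduce_fn(key, values)
    re_reduced = reduce_fn(key, [reduce_fn(key, head)] + tail)  # associativity
    reordered = reduce_fn(key, list(reversed(values)))          # commutativity
    repeated = reduce_fn(key, [direct])                         # idempotence
    return direct == re_reduced == reordered == repeated


if __name__ == "__main__":
    print(is_rereducible(reduce_sum, "k", [1, 2, 6]))  # True
    print(is_rereducible(reduce_avg, "k", [1, 2, 6]))  # False
```

A common way to compute an average correctly is to emit and reduce pairs of sum and count, and to divide only at the very end.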

8: MongoDB-2

Points: 15
Assignment:
- Create a JavaScript script with a sequence of commands working with MongoDB database
- Use 2 created collections for entities of different types
- If necessary, insert more documents into each one of them
- Express 5 aggregate operations
  - Use at least once $match, $group, $sort, $project (or $addFields), $skip and $limit stages
  - Use at least once $sum (or $avg), $count, $min (or $max), $first (or $last) aggregators
  - Describe the real-world meaning of all your queries in comments
Requirements:
- Only use your own database when working on the assignment
  - Name of this database must be identical to your login name (f221_login)
- Do not switch to your database when you are inside your script
  - I.e. do not execute USE database or db.getSiblingDB('database') commands
  - Specify the intended database outside your script using command line options (see below)
Submission:
- script.js: JavaScript script with MongoDB database commands
- Create a folder for the submission: mkdir -p ~/assignments/mongodb2
- Put the script there: ~/assignments/mongodb2/script.js
- Submit the homework: cd ~/assignments/mongodb2 and then sudo submit_execute mongodb2

Execution:
- Execute the following shell command to evaluate the whole MongoDB script
  - mongosh "mongodb://nosql.felk.cvut.cz:42222/$database" -u $username -p $password --authenticationDatabase admin < $file
  - $username is your username, e.g. f221_login
  - $database is the database to connect to (same as your login)
  - $password is your password
  - $file is a file with MongoDB queries to be executed, i.e. script.js
Tools:
- MongoDB 6.0.1 (installed on the NoSQL server)
References:
- https://docs.mongodb.com
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Monday 5. 12. 2022 until 23:59

Extra homework on MongoDB

Points: 10
Assignment: see the Google document
Deadline: Friday 6. 1. 2023 until 23:59

9: Neo4j

Points: 20
Assignment:
- Insert realistic nodes and relationships into your embedded Neo4j database
  - Use a single CREATE statement for this purpose
  - Insert altogether at least 10 nodes for entities of at least 2 different types (i.e. different labels)
  - Insert altogether at least 15 relationships of at least 2 different types
  - Include properties (both for nodes and relationships)
  - Associate all your nodes with user-defined identifiers
- Express 5 Cypher query expressions
  - Use at least once MATCH, OPTIONAL MATCH, RETURN, WITH, WHERE, and ORDER BY (sub)clauses (all of them)
  - Aggregation in at least one query
Requirements:
- Describe the meaning of your Cypher expressions in natural language (via // comment)
Submission: send by email
- data file (text file with inserted data), queries.cypher: text file with a sequence of Cypher statements (including CREATE) and screenshots/video of execution
Execution:
- Execute the following shell command to evaluate the whole Neo4j script
  - cypher-shell -f $ScriptFile
  - $ScriptFile is a file with Cypher queries to be executed, i.e. queries.cypher
Tools:
- Neo4j 3.0.7 (installed on the NoSQL server)
References:
- Cypher query language
- Cypher Reference Card
Deadline: Monday 12. 12. 2022 until 23:59

Individual Topics

Please, fill in your name and surname near one of the topics in the DS2 topics table or add your own topic at the bottom of the DS2 topics table.
Try to propose your own original topic in the first place
You can also get inspired by the following topics (in alphabetical order, in English and in Czech)
- Access system, Accommodation booking, Accommodation comparator, Accommodation sharing, Agricultural production, Air rescue service, Air traffic management, Airline, Airport, Armory, Army, Artworks, Assignment submission, ATM network, Attendance system, Auction, Bakery, Bank, Bank account, Bazaar, Beekeeper, Betting shop, Beverages store, Bike sharing, Black market, Blog, Boat rental, Bookstore, Botanic garden, Brewery, Building materials store, Bus station, Bus tickets, Business register, Cadastre, Cafe, Canteens, Car rental, Car repair shop, Car showroom, Casino, Castles, Catering, Caves, Cemetery, Cinema, City tours, Classbook, Collection and disposal of waste, Collection of laws, College dorm, Computer games, Conference, Construction management, Content management system, Contract register, Convenience store, Cookbook, Cooking classes, Council meetings, Countries of the world, Courier service, Cowshed, Dance school, Deliveries, Desk games, Discussion forum, Doctor's office, Dog park, Dog shelter, Driving school, Drugs, Dump, Educational institution, Elections, Electronic prescriptions, Employee records, Empty houses, Entertainment center, Environmental center, Exhibition, Exhibition grounds, Experience donation, Fairy tales, Farmer markets, Finance manager, Financial advisory, Financial markets, Fire protection, Fishing equipment, Fitness center, Flat owners association, Fleet, Flight ticket booking, Food bank, Food distribution, Football league, Football team, Forest kindergarten, Forwarding company, Foster care, Gallery, Garden center, Gardening colony, Gas station, Glassworks, Golf clubs, Grant agency, Grid, Hair salon, Handyman, Hardware, Health insurance, High school, Highway fees, Hiking trails, Hobby market, Hockey league, Holiday offers, Horse racing, Hospital, Hotel, Housing association, Chamber of deputies, Chess club, Chess competition, Chess database, Incinerator, Industrial zone, Insurance company, Intelligence service, Intersport arena, Job offers, Jurassic park, Kindergarten, Laboratory, Labour office, Language school, Lego, Leisure activities, Library, Log book, Logistics center, Logistics company, Logistics warehouse, Lottery, Luggage storage, Manufacturing processes, Maternity hospital, Medical reimbursement, Meeting scheduling, Menu, Metro operation, Military area, Mobile operator, Mobile phones, Model trains, Morgue, Mountain rescue service, Movies, Multinational company, Multiplex network, Museum, Music festival, Music production, Musical instruments, National parks, Nature reserve, Newspaper publishing, Non-bank loans, Nuclear power plant, Nutritional values, Online exercises, Online streaming service, Orienteering, Outdoor swimming pool, Parking lot, Parts catalog, Patient medical card, Pawnshop, Payment cards, Personal documents, Personal trainer, Pharmacy, Photo album, Pizzeria, Plagiarism detection, Planning calendar, Police database, Political parties, Popular music, Population register, Post, Postal addresses, Poultry farming, Prestashop, Prison, Procurement, Project management, Property administration, Psychiatric hospital, Public greenery, Public transport, Railway network, Real estate agency, Recruitment agency, Refugee camp, Registration of sales, Regulatory fees, Research projects, Research publications, Restaurant, Restaurant reservations, Road closures, Room reservation, Scout group, Scrapyard, Security agency, Seizures, Shared travel, Shooting range, Shopping center, Ski school, Skiing area, Sobering-up cell, Social benefits, Social network, Software development, Spare parts, Sports club, Sports tournament, Stable, Statement of work, Stock exchange, Student book, Study abroad, Study materials, Study system, Subsidy programs, Summer camp, Supermarket, Sweet-shop, Swimming pool, Symphony orchestra, Tax office, Taxi service, Teahouse, Theater, Theater plays, Time tables, Tollgates, Tourism, Tourist group, Traffic accidents, Traffic control center, Train station, Transport company, Transport control, Travel agency, Trial, Truck transport, TV program, TV series, Universe, Vaccination abroad, Veterinary clinic, Video shop, Virtual tours, Visas, War conflicts, Water park, Water supply, Weapons, Weather forecast, Webhosting, Webshop, Wedding dress rental, Wholesale, Winter road cleaning, World heritage list, Zoning plan, Zoo
- Adresní místa, Aquapark, Armáda, Aukce, Autobusové nádraží, Autosalon, Autoškola, Banka, Bankovní účet, Bazar, Bezpečnostní agentura, Blog, Botanická zahrada, Burza, Bytové družstvo, Catering, Cestovní kancelář, Cukrárna, Cvičiště pro psy, Čajovna, Černý trh, Čerpací stanice, Dálniční poplatky, Darování zážitků, Deskové hry, Detekce plagiátů, Diskuzní fórum, Divadelní hry, Divadlo, Dodávka vody, Docházkový systém, Dopravní dispečink, Dopravní nehody, Dopravní podnik, Dopravní uzavírky, Doručování zásilek, Dotační programy, Ekologické centrum, Elektronická evidence tržeb, Elektronické recepty, Evidence smluv, Evidence součástek, Evidence zaměstnanců, Exekuce, Farmářské trhy, Filmy, Finanční poradenství, Finanční trhy, Finanční úřad, Fitness centrum, Fotbalová liga, Fotbalový tým, Fotoalbum, Galerie, Golfové kluby, Grantová agentura, Hardware, Hobby market, Hodinový manžel, Hokejová liga, Horská služba, Hotel, Hrady a zámky, Hřbitov, Hudební festival, Hudební nástroje, Hudební produkce, Jaderná elektrárna, Jazyková škola, Jazykové pobyty, Jednání zastupitelstva, Jeskyně, Jídelníček, Jízdenky na autobus, Jízdní řády, Jurský park, Kadeřnický salon, Kamionová doprava, Kasino, Katastr nemovitostí, Kavárna, Kino, Kniha jízd, Knihkupectví, Knihovna, Konference, Koňské dostihy, Koupaliště, Kravín, Kuchařka, Kurýrní služba, Kurzy vaření, Laboratoř, Lékárna, Lékařská karta pacienta, Léky, Lesní školka, Letecká společnost, Letecká záchranná služba, Letiště, Letní tábor, Logistická firma, Logistické centrum, Logistický sklad, Loterie, Lyžařská škola, Lyžařský areál, Márnice, Mateřská škola, Menzy, Městská hromadná doprava, Městské exkurze, Mobilní operátor, Mobilní telefony, Modely vláčků, Multifunkční aréna, Muniční sklad, Muzeum, Mýtné brány, Nabídky dovolené, Nabídky práce, Nadnárodní společnost, Náhradní díly, Národní park, Nebankovní půjčky, Nemocnice, Nutriční hodnoty, Obchodní centrum, Obchodní rejstřík, Očkování do ciziny, Odevzdávání úkolů, Online cvičení, Online 
půjčovna seriálů, Ordinace lékaře, Orientační běh, Osobní doklady, Osobní trenér, Parkoviště, Pekárna, Personální agentura, Pěstounská péče, Pivovar, Pizzerie, Plánovací kalendář, Plánování schůzek, Platební karty, Plavecký bazén, Pneuservis, Počítačové hry, Pohádky, Pojišťovna, Policejní databáze, Politické strany, Populární hudba, Porodnice, Poslanecká sněmovna, Pošta, Potravinová banka, Požární ochrana, Pracovní úřad, Prázdné domy, Prestashop, Provoz metra, Průmyslová zóna, Předpověď počasí, Přepravní kontrola, Přírodní rezervace, Přístupový systém, Psí útulek, Psychiatrická léčebna, Půjčovna auta, Půjčovna lodí, Půjčovna svatebních šatů, Realitní agentura, Redakční systém, Registr obyvatel, Regulační poplatky, Restaurace, Rezervace letenek, Rezervace místností, Rezervace ubytování, Rezervace v restauraci, Rozvodná síť, Rozvoz jídla, Rybářské potřeby, Řízení leteckého provozu, Řízení projektů, Sázková kancelář, Sbírka zákonů, Sdílená kola, Sdílené cestování, Síť bankomatů, Síť multikin, Skautské středisko, Sklad nápojů, Skládka, Sklárna, Sociální dávky, Sociální síť, Soudní řízení, Spalovna, Spediční firma, Společenství vlastníků jednotek, Sportovní klub, Sportovní turnaj, Správa objektů, Správce financí, Srovnávač ubytování, Stáj, Státy světa, Stavební řízení, Stavebnice lego, Stavebniny, Střední škola, Střelnice, Studijní materiály, Studijní systém, Supermarket, Světové dědictví, Svoz a likvidace odpadů, Symfonický orchestr, Šachová databáze, Šachová soutěž, Šachový klub, Taneční škola, Taxi služba, Televizní program, Televizní seriály, Třídní kniha, Turistické cesty, Turistický oddíl, Turistický ruch, Ubytování v soukromí, Uprchlický tábor, Úschovna zavazadel, Územní plán, Válečné konflikty, Včelař, Večerka, Vědecké projekty, Vědecké publikace, Velkochov drůbeže, Velkoobchod, Veřejná zeleň, Veřejné zakázky, Vesmír, Veterinární klinika, Vězení, Videopůjčovna, Virtuální prohlídky, Víza, Vlakové nádraží, Vojenský prostor, Volby, Volnočasové aktivity, Vozový 
park, Vrakoviště, Vydavatelství novin, Výkaz práce, Výrobní procesy, Vysokoškolská kolej, Výstava, Výstaviště, Výtvarná díla, Vývoj softwaru, Vzdělávací instituce, Webhosting, Webový obchod, Zábavní centrum, Zahrádkářská kolonie, Zahradnictví, Záchytka, Zastavárna, Zbraně, Zdravotní pojišťovna, Zdravotní úhrady, Zemědělská výroba, Zimní úklid komunikací, Zoologická zahrada, Zpravodajská služba, Žákovská knížka, Železniční síť
Nevertheless, the following topics are not allowed this semester:
- Movies, actors

Exam Requirements

For the online exam:
- Use Zoom
- You must turn on the camera

For the written exam:
- You can use paper or your laptop; the latter is preferable
- The time limit is strict

NoSQL Introduction

Big Data and NoSQL terms, V characteristics (volume, variety, velocity, veracity, value, validity, volatility), current trends and challenges (Big Data, Big Users, processing paradigms, …), principles of relational databases (functional dependencies, normal forms, transactions, ACID properties); types of NoSQL systems (key-value, wide column, document, graph, …), their data models, features and use cases; common features of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, …)

Data Formats

XML: constructs (element, attribute, text, …), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML
JSON: constructs (object, array, value), types of values (strings, numbers, …); BSON: document structure (elements, type selectors, property names and values)
RDF: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); N-Triples notation (RDF file, statements, triple components, literals, IRI references); Turtle notation (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals)
CSV: constructs (document, header, record, field)

XML Databases

Native XML databases vs. XML-enabled relational databases; data model (XDM): tree (nodes for document, elements, attributes, texts, …), document order, reverse document order, sequences, atomic values, singleton sequences
XPath language: path expressions (relative vs. absolute, evaluation algorithm), path step (axis, node test, predicates), axes (forward: child, descendant, following, …; reverse: parent, ancestor, preceding, …; attribute), node tests, predicates (path conditions, position testing, …), abbreviations
XQuery language: path expressions, direct constructors (elements, attributes, nested queries, well-formedness), computed constructors (dynamic names), FLWOR expressions (for, let, where, order by, and return clauses), typical FLWOR use cases (joining, grouping, aggregation, integration, …), conditional expressions (if, then, else), switch expressions (case, default, return), universal and existential quantified expressions (some, every, satisfies), comparisons (value, general, node; errors), atomization of values (elements, attributes)

RDF Stores

Linked Data: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud
SPARQL: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), graph patterns (basic, group, optional, alternative, graph, minus); prologue declarations (BASE, PREFIX clauses), SELECT queries (SELECT, FROM, and WHERE clauses), query dataset (default graph, named graphs), variable assignments (BIND), FILTER constraints (comparisons, logical connectives, accessors, tests, …), solution modifiers (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), query forms (SELECT, ASK, DESCRIBE, CONSTRUCT)

MapReduce

Programming models, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism)
MapReduce: programming model (data parallelism, map and reduce functions), cluster architecture (master, workers, message passing, data distribution), map and reduce functions (input arguments, emission and reduction of intermediate key-value pairs, final output), data flow phases (mapping, shuffling, reducing), input parsing (input file, split, record), execution steps (parsing, mapping, partitioning, combining, merging, reducing), combine function (commutativity, associativity), additional functions (input reader, partition, compare, output writer), implementation details (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, …)
Apache Hadoop: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, …); HDFS: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, …); MapReduce: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema
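
As a study aid for the data flow phases listed above (mapping, shuffling, reducing), the following is a minimal single-process sketch in Python. It is not Hadoop code and not part of the course materials; the names `map_fn`, `reduce_fn`, `shuffle`, and `run_job` are hypothetical, and the word-count task is chosen only because it is the usual introductory example.

```python
from collections import defaultdict
from itertools import chain


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in one record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: collapse all values of one key into a single output pair."""
    return word, sum(counts)


def run_job(records):
    """Run the three phases sequentially (a real cluster runs them in parallel)."""
    intermediate = chain.from_iterable(
        map_fn(offset, line) for offset, line in enumerate(records)
    )
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


print(run_job(["big data big users", "big data"]))
# {'big': 3, 'data': 2, 'users': 1}
```

Because summation is commutative and associative, the same `reduce_fn` could also serve as a combine function applied to the output of each map task before shuffling.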

NoSQL Principles

Scaling: scalability definition; vertical scaling (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, …); horizontal scaling (scaling out/in), pros and cons, network fallacies (reliability, latency, bandwidth, security, …), cluster architecture; design questions (scalability, availability, consistency, latency, durability, resilience)
Distribution models: sharding: idea, motivation, objectives (balanced distribution, workload, …), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, …); replication: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of sharding and replication
CAP theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, ACID properties (atomicity, consistency, isolation, durability), BASE properties (basically available, soft state, eventual consistency)
Consistency: strong vs. eventual consistency; write consistency (write-write conflict, context, pessimistic and optimistic strategies), read consistency (read-write conflict, context, inconsistency window, session consistency), read and write quorums (formulae, motivation, workload balancing)
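
The quorum formulae mentioned above can be checked with a few lines of Python. This sketch assumes the conditions as they are commonly stated (for example in NoSQL Distilled): with replication factor N, a write quorum W must satisfy W > N/2, and a read quorum R must satisfy R + W > N. The lecture slides may phrase them differently, and the function names are hypothetical.

```python
def write_quorum_ok(n, w):
    """A write is strongly consistent if a majority of the N replicas confirm it."""
    return w > n / 2


def read_quorum_ok(n, r, w):
    """A read sees the latest write if the read and write sets must overlap."""
    return r + w > n


n = 3  # replication factor (assumed example value)
for w in range(1, n + 1):
    for r in range(1, n + 1):
        print(f"N={n} W={w} R={r} "
              f"write_ok={write_quorum_ok(n, w)} read_ok={read_quorum_ok(n, r, w)}")
```

With N = 3, the setting W = 2, R = 2 satisfies both conditions, while W = 3, R = 1 shifts the cost from reads to writes. This trade-off is what workload balancing refers to.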

Key-Value Stores

Data model (key-value pairs), key management (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, …)
- Redis: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), string commands (SET, GET, APPEND, SETRANGE, INCR, DEL, …), list commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, …), set commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, …), sorted set commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, …), hash commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, …), general commands (EXISTS, KEYS, DEL, RENAME, …), time-to-live commands (EXPIRE, TTL, PERSIST)

Wide Column Stores

Data model (column families, rows, columns), query patterns, use cases, representatives
Cassandra: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); CQL language: DDL statements: CREATE KEYSPACE (replication options), DROP KEYSPACE, USE keyspace, CREATE TABLE (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, …); literals (atomic, collections, …); DML statements: SELECT statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), INSERT statements (update parameters: TTL, TIMESTAMP), UPDATE statements (assignments; modification of collections: additions, removals), DELETE statements (deletion of rows, removal of columns, removal of items from collections)

Document Stores

Data model (documents), query patterns, use cases, representatives
MongoDB: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); insert operation (management of identifiers); update operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, …; array: push, pop, …); save operation (insert vs. replace mode); remove operation (justOne option); find operation: query conditions (value equality vs. query operators), query operators (comparison: eq, ne, …; element: exists; evaluation: regex, …; logical: and, or, not; array: all, elemMatch, …), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), modifiers (sort, skip, limit); MapReduce (map function, reduce function, options: query, sort, limit, out); primary and secondary index structures (index types: value, hashed, …; forms; properties: unique, partial, sparse, TTL)

Graph Databases

Data model (property graphs), use cases, representatives
Neo4j: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; traversal framework: traversal description, order (breadth-first, depth-first, branch ordering policies), expanders (relationship types, directions), uniqueness (NODE_GLOBAL, RELATIONSHIP_GLOBAL, …), evaluators (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, …; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); Cypher language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); path patterns, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); MATCH clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); RETURN clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); WITH clause (motivation, subclauses); write clauses: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions)

Advanced Aspects

Graph databases: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data structures (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph traversals, data locality (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), graph partitioning (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing
Performance tuning: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model
Polyglot persistence
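
For the performance tuning topics above, the two laws can be written as short Python functions. This is only an illustrative sketch using their standard textbook forms: Amdahl's law gives the speedup S = 1 / ((1 - p) + p / n) for a parallelizable fraction p on n workers, and Little's law gives the mean number of items in a system as L = λ · W (arrival rate times mean time in the system). The function names and sample numbers are assumptions made for illustration.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: overall speedup when only part of the work is parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def little_items_in_system(arrival_rate, time_in_system):
    """Little's law: L = lambda * W, the mean number of requests in the system."""
    return arrival_rate * time_in_system


# 90 % of the work parallelizes; 10 workers give roughly a 5.26x speedup,
# and no number of workers can exceed 1 / 0.1 = 10x.
print(round(amdahl_speedup(0.9, 10), 2))

# 200 requests/s with a mean latency of 0.05 s means about 10 requests in flight.
print(round(little_items_in_system(200, 0.05), 2))
```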

Recommended Literature

This course is based on materials by Martin Svoboda

Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: Big Data a NoSQL databáze.

ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).
Grada Publishing, a.s., 2015.

Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled.

ISBN: 978-0-321-82662-6.
Pearson Education, Inc., 2013.

Wiese, Lena: Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases.

ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).
DOI: 10.1515/9783110441413.
Walter de Gruyter GmbH, 2015.

Zomaya, Albert Y. - Sakr, Sherif: Handbook of Big Data Technologies.

ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).
DOI: 10.1007/978-3-319-49340-4.
Springer International Publishing AG, 2017.
