====== B4M36DS2, BE4M36DS2: Database Systems 2 ====== ===== Basic Information ===== * Annotations: [[https://www.fel.cvut.cz/cz/education/bk/predmety/47/02/p4702006.html|B4M36DS2]], [[https://www.fel.cvut.cz/en/education/bk/predmety/48/78/p4878406.html|BE4M36DS2]] (English) * Lecturer and tutor: **Yuliia Prokop** * Schedule: [[https://fel.cvut.cz/cz/education/rozvrhy-ng.B221/public/html/predmety/47/02/p4702006.html|B4M36DS2]], [[https://fel.cvut.cz/cz/education/rozvrhy-ng.B221/public/html/predmety/48/78/p4878406.html|BE4M36DS2]] * Lectures: **Monday 9:15 - 10:45 (KN:E-301)** (English) * Practical classes (group 101): **Monday 12:45 - 14:15 (KN:E-328)** (Czech) * Practical classes (group 102): **Monday 14:30 - 16:00 (KN:E-328)** (English) * Practical classes (group 103): **Monday 16:15 - 17:45 (KN:E-328)** (Czech) * [[https://docs.google.com/spreadsheets/d/1owu6qkcbv10GatuOHukvrzCo-8uJReFn9BHOicD3Dt8/edit?usp=sharing|Table with points]] from practical classes, homework assignments and exam tests UPDATED ===== Exam Dates ===== * **Thursday 12. 1. 2023**: 14:00 - 15:30 (online) **[[https://docs.google.com/spreadsheets/d/1TuU_PUxNrdQo386X0_r57n3NaJePw8-UAaymE6_EG58/edit?usp=sharing|Results 12/1/2023]]** UPDATED * Questions and (optional) oral examination - Monday 23. 01. 2023 : 9:15 - 12:00 (KN:E-301) or Wednesday 1. 2. 2023 : 9:15 - 12:00 (KN:E-328) * **Monday 16. 1. 2023**: 9:15 - 11:45 (KN:E-301) **[[https://docs.google.com/spreadsheets/d/1AoGoHuqv-q0p5L6_O5xaOk5JiB5Ny8uMKZxQA7NBrhM/edit?usp=sharing|Results 16/1/2023]]** * Questions and (optional) oral examination - Monday 23. 01. 2023 : 9:15 - 12:00 (KN:E-301) or Wednesday 1. 2. 2023 : 9:15 - 12:00 (KN:E-328) * **Monday 23. 1. 2023**: 9:15 - 11:45 (KN:E-301) **[[https://docs.google.com/spreadsheets/d/14CfAqY9Xa8VRTttPLk73mtWov1x32l2yFI9ULhWKkJI/edit?usp=sharing|Results 23/1/2023]]** * Questions and (optional) oral examination - Wednesday 1. 2. 2023 : 9:15 - 12:00 (KN:E-328) or Wednesday 15. 2. 2023: 9:15 - 11:45 (KN:E-328) * **Wednesday 1. 2. 2023**: 9:15 - 11:45 (KN:E-328) **[[https://docs.google.com/spreadsheets/d/114JF056Hyl-kI-UaUwYbVfCA_1r762LsEO4usxkQ77k/edit?usp=sharing|Results 1/2/2023]]** * Questions and (optional) oral examination - Wednesday 15. 2. 2023 : 9:15 - 12:00 (KN:E-328) * **Wednesday 15. 2. 2023**: 9:15 - 11:45 (KN:E-328)**[[https://docs.google.com/spreadsheets/d/1EfxY2bmSrPaFSV06kUkLMS1S031vUvx_eHD3j0JqF5Q/edit?usp=sharing|Results 15-16/2/2023]]** ===== Homework Deadlines ===== * 00 - Topic selection: **Monday 4. 10. 2022** until 23:59 * 01 - **[[#XPath|XPath]]**: **Monday 10. 10. 2022** until 23:59 * 02 - **[[#XQuery|XQuery]]**: **Monday 17. 10. 2022** until 23:59 * 03 - **[[#SPARQL|SPARQL]]**: **Monday 24. 10. 2022** until 23:59 * 04 - **[[#MapReduce|MapReduce]]**: **Monday 7. 11. 2022** until 23:59 * 05 - **[[#Redis|Redis]]**: **Monday 7. 11. 2022** until 23:59 * 06 - **[[#Cassandra|Cassandra]]**: **Monday 14. 11. 2022** until 23:59 * 07 - **[[#MongoDB|MongoDB]]**: **Monday 28. 11. 2022** until 23:59 * 08 - **[[#MongoDB-2|MongoDB-2]]**: **Monday 5. 12. 2022** until 23:59 * 09 - **[[#Neo4j|Neo4j]]**: **Monday 12. 12. 2022** until 23:59 ===== Lectures ===== * 19. 09. 2022: **01 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-01-introduction.pdf|Introduction]]**: Big Data, NoSQL Databases * 26. 09. 2022: **02 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-02-formats.pdf|Data Formats]]**: XML, JSON, BSON, RDF * 03. 10. 
2022: **03 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-03-xpath.pdf|XML Databases]]**: XPath * 10. 10. 2022: **04 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-04-xquery.pdf|XML Databases]]**: XQuery * 17. 10. 2022: **05 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-05-sparql.pdf|RDF Stores]]**: SPARQL * 24. 10. 2022: **06 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-06-mapreduce.pdf|Apache Hadoop]]**: MapReduce, HDFS * 31. 10. 2022: **07 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-07-principles.pdf|Basic Principles]]**: Scaling, Sharding, Replication, CAP Theorem, Consistency * 07. 11. 2022: **08 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-09-cassandra.pdf|Wide Column Stores]]**: Cassandra: CQL * 14. 11. 2022: **09 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-10-mongodb.pdf|Document Databases]]**: MongoDB * 21. 11. 2022: **10 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-10-mongodb2_2_.pdf|Document Databases]]**: MongoDB: Aggregation * 28. 11. 2022: **10 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-11-neo4j.pdf|Graph Databases]]**: Neo4j: Traversal Framework * 05. 12. 2022: **12 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-12-cypher.pdf|Graph Databases]]**: Neo4j: Cypher * 12. 12. 2022: **13 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lecture-13-advanced.pdf|Advanced Aspects]]**: Graph Databases, Performance Tuning * 09. 01. 2023: Cancelled ===== Practical Classes ===== * 19. 09. 2022: **00 - [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/b4m36ds2-lab-00-organization.pdf|Organization]]** * 26. 09. 2022: **01 - [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/b4m36ds2-lab-01-formats.pdf|Formats]]** * Tools: [[https://codebeautify.org/xmlvalidator|XML Editor]], [[https://codebeautify.org/jsonvalidator|JSON Editor]], [[http://ttl.summerofcode.be/|RDF Editor]] * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/lab-01-formats-solutions.zip|Solutions]] * 03. 10. 2022: **02 - [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/b4m36ds2-lab-02-xpath.pdf|XPath]]** * Data files: [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/data.xml.txt|data.xml]] * Tools: [[http://videlibri.sourceforge.net/cgi-bin/xidelcgi|XPath and XQuery Processor]] * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/queries.txt|Solutions]] * 10. 10. 2022: **03 - [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/b4m36ds2-lab-03-xquery.pdf|XQuery]]** * Data files: [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/data.xml.txt|data.xml]] * Tools: [[http://videlibri.sourceforge.net/cgi-bin/xidelcgi|XPath and XQuery Processor]] * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/queries.xq.txt|Solutions]] * 17. 10. 2022: **04 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lab-04-sparql.pdf|SPARQL]]** * Data files: [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/data.ttl.txt|data.ttl]] * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/queries.pdf|Solutions]] * SPARQL endpoint: https://nosql.opendata.cz/sparql * 24. 10. 
2022: **05 - [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/b4m36ds2-lab-05-mapreduce.pdf|MapReduce]]** * Source files: [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/wordcount.java|WordCount.java]], [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/invertedindex.java|InvertedIndex.java]] * See /home/DS2/mapreduce/ directory for input data and Hadoop libraries * 31. 10. 2022: **06 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lab-06-redis_.pdf|Redis]]** * 07. 11. 2022: **07 - [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/b4m36ds2-lab-08-cassandra.pdf|Cassandra]]** * 14. 11. 2022: **08 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lab-09-mongodb.pdf|MongoDB]]** * 21. 11. 2022: **09 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lab-10-mongodb.pdf|MongoDB]]** * Data file: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/data.js.txt|data.js]] * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/queries.js.txt|queries.js]] * 28. 11. 2022: **10 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lab-11-mongodb.pdf|MongoDB]]** * Data files: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/users.js.txt|users.js]], [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/checkins.js.txt|checkin.js]] * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/queries2.js.txt|queries.js]] * 05. 12. 2022: **11 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lab-11-neo4j.pdf|Neo4j]]** * Data files: [[https://cw.fel.cvut.cz/b221/_media/courses/be4m36ds2/data.cypher.txt|data.cypher]] * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/queries.cypher.txt|queries.cypher]] * 12. 12. 2022: **12 - [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/b4m36ds2-lab-12-neo4j.pdf|Neo4j]]** * Solutions: [[https://cw.fel.cvut.cz/wiki/_media/courses/be4m36ds2/myneo4japp.java.txt|MyNeo4jApp.java]] * 09. 01. 2023: //Cancelled// ===== Formal Requirements ===== * **Attendance** during lectures and practical classes is **recommended** but not compulsory * Altogether **9 individual homework assignments** will be given during the semester * Everyone must **choose** their **distinct topic**, not later than during the XPath practical class * This topic must be reported to and explicitly accepted by the lecturer in advance * Possible topics could be: library, cinema, cookbook, university, flights, etc. 
  * See the list below for additional suitable topics, or feel free to choose your own
  * Your homework solutions must be **within the topic, original, realistic, and non-trivial**
  * Solutions can only be submitted via a script executed on the corresponding server
  * **At most 150 points** in total can be gained **for all the homework assignments**
  * Solutions are awarded **up to 20, 15 or 10 points** respectively, depending on the assignment
    * In case of any shortcomings, fewer points will be awarded accordingly
  * Solutions can be submitted repeatedly, only **the latest version is assessed**
    * Once a given assignment is assessed by the lecturer, it cannot be resubmitted again
  * A **delay** of one whole day **is penalized** by 5 points; shorter delays are penalized proportionally
    * Should the delay be even longer, the penalty stays the same and does not increase further
  * All the homework assignments must be submitted before the intended exam date in order to be considered
  * None of the homework assignments is compulsory, yet you are encouraged to submit all of them
  * During some of the practical classes, **extra activity points** can be acquired, too
  * **At least 130 points** are required for the **course credit** to be granted
    * Half of all the points above this boundary are transferred as **bonus points** to the exam
  * Only students who have already acquired the course credit can sign up for the final exam
  * The **final exam** consists of a compulsory written test and an optional oral examination
  * **At most 100 points** can be acquired from the final **written test**
    * This test consists of a theoretical part (open and multiple-choice questions) and a practical part (exercises)
    * Having **less than 30% of the points** from either of the two parts **prevents you from passing the exam**
  * The **final score** corresponds to the **sum of the written test and bonus points**, if any
  * Based on the result, everyone can voluntarily choose to undergo an **oral examination**
    * The only condition is to have at least 50 points from the test and bonus points combined
    * In such a case, the final score is further adjusted by **between minus 10 and plus 5 points**
    * The oral examination can also be requested by the examiner in case of uncertainties in the test
  * Final grade: **90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E**

===== Homework Assignments =====

  * Preliminaries:
    * NoSQL server: **nosql.felk.cvut.cz**
    * Login and password: sent by e-mail
  * Tools:
    * [[https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html|PuTTY]] 0.70
    * [[https://winscp.net/|WinSCP]] 5.13
  * Submissions:
    * Use //sftp// or //WinSCP// to upload your submission files to the NoSQL server
    * Put these files into a directory //~/assignments/name///, where //name// is the name of the given homework
      * I.e. //xpath//, //xquery//, //sparql//, //mapreduce//, //riak//, //redis//, //cassandra//, //mongodb//, //neo4j// (case sensitive)
    * Use //ssh// or //PuTTY// to open a remote shell connection to the NoSQL server
    * Based on the instructions provided for a given homework assignment, verify that everything is working as expected
    * Go to the //~/assignments/// directory and execute //sudo submit_execute name//, where //name// is once again the name of the homework
    * Wait for the confirmation of success; otherwise your homework is not considered submitted
    * Should any complications appear, send your solution by e-mail to //prokoyul@fel.cvut.cz//
    * Just for your convenience, you can check the submitted files in the //~/submissions/// directory
      * Once the homework is assessed, you will find comments in this directory, too
  * Requirements:
    * Respect the prescribed names of individual files to be submitted (case sensitive)
    * Place all the files in the root directory of your submission
    * Do not include shared libraries or files that are not requested
      * I.e. do not submit files that were not explicitly requested
    * Do not redirect or suppress the standard or error outputs in your shell scripts
    * All your files must be syntactically correct and executable without errors

==== 1: XPath ====

  * Points: **15**
  * Assignment:
    * Create an **XML document** with sample data from the domain of your individual topic
    * Work with mutually interlinked entities of at least **3 different types** (e.g. lines, flights and tickets)
    * Insert data about at least **15 particular entities** (e.g. 3 lines, 4 flights, 8 tickets)
    * Create expressions for exactly **5 different XPath queries** (i.e. not more, not less; an illustrative sketch follows below this section)
    * Use each of the following constructs at least once
      * Axes: //descendant// or //descendant-or-self// or //%%//%%// abbreviation
      * Axes: //ancestor(-or-self)// or //preceding(-sibling)// or //following(-sibling)//
      * Predicates (all of the following): path expression (existence test), position testing, value comparison, general comparison
  * Requirements:
    * Both the XML document and queries must be **well-formed** (i.e. syntactically correct)
    * Put each XPath expression into a standalone file (e.g. //xpath1.xp//)
    * Always add a comment describing the intended **query meaning in natural language** via //(: comment :)//
    * Each query expression must be evaluated to a **non-empty sequence**
  * Submission:
    * **data.xml**: XML document with your data to be queried
    * **xpath1.xp**, ..., **xpath5.xp**: files with XPath expressions
  * Execution:
    * Execute the following shell command to evaluate each individual XPath query expression
      * //saxonb-xquery -s $DataFile $QueryFile//
      * //$DataFile// is the input XML document to be queried, i.e. //data.xml//
      * //$QueryFile// is a file with the query expression to be evaluated, e.g. //xpath1.xp//
  * Tools:
    * [[http://videlibri.sourceforge.net/cgi-bin/xidelcgi|VideLibri XPath and XQuery Processor]]
    * [[https://codebeautify.org/xmlvalidator|Code Beautify XML Validator]]
  * References:
    * XML: [[http://www.w3.org/TR/xml11/|Extensible Markup Language (XML) 1.1 (Second Edition)]] - W3C Recommendation (16 August 2006)
    * XPath: [[https://www.w3.org/TR/xpath-31/|XML Path Language (XPath) 3.1]] - W3C Recommendation (21 March 2017)
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Sunday 9. 10. 2022** until 23:59
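
For illustration only, a minimal sketch of what one submitted query file might look like, assuming a hypothetical flights topic (the element names //flights//, //flight//, //ticket//, //passenger// and //price// are invented and not part of the assignment):

<code>
(: Names of passengers holding a ticket for a flight that costs more than 100;
   uses the // abbreviation and a predicate with a comparison :)
//flight[price > 100]/ticket/passenger/name
</code>

The remaining query files would then have to cover the other required constructs (a reverse axis such as //ancestor//, position testing, and so on).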

==== 2: XQuery ====

  * Points: **15**
  * Assignment:
    * Create an **XML document** with sample data from the domain of your individual topic
    * Work with mutually interlinked entities of at least **3 different types** (e.g. lines, flights and tickets)
    * Insert data about at least **15 particular entities** (e.g. 3 lines, 4 flights, 8 tickets)
    * This document may (or may not) be identical to the one from the previous assignment on XPath
    * Create expressions for exactly **5 different XQuery queries** (that cannot be expressed solely using XPath)
    * Use each of the following constructs at least once
      * Direct or computed constructor
      * FLWOR expression (with at least one //for//, //let//, //where// and //order by// clause)
      * Aggregate function (//min//, //max//, //avg// or //sum//)
      * Conditional expression
      * Existential or universal quantifier
  * Requirements:
    * Both the XML document and queries must be **well-formed** (i.e. syntactically correct)
    * Put each XQuery expression into a standalone file (e.g. //xquery1.xq//)
    * Always add a comment describing the intended **query meaning in natural language** via //(: comment :)//
    * Each query expression must be evaluated to a **non-empty sequence**
  * Submission:
    * **data.xml**: XML document with your data to be queried
    * **xquery1.xq**, ..., **xquery5.xq**: files with XQuery expressions
  * Execution:
    * Execute the following shell command to evaluate each individual XQuery query expression
      * //saxonb-xquery -s $DataFile $QueryFile//
      * //$DataFile// is the input XML document to be queried, i.e. //data.xml//
      * //$QueryFile// is a file with the query expression to be evaluated, e.g. //xquery1.xq//
  * Tools:
    * [[http://videlibri.sourceforge.net/cgi-bin/xidelcgi|VideLibri XPath and XQuery Processor]]
    * [[https://codebeautify.org/xmlvalidator|Code Beautify XML Validator]]
  * References:
    * XML: [[http://www.w3.org/TR/xml11/|Extensible Markup Language (XML) 1.1 (Second Edition)]] - W3C Recommendation (16 August 2006)
    * XQuery: [[https://www.w3.org/TR/xquery-31/|XQuery 3.1: An XML Query Language]] - W3C Recommendation (21 March 2017)
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Sunday 16. 10. 2022** until 23:59

==== 3: SPARQL ====

  * Points: **20**
  * Assignment:
    * Create a **TTL document** with sample RDF triples within your individual topic
      * Use the Turtle notation in particular
    * Work with mutually interlinked resources of at least **3 different types** (e.g. lines, flights and tickets)
    * Insert data about at least **15 particular resources** (e.g. 3 lines, 4 flights, 8 tickets)
    * Use each of the following constructs at least once
      * Object list or predicate-object list
      * Blank nodes (either using the //_:// prefix or brackets //[]//)
    * Create expressions for exactly **5 different SPARQL queries** (//SELECT// query form in particular; an illustrative sketch follows below this section)
    * Use each of the following constructs at least once
      * Basic graph pattern
      * Group graph pattern
      * Optional graph pattern (//OPTIONAL//)
      * Alternative graph pattern (//UNION//)
      * Difference graph pattern (//MINUS//)
      * //FILTER// constraint
      * Aggregation (//GROUP BY// with or without //HAVING// clause)
      * Sorting (//ORDER BY// clause)
  * Requirements:
    * Both the TTL document and queries must be **well-formed** (i.e. syntactically correct)
    * Put each SPARQL query expression into a standalone file (e.g. //query1.sparql//)
    * Always add a comment describing the intended **query meaning in natural language** via //# comment//
    * Each query expression must be evaluated to a **non-empty solution sequence**
    * Both the data file and the query files must contain declarations of all prefixes used, including //rdf:// and similar
      * Use //%%@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .%%// in your data file
      * Use //%%PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>%%// in your query file
    * Do not use //FROM// clauses in your queries; the input data file will automatically be accessible as the default graph
  * Submission:
    * **data.ttl**: TTL document with your RDF data to be queried
    * **query1.sparql**, ..., **query5.sparql**: files with SPARQL query expressions
  * Execution:
    * Execute the following shell command to evaluate each individual SPARQL query expression
      * //sparql --data $DataFile --query $QueryFile//
      * //$DataFile// is the input RDF document to be queried, i.e. //data.ttl//
      * //$QueryFile// is a file with the query expression to be evaluated, e.g. //query1.sparql//
  * Tools:
    * [[http://ttl.summerofcode.be/|IDLab Turtle Validator]]
  * References:
    * RDF: [[https://www.w3.org/TR/rdf11-concepts/|RDF 1.1 Concepts and Abstract Syntax]] - W3C Recommendation (25 February 2014)
    * TTL: [[https://www.w3.org/TR/turtle/|RDF 1.1 Turtle: Terse RDF Triple Language]] - W3C Recommendation (25 February 2014)
    * SPARQL: [[https://www.w3.org/TR/sparql11-query/|SPARQL 1.1 Query Language]] - W3C Recommendation (21 March 2013)
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Sunday 23. 10. 2022** until 23:59
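
For illustration only, a minimal sketch of one such query file, again assuming a hypothetical flights topic (the prefix //ex:// and the properties //forFlight// and //price// are invented):

<code>
# Average ticket price per flight, cheapest flights first,
# considering only flights with at least two tickets sold
PREFIX ex: <http://example.org/flights#>

SELECT ?flight (AVG(?price) AS ?avgPrice)
WHERE {
  ?ticket ex:forFlight ?flight ;
          ex:price ?price .
}
GROUP BY ?flight
HAVING (COUNT(?ticket) >= 2)
ORDER BY ?avgPrice
</code>

This single query only covers aggregation and sorting; the remaining files would still have to cover //OPTIONAL//, //UNION//, //MINUS// and //FILTER//.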

==== 4: MapReduce ====

  * Points: **20**
  * Assignment:
    * Create an **input text file** with sample data from the domain of your individual topic
      * Insert realistic and non-trivial data about at least **10 entities of one type**
      * Put each of these entities on a separate line, i.e. assume that **each line of the input file yields one input record**
      * Organize the actual entity attributes in whatever way you are able to easily parse
      * E.g. //Medvídek 2007 53 100 Trojan Macháček Vilhelmová// corresponding to a pattern //Movie Year Rating Length Actors...//
    * Implement a non-trivial **MapReduce job** (an illustrative sketch follows below this section)
      * Choose from aggregation, grouping, filtering or any other general MapReduce usage pattern
      * Use the //WordCount.java// source file as a basis for your own implementation
      * Both the //Map// and //Reduce// functions should be non-trivial, each about 10 lines of code
      * It is not necessary to implement the //Combine// function
    * Comment the source file and also **provide a description of the problem** you are solving
    * You may also create a shell script that allows for the execution of your entire MapReduce job
      * I.e. compile source files, deploy the input file, execute the actual job, retrieve its result, ...
      * However, this script is not supposed to be submitted and serves just for your own convenience
      * Even if you do so, it will not be used for the purpose of homework assessment in any way
  * Requirements:
    * You may split your MapReduce job implementation into multiple **Java source files**
      * They all must be located in the submission root directory
      * At least the //MapReduce.java// source file with its public //MapReduce// class is required
      * This class is expected to represent the main class of the entire MapReduce job
    * Do not change the way **command line arguments** are processed
      * I.e. the only two arguments represent the input and output HDFS locations respectively
    * Do not use packages in order to organize your Java source files
    * Assume that only the //hadoop-common-3.1.1.jar// and //hadoop-mapreduce-client-core-3.1.1.jar// libraries will be linked with your project
      * Do not submit your NetBeans (or any other) project directory, do not submit Hadoop (or any other) libraries
    * Use Java Standard Edition version 7 or newer
    * You are free to use your ///user/f221_login/// **HDFS home directory** for debugging
      * Homework assessment will take place in a different dedicated HDFS directory
  * Submission:
    * **readme.txt**: description of the input data structure and objective of the MapReduce job
    * **input.txt**: text file with your sample input data (i.e. only one input file is permitted)
    * **MapReduce.java** and possibly additional ***.java**: Java source files with your MapReduce implementation
    * **output.txt**: expected output of your MapReduce job (i.e. submit the result of the execution you performed by yourself)
  * Tools:
    * [[http://hadoop.apache.org/|Apache Hadoop]] 3.1.1 (installed on the NoSQL server)
  * References:
    * HDFS: [[https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/FileSystemShell.html|Hadoop File System Shell commands]]
    * MapReduce: [[https://hadoop.apache.org/docs/r3.1.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html|MapReduce Tutorial]]
    * MapReduce: [[https://hadoop.apache.org/docs/r3.1.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html|MapReduce Commands Guide]]
    * Hadoop: [[https://hadoop.apache.org/docs/r3.1.1/api/|Hadoop JavaDoc API Documentation]]
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Monday 7. 11. 2022** until 23:59
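
For illustration only, a minimal sketch of such a job (not a solution), loosely following the structure of //WordCount.java// and assuming the movie-like record format mentioned above (//Movie Year Rating Length Actors...//); it counts, for every actor, the number of movies they appear in:

<code java>
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduce {

    // One input record per line: title, year, rating, length, then a list of actors
    public static class ActorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length < 5) return;               // skip malformed records
            for (int i = 4; i < fields.length; i++) {
                context.write(new Text(fields[i]), ONE); // emit (actor, 1)
            }
        }
    }

    // Sum the emitted ones, i.e. the number of movies per actor
    public static class ActorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) count += value.get();
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "actor-movie-count");
        job.setJarByClass(MapReduce.class);
        job.setMapperClass(ActorMapper.class);
        job.setReducerClass(ActorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input HDFS location
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output HDFS location
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</code>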

==== 5: Redis ====

  * Points: **10**
  * Assignment:
    * **Create a script** (ordinary text file) **with a sequence of commands working with Redis**
    * Illustrate that you can work with all data types (strings, lists, sets, sorted sets and hashes)
    * In particular, perform all the following operations:
      * **Strings**: 5 insertions (SET), 1 read (GET), 1 update (APPEND, SETRANGE, INCR, ...), 1 removal (DEL).
      * **Lists**: 5 insertions (LPUSH, RPUSH, ...), 2 different reads (LPOP, RPOP, LINDEX, LRANGE), 1 removal (LREM).
      * **Sets**: 5 insertions (SADD), 2 different reads (SISMEMBER, SUNION, SINTER, SDIFF), 1 removal (SREM).
      * **Sorted sets**: 5 insertions (ZADD), 1 read (ZRANGE, ZRANGEBYSCORE), 1 update (ZINCRBY), 1 removal (ZREM).
      * **Hashes**: 5 insertions (HSET, HMSET), 2 different reads (HGET, HMGET, HKEYS, HVALS, ...), 1 removal (HDEL).
    * Your database (i.e. keys and values) as well as commands must be **realistic** and within your **individual topic**
      * E.g. use a hash to store a mapping from seats to passengers for each flight
      * //HMSET seat-map-EK140-20171121 42A Peter 65F John//
      * Key //seat-map-EK140-20171121// is composed of a fixed prefix (//seat-map//), flight number (//EK140//) and date of departure (//20171121//)
      * The actual mapping contains pairs of seat numbers and passenger names, e.g. //42A Peter//
    * **Add comments** to your script using the //ECHO// command
      * Describe at least the intended structure of your keys and values in natural language
  * Requirements:
    * Only use the **database** you are supposed to use when working on the assignment
      * Your database number is in the gray column in the table with points
    * **Do not switch to your database** when you are inside your script
      * I.e. do not use a //SELECT// command to change the active database from within the script
      * Specify the intended database number outside your script using command line options (see below)
    * Note that a different dedicated database will be used when assessing your homework
      * You can assume that this database will be completely empty at the beginning
  * Submission:
    * **script.txt**: text file with Redis database commands
  * Execution:
    * Execute the following shell command to evaluate the whole Redis script
      * //cat $ScriptFile | redis-cli -n $DatabaseNumber//
      * //$ScriptFile// is a file with Redis commands to be executed, i.e. //script.txt//
      * //$DatabaseNumber// is the number of the database to be used, e.g. //5//
  * Tools:
    * [[http://redis.io/|Redis]] 3.2.4 (installed on the NoSQL server)
  * References:
    * [[https://redis.io/commands|Redis Commands]]
    * [[https://redis.io/documentation|Redis Documentation]]
    * [[https://redis.io/topics/data-types|Redis Data Types]]
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Monday 7. 11. 2022** until 23:59

==== 6: Cassandra ====

  * Points: **15**
  * Assignment:
    * **Create a script** (ordinary text file) **with a sequence of CQL statements working with a Cassandra database** (an illustrative sketch follows below this section)
    * **Define a schema for 2 tables** for entities of different types
      * Define at least one column for each of the following data types: **//tuple//, //list//, //set// and //map//**
    * **Insert about 5 rows** into each of your tables
    * Express at least **3 update statements**
      * You must perform replace, add and remove primitive operations (all of them) on columns of all collection types (all of them)
      * I.e. you must involve altogether at least 9 different primitive operations on such columns
    * Express **3 select statements**
      * Use //WHERE// and //ORDER BY// clauses at least once (both of them)
      * Use //ALLOW FILTERING// in a query that cannot be evaluated without this instruction
    * Create at least **1 secondary index**
  * Requirements:
    * Only **use your own keyspace** when working on the assignment
      * The name of this keyspace must be identical to your login name (//f221_login//)
      * Do not create this keyspace in your script (assume it already exists)
    * **Do not switch to your keyspace** when you are inside your script
      * I.e. do not execute a //USE// command to change the active keyspace from within the script
      * Specify the intended keyspace outside your script using command line options (see below)
    * Note that a different dedicated keyspace will be used when assessing your homework
      * You can assume that this keyspace will be completely empty at the beginning
  * Comments:
    * The following error messages can be ignored:
      * //Error from server: code=1300 [Replica(s) failed to execute read]...//
  * Submission:
    * **script.cql**: text file with CQL statements
  * Execution:
    * Execute the following shell command to evaluate the whole CQL script
      * //cqlsh -k $KeyspaceName -f $ScriptFile//
      * //$KeyspaceName// is the name of the keyspace to be used (must already exist), e.g. //f221_login//
      * //$ScriptFile// is a file with CQL queries to be executed, i.e. //script.cql//
  * Tools:
    * [[http://cassandra.apache.org/|Apache Cassandra]] 3.11.1 (installed on the NoSQL server)
  * References:
    * [[http://cassandra.apache.org/doc/latest/cql/|The Cassandra Query Language (CQL)]]
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Monday 14. 11. 2022** until 23:59
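
For illustration only, a short CQL fragment in the spirit of the assignment, assuming a hypothetical flights topic (the table //flights// and all column names are invented); a real solution still needs two tables, //tuple// and //set// columns, more rows, and all nine collection operations:

<code>
-- One flight per row; partitioned by flight number, clustered by departure date
CREATE TABLE flights (
    flight_number text,
    departure_date date,
    destination text,
    seats map<text, text>,      -- seat number -> passenger name
    stops list<text>,
    PRIMARY KEY (flight_number, departure_date)
);

INSERT INTO flights (flight_number, departure_date, destination, seats, stops)
VALUES ('EK140', '2022-11-21', 'Dubai', {'42A': 'Peter', '65F': 'John'}, ['Vienna']);

-- Replace one map entry, append one list element
UPDATE flights SET seats['42A'] = 'Alice'
  WHERE flight_number = 'EK140' AND departure_date = '2022-11-21';
UPDATE flights SET stops = stops + ['Istanbul']
  WHERE flight_number = 'EK140' AND departure_date = '2022-11-21';

-- Filtering on a non-key column requires ALLOW FILTERING
SELECT flight_number, destination FROM flights
  WHERE destination = 'Dubai' ALLOW FILTERING;
</code>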

==== 7: MongoDB ====

  * Points: **20**
  * Assignment:
    * **Create a JavaScript script with a sequence of commands working with a MongoDB database** (an illustrative sketch follows below this section)
    * Explicitly **create 2 collections** for entities of different types
      * I.e. create them using the //createCollection// method
    * **Insert about 5 documents** into each one of them
      * These documents must be realistic, non-trivial, and contain both **embedded objects and arrays**
      * Interlink the documents using **references**
      * Use the //insert// operation at least once
    * Express **3 update operations** (do not use the //save// operation for this purpose)
      * One without update operators
      * One with at least 2 different update operators
      * One based on the //upsert// mode
    * Express **5 find queries** (with non-trivial selections)
      * Use at least one logical operator (//$and//, //$or//, //$not//)
      * Use the //$elemMatch// operator on array fields at least once
      * Use both positive and negative projection (each at least once)
      * Use the //sort// modifier
      * **Describe the real-world meaning** of all your queries in comments
    * Express **1 MapReduce query** (non-trivial, i.e. not easily expressed using an ordinary //find// operation)
      * Describe its meaning, the contents of the intermediate key-value pairs and the final output
      * Note that the //reduce// function must be associative, commutative, and idempotent
  * Requirements:
    * Call //export LC_ALL=C// in case you have difficulties in launching the //mongo// shell
    * Only **use your own database** when working on the assignment
      * The name of this database must be identical to your login name (//f221_login//)
    * **Do not switch to your database** when you are inside your script
      * I.e. do not execute the //use database// or //db.getSiblingDB('database')// commands
      * Specify the intended database outside your script using command line options (see below)
    * Note that a different dedicated database will be used when assessing your homework
      * You can assume that this database will be completely empty at the beginning
    * Print the **output of your queries** (//find// operations)
      * Use the //db.collection.find().forEach(printjson);// approach for this purpose
    * Print the **output of your MapReduce job** using the //out: { inline: 1 }// option
      * I.e. do not redirect the output into a standalone collection
  * Submission:
    * **script.js**: JavaScript script with MongoDB database commands
  * Execution:
    * Execute the following shell command to evaluate the whole MongoDB script
      * //%%mongosh "mongodb://nosql.felk.cvut.cz:42222/$database" -u $username -p $password --authenticationDatabase admin < $file%%//
      * //$username// is your login name, e.g. //f221_login//
      * //$database// is the database to connect to (identical to your login name)
      * //$password// is your password
      * //$file// is a file with MongoDB queries to be executed, i.e. //script.js//
  * Tools:
    * [[http://www.mongodb.com/|MongoDB]] 6.0.1 (installed on the NoSQL server)
  * References:
    * [[https://docs.mongodb.com]]
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Monday 28. 11. 2022** until 23:59
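
For illustration only, a small fragment of such a script (not a solution), assuming a hypothetical flights topic with invented collection and field names:

<code javascript>
// Flights with an embedded destination object and an embedded tickets array
db.createCollection("flights");

db.flights.insertOne({
  _id: "EK140-20221121",
  destination: { city: "Dubai", country: "UAE" },   // embedded object
  tickets: [                                        // embedded array
    { seat: "42A", passenger: "Peter", price: 120 },
    { seat: "65F", passenger: "John", price: 95 }
  ]
});

// Flights having at least one ticket in row 42 that costs more than 100,
// with the tickets array hidden (negative projection), sorted by identifier
db.flights.find(
  { tickets: { $elemMatch: { seat: /^42/, price: { $gt: 100 } } } },
  { tickets: 0 }
).sort({ _id: 1 }).forEach(printjson);

// Rename the passenger of seat 65F and increase the price of that ticket
db.flights.updateOne(
  { "tickets.seat": "65F" },
  { $set: { "tickets.$.passenger": "Johnny" }, $inc: { "tickets.$.price": 10 } }
);
</code>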

==== 8: MongoDB-2 ====

  * Points: **15**
  * Assignment:
    * **Create a JavaScript script with a sequence of commands working with a MongoDB database**
    * Use the **2 previously created collections** for entities of different types
      * If necessary, **insert more documents** into each one of them
    * Express **5 aggregate operations** (an illustrative sketch follows below this section)
      * Use at least once the //$match//, //$group//, //$sort//, //$project// (or //$addFields//), //$skip// and //$limit// stages
      * Use at least once the //$sum// (or //$avg//), //$count//, //$min// (or //$max//), //$first// (or //$last//) aggregators
      * **Describe the real-world meaning** of all your queries in comments
  * Requirements:
    * Only **use your own database** when working on the assignment
      * The name of this database must be identical to your login name (//f221_login//)
    * **Do not switch to your database** when you are inside your script
      * I.e. do not execute the //use database// or //db.getSiblingDB('database')// commands
      * Specify the intended database outside your script using command line options (see below)
  * Submission:
    * **script.js**: JavaScript script with MongoDB database commands
    * Create a folder for the submission: //mkdir -p ~/assignments/mongodb2//
    * Put your script there as //~/assignments/mongodb2/script.js//
    * Submit the homework: //cd ~/assignments/mongodb2//, then //sudo submit_execute mongodb2//
  * Execution:
    * Execute the following shell command to evaluate the whole MongoDB script
      * //%%mongosh "mongodb://nosql.felk.cvut.cz:42222/$database" -u $username -p $password --authenticationDatabase admin < $file%%//
      * //$username// is your login name, e.g. //f221_login//
      * //$database// is the database to connect to (identical to your login name)
      * //$password// is your password
      * //$file// is a file with MongoDB queries to be executed, i.e. //script.js//
  * Tools:
    * [[http://www.mongodb.com/|MongoDB]] 6.0.1 (installed on the NoSQL server)
  * References:
    * [[https://docs.mongodb.com]]
  * Server: **nosql.felk.cvut.cz**
    * Do not forget to execute the homework submission script!
  * Deadline: **Monday 5. 12. 2022** until 23:59
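
For illustration only, one aggregation in the spirit of the assignment, reusing the invented //flights// collection from the sketch above:

<code javascript>
// For each destination city, the number of flights and the average ticket price,
// most expensive destinations first, top five only
db.flights.aggregate([
  { $unwind: "$tickets" },
  { $group: {
      _id: "$destination.city",
      flights: { $addToSet: "$_id" },
      avgPrice: { $avg: "$tickets.price" }
  } },
  { $project: { city: "$_id", _id: 0, flightCount: { $size: "$flights" }, avgPrice: 1 } },
  { $sort: { avgPrice: -1 } },
  { $limit: 5 }
]).forEach(printjson);
</code>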

==== Extra homework on MongoDB ====

  * Points: **10**
  * Assignment: [[https://docs.google.com/document/d/1Dt7SZRiKDnyDfRMDtpAXU1EFcNlstb75zQh97b8kjYw/edit?usp=sharing|see the Google document]]
  * Deadline: **Friday 6. 1. 2023** until 23:59

==== 9: Neo4j ====

  * Points: **20**
  * Assignment:
    * Insert realistic **nodes and relationships** into your embedded Neo4j database
      * Use a single //CREATE// statement for this purpose
      * Insert altogether at least **10 nodes** for entities of at least 2 different types (i.e. different labels)
      * Insert altogether at least **15 relationships** of at least 2 different types
      * Include properties (both for nodes and relationships)
      * Associate all your nodes with user-defined identifiers
    * Express **5 Cypher query expressions**
      * Use at least once //MATCH//, //OPTIONAL MATCH//, //RETURN//, //WITH//, //WHERE//, and //ORDER BY// (sub)clauses (all of them)
      * Use aggregation in at least one query
  * Requirements:
    * **Describe the meaning of your Cypher expressions** in natural language (via //%%//%% comment//)
  * Submission: send by e-mail
    * **data file** (text file with the inserted data), **queries.cypher**: text file with a sequence of Cypher statements (including //CREATE//), and **screenshots/video of the execution**
  * Execution:
    * Execute the following shell command to evaluate the whole Neo4j script
      * //cypher-shell -f $ScriptFile//
      * //$ScriptFile// is a file with Cypher queries to be executed, i.e.
//queries.cypher// * Tools: * [[http://www.neo4j.org/|Neo4j]] 3.0.7 (installed on the NoSQL server) * References: * [[https://neo4j.com/docs/developer-manual/current/cypher/|Cypher query language]] * [[https://neo4j.com/docs/cypher-refcard/current/|Cypher Reference Card]] * Deadline: **Monday 12. 12. 2022** until 23:59 ===== Individual Topics ===== * Please, fill in your name and surname near one of the topics in the [[https://docs.google.com/spreadsheets/d/1mBR9ADgZkD4msLkugmjl65763VtXGbLd93TUAhatrkQ/edit?usp=sharing|DS2 topics table]] or add your own topic at the bottom of the [[https://docs.google.com/spreadsheets/d/1mBR9ADgZkD4msLkugmjl65763VtXGbLd93TUAhatrkQ/edit?usp=sharing|DS2 topics table]]. * Try to propose your own original topic in the first place * You can also get inspired by the following topics (in alphabetical order, in English and in Czech) * Access system, Accommodation booking, Accommodation comparator, Accommodation sharing, Agricultural production, Air rescue service, Air traffic management, Airline, Airport, Armory, Army, Artworks, Assignment submission, ATM network, Attendance system, Auction, Bakery, Bank, Bank account, Bazaar, Beekeeper, Betting shop, Beverages store, Bike sharing, Black market, Blog, Boat rental, Bookstore, Botanic garden, Brewery, Building materials store, Bus station, Bus tickets, Business register, Cadastre, Cafe, Canteens, Car rental, Car repair shop, Car showroom, Casino, Castles, Catering, Caves, Cemetery, Cinema, City tours, Classbook, Collection and disposal of waste, Collection of laws, College dorm, Computer games, Conference, Construction management, Content management system, Contract register, Convenience store, Cookbook, Cooking classes, Council meetings, Countries of the world, Courier service, Cowshed, Dance school, Deliveries, Desk games, Discussion forum, Doctor's office, Dog park, Dog shelter, Driving school, Drugs, Dump, Educational institution, Elections, Electronic prescriptions, Employee records, Empty houses, Entertainment center, Environmental center, Exhibition, Exhibition grounds, Experience donation, Fairy tales, Farmer markets, Finance manager, Financial advisory, Financial markets, Fire protection, Fishing equipment, Fitness center, Flat owners association, Fleet, Flight ticket booking, Food bank, Food distribution, Football league, Football team, Forest kindergarten, Forwarding company, Foster care, Gallery, Garden center, Gardening colony, Gas station, Glassworks, Golf clubs, Grant agency, Grid, Hair salon, Handyman, Hardware, Health insurance, High school, Highway fees, Hiking trails, Hobby market, Hockey league, Holiday offers, Horse racing, Hospital, Hotel, Housing association, Chamber of deputies, Chess club, Chess competition, Chess database, Incinerator, Industrial zone, Insurance company, Intelligence service, Intersport arena, Job offers, Jurassic park, Kindergarten, Laboratory, Labour office, Language school, Lego, Leisure activities, Library, Log book, Logistics center, Logistics company, Logistics warehouse, Lottery, Luggage storage, Manufacturing processes, Maternity hospital, Medical reimbursement, Meeting scheduling, Menu, Metro operation, Military area, Mobile operator, Mobile phones, Model trains, Morgue, Mountain rescue service, Movies, Multinational company, Multiplex network, Museum, Music festival, Music production, Musical instruments, National parks, Nature reserve, Newspaper publishing, Non-bank loans, Nuclear power plant, Nutritional values, Online exercises, Online streaming service, 
Orienteering, Outdoor swimming pool, Parking lot, Parts catalog, Patient medical card, Pawnshop, Payment cards, Personal documents, Personal trainer, Pharmacy, Photo album, Pizzeria, Plagiarism detection, Planning calendar, Police database, Political parties, Popular music, Population register, Post, Postal addresses, Poultry farming, Prestashop, Prison, Procurement, Project management, Property administration, Psychiatric hospital, Public greenery, Public transport, Railway network, Real estate agency, Recruitment agency, Refugee camp, Registration of sales, Regulatory fees, Research projects, Research publications, Restaurant, Restaurant reservations, Road closures, Room reservation, Scout group, Scrapyard, Security agency, Seizures, Shared travel, Shooting range, Shopping center, Ski school, Skiing area, Sobering-up cell, Social benefits, Social network, Software development, Spare parts, Sports club, Sports tournament, Stable, Statement of work, Stock exchange, Student book, Study abroad, Study materials, Study system, Subsidy programs, Summer camp, Supermarket, Sweet-shop, Swimming pool, Symphony orchestra, Tax office, Taxi service, Teahouse, Theater, Theater plays, Time tables, Tollgates, Tourism, Tourist group, Traffic accidents, Traffic control center, Train station, Transport company, Transport control, Travel agency, Trial, Truck transport, TV program, TV series, Universe, Vaccination abroad, Veterinary clinic, Video shop, Virtual tours, Visas, War conflicts, Water park, Water supply, Weapons, Weather forecast, Webhosting, Webshop, Wedding dress rental, Wholesale, Winter road cleaning, World heritage list, Zoning plan, Zoo * Adresní místa, Aquapark, Armáda, Aukce, Autobusové nádraží, Autosalon, Autoškola, Banka, Bankovní účet, Bazar, Bezpečnostní agentura, Blog, Botanická zahrada, Burza, Bytové družstvo, Catering, Cestovní kancelář, Cukrárna, Cvičiště pro psy, Čajovna, Černý trh, Čerpací stanice, Dálniční poplatky, Darování zážitků, Deskové hry, Detekce plagiátů, Diskuzní fórum, Divadelní hry, Divadlo, Dodávka vody, Docházkový systém, Dopravní dispečink, Dopravní nehody, Dopravní podnik, Dopravní uzavírky, Doručování zásilek, Dotační programy, Ekologické centrum, Elektronická evidence tržeb, Elektronické recepty, Evidence smluv, Evidence součástek, Evidence zaměstnanců, Exekuce, Farmářské trhy, Filmy, Finanční poradenství, Finanční trhy, Finanční úřad, Fitness centrum, Fotbalová liga, Fotbalový tým, Fotoalbum, Galerie, Golfové kluby, Grantová agentura, Hardware, Hobby market, Hodinový manžel, Hokejová liga, Horská služba, Hotel, Hrady a zámky, Hřbitov, Hudební festival, Hudební nástroje, Hudební produkce, Jaderná elektrárna, Jazyková škola, Jazykové pobyty, Jednání zastupitelstva, Jeskyně, Jídelníček, Jízdenky na autobus, Jízdní řády, Jurský park, Kadeřnický salon, Kamionová doprava, Kasino, Katastr nemovitostí, Kavárna, Kino, Kniha jízd, Knihkupectví, Knihovna, Konference, Koňské dostihy, Koupaliště, Kravín, Kuchařka, Kurýrní služba, Kurzy vaření, Laboratoř, Lékárna, Lékařská karta pacienta, Léky, Lesní školka, Letecká společnost, Letecká záchranná služba, Letiště, Letní tábor, Logistická firma, Logistické centrum, Logistický sklad, Loterie, Lyžařská škola, Lyžařský areál, Márnice, Mateřská škola, Menzy, Městská hromadná doprava, Městské exkurze, Mobilní operátor, Mobilní telefony, Modely vláčků, Multifunkční aréna, Muniční sklad, Muzeum, Mýtné brány, Nabídky dovolené, Nabídky práce, Nadnárodní společnost, Náhradní díly, Národní park, Nebankovní půjčky, Nemocnice, Nutriční 
hodnoty, Obchodní centrum, Obchodní rejstřík, Očkování do ciziny, Odevzdávání úkolů, Online cvičení, Online půjčovna seriálů, Ordinace lékaře, Orientační běh, Osobní doklady, Osobní trenér, Parkoviště, Pekárna, Personální agentura, Pěstounská péče, Pivovar, Pizzerie, Plánovací kalendář, Plánování schůzek, Platební karty, Plavecký bazén, Pneuservis, Počítačové hry, Pohádky, Pojišťovna, Policejní databáze, Politické strany, Populární hudba, Porodnice, Poslanecká sněmovna, Pošta, Potravinová banka, Požární ochrana, Pracovní úřad, Prázdné domy, Prestashop, Provoz metra, Průmyslová zóna, Předpověď počasí, Přepravní kontrola, Přírodní rezervace, Přístupový systém, Psí útulek, Psychiatrická léčebna, Půjčovna auta, Půjčovna lodí, Půjčovna svatebních šatů, Realitní agentura, Redakční systém, Registr obyvatel, Regulační poplatky, Restaurace, Rezervace letenek, Rezervace místností, Rezervace ubytování, Rezervace v restauraci, Rozvodná síť, Rozvoz jídla, Rybářské potřeby, Řízení leteckého provozu, Řízení projektů, Sázková kancelář, Sbírka zákonů, Sdílená kola, Sdílené cestování, Síť bankomatů, Síť multikin, Skautské středisko, Sklad nápojů, Skládka, Sklárna, Sociální dávky, Sociální síť, Soudní řízení, Spalovna, Spediční firma, Společenství vlastníků jednotek, Sportovní klub, Sportovní turnaj, Správa objektů, Správce financí, Srovnávač ubytování, Stáj, Státy světa, Stavební řízení, Stavebnice lego, Stavebniny, Střední škola, Střelnice, Studijní materiály, Studijní systém, Supermarket, Světové dědictví, Svoz a likvidace odpadů, Symfonický orchestr, Šachová databáze, Šachová soutěž, Šachový klub, Taneční škola, Taxi služba, Televizní program, Televizní seriály, Třídní kniha, Turistické cesty, Turistický oddíl, Turistický ruch, Ubytování v soukromí, Uprchlický tábor, Úschovna zavazadel, Územní plán, Válečné konflikty, Včelař, Večerka, Vědecké projekty, Vědecké publikace, Velkochov drůbeže, Velkoobchod, Veřejná zeleň, Veřejné zakázky, Vesmír, Veterinární klinika, Vězení, Videopůjčovna, Virtuální prohlídky, Víza, Vlakové nádraží, Vojenský prostor, Volby, Volnočasové aktivity, Vozový park, Vrakoviště, Vydavatelství novin, Výkaz práce, Výrobní procesy, Vysokoškolská kolej, Výstava, Výstaviště, Výtvarná díla, Vývoj softwaru, Vzdělávací instituce, Webhosting, Webový obchod, Zábavní centrum, Zahrádkářská kolonie, Zahradnictví, Záchytka, Zastavárna, Zbraně, Zdravotní pojišťovna, Zdravotní úhrady, Zemědělská výroba, Zimní úklid komunikací, Zoologická zahrada, Zpravodajská služba, Žákovská knížka, Železniční síť * Nevertheless, the following topics are **not allowed** this semester * Movies, actors ===== Exam Requirements ===== For online exam: - Use zoom - You must turn on the camera For written exam: - You can use paper or your laptops, the latter is preferable - Strict limitation in time ==== NoSQL Introduction ==== * **Big Data and NoSQL** terms, **V characteristics** (volume, variety, velocity, veracity, value, validity, volatility), **current trends** and challenges (Big Data, Big Users, processing paradigms, ...), principles of **relational databases** (functional dependencies, normal forms, transactions, ACID properties); **types of NoSQL systems** (key-value, wide column, document, graph, ...), their data models, features and use cases; **common features** of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, ...) 
==== Data Formats ==== * **XML**: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML * **JSON**: constructs (object, array, value), types of values (strings, numbers, ...); **BSON**: document structure (elements, type selectors, property names and values) * **RDF**: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); **N-Triples notation** (RDF file, statements, triple components, literals, IRI references); **Turtle notation** (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals) * **CSV**: constructs (document, header, record, field) ==== XML Databases ==== * Native XML databases vs. XML-enabled relational databases; data model (**XDM**): tree (nodes for document, elements, attributes, texts, ...), document order, reverse document order, sequences, atomic values, singleton sequences * **XPath** language: **path** expressions (relative vs. absolute, evaluation algorithm), path step (axis, node test, predicates), **axes** (forward: child, descendant, following, ...; reverse: parent, ancestor, preceding, ...; attribute), **node tests**, **predicates** (path conditions, position testing, ...), abbreviations * **XQuery** language: path expressions, **direct constructors** (elements, attributes, nested queries, well-formedness), **computed constructors** (dynamic names), **FLWOR** expressions (for, let, where, order by, and return clauses), typical FLWOR use cases (joining, grouping, aggregation, integration, ...), **conditional** expressions (if, then, else), **switch** expressions (case, default, return), universal and existential **quantified** expressions (some, every, satisfies), **comparisons** (value, general, node; errors), atomization of values (elements, attributes) ==== RDF Stores ==== * **Linked Data**: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud * **SPARQL**: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), **graph patterns** (basic, group, optional, alternative, graph, minus); **prologue declarations** (BASE, PREFIX clauses), **SELECT** queries (SELECT, FROM, and WHERE clauses), query **dataset** (default graph, named graphs), **variable assignments** (BIND), **FILTER** constraints (comparisons, logical connectives, accessors, tests, ...), **solution modifiers** (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), **query forms** (SELECT, ASK, DESCRIBE, CONSTRUCT) ==== MapReduce ==== * **Programming models**, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism) * **MapReduce**: programming model (data parallelism, map and reduce functions), **cluster architecture** (master, workers, message passing, data distribution), **map and reduce functions** (input arguments, emission and reduction of intermediate key-value pairs, final output), **data flow phases** (mapping, shuffling, reducing), input parsing (input file, split, record), **execution steps** (parsing, mapping, partitioning, combining, merging, reducing), **combine function** (commutativity, associativity), additional functions (input 
reader, partition, compare, output writer), **implementation details** (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...) * Apache **Hadoop**: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); **HDFS**: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); **MapReduce**: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema ==== NoSQL Principles ==== * **Scaling**: scalability definition; **vertical scaling** (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); **horizontal scaling** (scaling out/in), pros and cons, **network fallacies** (reliability, latency, bandwidth, security, ...), **cluster** architecture; design questions (scalability, availability, consistency, latency, durability, resilience) * **Distribution** models: **sharding**: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, ...); **replication**: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of **sharding and replication** * **CAP** theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, **ACID properties** (atomicity, consistency, isolation, durability), **BASE properties** (basically available, soft state, eventual consistency) * **Consistency**: strong vs. eventual consistency; **write consistency** (write-write conflict, context, pessimistic and optimistic strategies), **read consistency** (read-write conflict, context, inconsistency window, session consistency), **read and write quora** (formulae, motivation, workload balancing) ==== Key-Value Stores ==== * Data model (key-value pairs), **key management** (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, ...) 
* **Redis**: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), **string** commands (SET, GET, APPEND, SETRANGE, INCR, DEL, ...), **list** commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, ...), **set** commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), **sorted set** commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ...), **hash** commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), **general** commands (EXISTS, KEYS, DEL, RENAME, ...), **time-to-live** commands (EXPIRE, TTL, PERSIST) ==== Wide Column Stores ==== * Data model (column families, rows, columns), query patterns, use cases, representatives * **Cassandra**: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); **CQL** language: DDL statements: **CREATE KEYSPACE** (replication options), DROP KEYSPACE, USE keyspace, **CREATE TABLE** (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: **SELECT** statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), **INSERT** statements (update parameters: TTL, TIMESTAMP), **UPDATE** statements (assignments; modification of collections: additions, removals), **DELETE** statements (deletion of rows, removal of columns, removal of items from collections) ==== Document Stores ==== * Data model (documents), query patterns, use cases, representatives * **MongoDB**: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); **insert** operation (management of identifiers); **update** operation: replace vs. update mode, multi option, upsert mode, update operators (field: set, rename, inc, ...; array: push, pop, ...); **save** operation (insert vs. replace mode); **remove** operation (justOne option); **find** operation: query conditions (value equality vs. 
query operators), query operators (comparison: eq, ne, ...; element: exists; evaluation: regex, ...; logical: and, or, not; array: all, elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: slice, elemMatch), **modifiers** (sort, skip, limit); **MapReduce** (map function, reduce function, options: query, sort, limit, out); primary and secondary **index structures** (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL) ==== Graph Databases ==== * Data model (property graphs), use cases, representatives * **Neo4j**: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; **traversal framework**: traversal description, **order** (breadth-first, depth-first, branch ordering policies), **expanders** (relationship types, directions), **uniqueness** (NODE_GLOBAL, RELATIONSHIP_GLOBAL, ...), **evaluators** (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); **Cypher** language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); **path patterns**, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); **MATCH** clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); **RETURN** clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); **WITH** clause (motivation, subclauses); **write clauses**: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions) ==== Advanced Aspects ==== * **Graph databases**: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data **structures** (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph **traversals**, **data locality** (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), **graph partitioning** (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing * **Performance tuning**: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model * **Polyglot persistence** ===== Recommended Literature ===== **This course is based on [[https://www.ksi.mff.cuni.cz/~svoboda/courses/211-B4M36DS2/|materials by Martin Svoboda]]** * Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: [[http://www.ksi.mff.cuni.cz/bigdata/|Big Data a NoSQL databáze]].\\ ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).\\ Grada Publishing, a.s., 2015. * Sadalage, Pramod J. - Fowler, Martin: [[http://martinfowler.com/books/nosql.html|NoSQL Distilled]].\\ ISBN: 978-0-321-82662-6.\\ Pearson Education, Inc., 2013. 
* Wiese, Lena: [[https://www.degruyter.com/viewbooktoc/product/460529|Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases]].\\ ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).\\ DOI: [[http://doi.org/10.1515/9783110441413|10.1515/9783110441413]].\\ Walter de Gruyter GmbH, 2015. * Zomaya, Albert Y. - Sakr, Sherif: [[http://www.springer.com/gp/book/9783319493398|Handbook of Big Data Technologies]].\\ ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).\\ DOI: [[http://doi.org/10.1007/978-3-319-49340-4|10.1007/978-3-319-49340-4]].\\ Springer International Publishing AG, 2017.