courses:b4m36ds2:tutorials:start [CourseWare Wiki]

Homework Assignments

Preliminaries:
- NoSQL server: nosql.felk.cvut.cz
- Login and password: sent by e-mail
Tools:
- PuTTY
- WinSCP
Submissions:
- Use sftp or WinSCP to upload your submission files to the NoSQL server
- Put these files into a directory ~/assignments/name/, where name is a name of a given homework
- I.e. postgresql, mapreduce, redis, cassandra, mongodb, neo4j (case sensitive)
- Use ssh or PuTTY to open a remote shell connection to the NoSQL server
- Based on the instructions provided for a given homework assignment, verify that everything is working as expected
- Go to the ~/assignments/ directory and execute sudo submit_execute name, where name is the name of the homework
- Wait for the confirmation of success. Otherwise, your homework is not considered to be submitted
- Should any complications appear, send your solution by e-mail to prokoyul@fel.cvut.cz
- Just for your convenience, you can check the submitted files in the ~/submissions/ directory
- Upload to BRUTE the same script + a screenshot of its execution on the NoSQL server.
  - Use the filename: username_number.pdf, where the number is the homework number.
- Once the homework is assessed, you will find points and comments in BRUTE
Requirements:
- Respect the prescribed names of individual files to be submitted (case sensitive)
- Place all the files in the root directory of your submission
- Do not include shared libraries or files that are not requested
- I.e. do not submit files that were not explicitly requested
- Do not redirect or suppress both standard and error outputs in your shell scripts
- All your files must be syntactically correct and executable without errors

General homework assignment

Develop an application that uses multiple databases for different tasks within a project (polyglot persistence).

The project aims to demonstrate how different database systems can work together effectively to provide high performance, scalability, and usability.

Requirements:
- The project should reasonably utilize database systems:
  - PostgreSQL
  - Redis
  - Cassandra
  - MongoDB
  - Neo4j
- Describe the subject area and write the functional requirements for the project (1 point).
- Describe each project block, select an appropriate system, and justify your choice (2 points).
- Implement basic queries for each of the five databases according to the specific requirements below (14 points).
- Integrate these queries into a working web application.
  - Successful integration of each database system (5 points, 1 point per database)
  - Simple graphical interface (2 points).
  - Demonstrate user interaction with data from each database through the web interface (1 point).
  - Write brief explanations of how each database is used in the project (1 point)
- Total base points: 26.

Optional extensions.
*Design each project block as a microservice (+3 points).
Implement data synchronization between different blocks
- Basic synchronization (e.g., clear cart in Redis at checkout) (+1 point)
- Advanced synchronization across multiple databases (+3 points)
Implement error handling and data validation (+2 points)

The minimum passing grade will be awarded for implementing basic native queries for each of the five database systems.
Important notice for homework assignments and final Project:
- Homework Assignments:
  - Each homework assignment must include console-based native queries for the specific database system.
  - These queries will be checked and evaluated separately on an empty database.
  - Your script should work independently, creating necessary data and executing the queries specified in the assignment.
  - Include all required commands, even if some seem less practical in a real-world application context.
- Final Project Integration:
  - For the final project, integrate the queries from your homework into a functioning application.
  - If specific queries from the homework seem impractical in the context of your application, you may omit them in the final version.
  - However, be prepared to explain your decision.
  - Ensure that your application demonstrates the practical use of each database system.
- Data Initialization:
  - Both homework scripts and the final project should include data initialization commands (e.g., INSERT statements, data creation methods).
  - This ensures that your work can be evaluated in a clean database environment.
- Remember: The goal is to demonstrate proficiency with each database system through both isolated queries (homework) and practical application (final project).

You can organize into teams of up to 3 students to work on the project.
In this case, the requirements for the number of queries are multiplied by the number of students in the group, or each student implements different queries in the homework assignment

HW0: Topic selection

Points: 3

Assignment:

1. Choose your distinct topic.

Example: The online store will specialize in custom furniture, allowing users to select designs, materials, and dimensions. The platform will have various features such as product inventory, user sessions, shopping carts, purchase history, activity logs, and personalized recommendations.

2. Describe the subject area - tell about your project, its participants, and the tasks inside the project.

3. Write the functional requirements for the project.

4. Separate the parts of your project so that each of the listed database systems (PostgreSQL, Redis, MongoDB, Cassandra, Neo4j) will be used at least once.

Example:

Product inventory – PostgreSQL,
User accounts – PostgreSQL,
User sessions – Redis,
Shopping carts – Redis,
Purchase history – MongoDB,
Activity logs – Cassandra,
Personalized recommendations – Neo4j,
Caching frequently accessed data – Redis.

4. Write the arguments for each choice.

Submission:
- Submit the topic to BRUTE in the form of a text document (.txt)
- Insert your topic into the DS2 topics table
- Deadline: Sunday 6. 10. 2024 until 23:59

HW1: Relational database (PostgreSQL)

Points: 2 (+1 bonus: 0.5 for triggers, 0.5 bonus for JSON types)
Assignment:

1. Create the first block of the project. It must be a relational database. Create ER-model and tables.

2. Fill these tables with data. The data must be realistic.

3. Create indexes to speed up common queries.

4. Extra: use triggers for insert, update, etc.

5. Extra: use JSON types and write queries to process data stored in a JSON object (use @> and →).

Review the further homework and prepare data suitable for all of them. The main table must have over 50 rows.

Submission:
- Upload to BRUTE: .sql file with SQL-queries, image with ER-model, image with a screenshot.
- Execute the queries on the nosql.felk.cvut.cz server. Do not forget to execute the homework submission script!
Tools:
- PostgreSQL 16.4 (installed on the NoSQL server)
Remote access to PostgreSQL:
- Address: nosql.felk.cvut.cz.
- Default port (5432)
- login = database = f24_username
- password: initial password from email
References:
- PostgreSQL Documentation
- JSON Types in PostgreSQL
Deadline: Sunday 20. 10. 2024 until 23:59

HW2: Redis

Points: 3
Assignment:

Caching using Redis – 2 points
- Prepare at least 3 SQL queries to the database (PostgreSQL). It is assumed that these will be queries frequently asked by users (e.g., search for goods by some criterion).
- Use caching with Redis to speed up the execution of these queries. The database is queried if there are no query results in the cache (with an appropriate notification). If the query is already in the cache, the result is issued directly from the cache.

Redis data types – 2 points
Use Redis to organize a user's shopping cart, user sessions, etc., and perform all the following operations:
- Strings: 5 insertions (SET), 1 read (GET), 1 update (APPEND, SETRANGE, INCR, …), 1 removal (DEL).
- Lists: 5 insertions (LPUSH, RPUSH, …), 2 different reads (LPOP, RPOP, LINDEX, LRANGE), 1 removal (LREM).
- Sets: 5 insertions (SADD), 2 different reads (SISMEMBER, SUNION, SINTER, SDIFF), 1 removal (SREM), 1 SCARD.
- Sorted sets: 5 insertions (ZADD), 1 read (ZRANGE, ZRANGEBYSCORE), 1 update (ZINCRBY), 1 removal (ZREM, ZREMRANGEBYSCORE), 1 ZCARD or ZCOUNT.
- Hashes: 5 insertions (HSET, HMSET), 2 different reads (HGET, HMGET, HKEYS, HVALS, …), 1 removal (HDEL).
- Geographic coordinates: 5 insertions, 1 GEOSEARCH, 1 GEODIST.

Add comments to your script using the ECHO command
Describe at least the intended structure of your keys and values in natural language

Your PostgreSQL database name is db_username (for example, db_f241_student). Use the same password you received for your account at the beginning of the semester. Details about the remote connection are sent to you by email.

Requirements:
- Only use the database you are supposed to use when working on the assignment

Submission:
- script_sql.txt: text file with SQL queries, script.txt: text file with Redis database commands, script_pg.py with Python script.
- Submit to BRUTE these three files and a screenshot of the execution on the NoSQL server

* Execution:

Execute the following shell command to evaluate the whole REDIS script
- cat $ScriptFile | redis-cli -n $DatabaseNumber
- $ScriptFile is a file with REDIS commands to be executed, i.e. script.txt
- $DatabaseNumber is a number of database to be used, e.g. 5

Tools:
Redis 7.4.0 (installed on the NoSQL server)
References:
Redis Commands
Redis Documentation
Redis Data Types
Server: nosql.felk.cvut.cz
Do not forget to execute the homework submission script!
Deadline: Sunday 3. 11. 2024 until 23:59.

HW3: MongoDB

Points: 2.5 (one collection) or 3 (two collections)
Assignment:
- Implement the project block with MongoDB
- Explicitly create one or two collections (for example, one collection for orders or two collections: orders and order_items)
  - I.e., create it using createCollection method
- Insert 10 documents into each collection
  - These documents must be realistic, non-trivial, and with both embedded objects and arrays
  - Interlink the documents using references
  - Use both insertOne and insertMany
- Express update operations:
  - Replace the value of the document
  - Update an individual field of one document
  - Update an individual field in documents by criterion; use a formula or increase/multiply the value
  - Add and remove a field in the document(s)
  - Modify a nested document
  - Replace, add, and delete a single array element
  - Use the upsert mode at least once
- Express at least 5 find queries (with non-trivial selections, i.e. the first parameter cannot be empty)
  - Use at least one logical operator ($and, $or, $not)
  - Use $elemMatch operator on array fields at least once
  - Use both positive and negative projections (each at least once)
  - Use sort modifier
  - The search criteria should contain (each item at least once):
    - matching the value of one or more fields
    - condition on the value of one or more fields (e.g., containing greater than or less than, etc.)
    - comparing dates or parts of dates (for example, all orders for a given date or all orders for a given year)
    - matching the whole array
    - searching for a value in the array
  - Use lookup (if you have two collections)
  - Describe the real-world meaning of all your queries in comments
- Express 5 aggregate operations. Use 2 created collections for entities of different types. If necessary, insert more documents into each one of them
  - Use at least once each of $match, $group, $sort, $project (or $addFields), $skip and $limit stages
  - Use at least once each of $sum (or $avg), $count, $min (or $max), $first (or $last) aggregators
  - Describe the real-world meaning of all your queries in comments
Requirements:
- All queries must be realistic in the context of your project
- Call export LC_ALL=C in case you have difficulties in launching the mongo shell
- Only use your own database when working on the assignment
  - The name of this database must be identical to your login name (f241_login)
- Do not switch to your database when you are inside your script
  - I.e. do not execute USE database and nor db.getSiblingDB('database') commands
  - Specify the intended database outside your script using command line options (see below)
- Note that a different dedicated database will be used when assessing your homework
  - You can assume that this database will be empty at the beginning
- Print the output of your queries (find and aggregate operations)
  - Use db.collection.find().forEach(printjson); approach for this purpose
Submission:
- script.js: JavaScript script with MongoDB database commands
Execution: mongosh –port 42222 -u login -p password database script.js
- Execute the following shell command to evaluate the whole MongoDB script
  - mongosh –port 42222 -u $login -p $password $database $file
  - $login is your username, e.g. f241_login
  - $database - database to connect to (same as login)
  - $password is your password (Use the same password you received for your account at the beginning of the semester)
  - $file is a file with MongoDB queries to be executed, i.e. script.js
  - Double dashes before port
Tools:
- MongoDB 7.0.14 (installed on the NoSQL server)
References:
- https://docs.mongodb.com
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 24. 11. 2024 until 23:59

HW4: Cassandra

Points: 3
Assignment:
- Implement the project block with Cassandra
- Define a schema for a table
  - Define at least one column for each of the following data types: tuple, list, set and map
- Insert about 10 rows into your table
- Express at least 3 update statements
  - You must perform replace, add and remove primitive operations (all of them) on columns of all collection types (all of them)
  - I.e. you must involve at least altogether 9 different primitive operations on such columns
- Express 3 select statements to retrieve data
  - Use WHERE and ORDER BY clauses at least once (both of them)
  - Use ALLOW FILTERING in a query exactly once
- Create and use at least 1 secondary index
- Create a materialized view
- Write 3 queries to obtain statistical information. Use the materialized view at least once
- Add comments with descriptions to all queries
Requirements:
- Only use your own keyspace when working on the assignment
  - The name of this keyspace must be identical to your login name (f241_login)
  - Do not create this keyspace in your script (assume it already exists)
- Do not switch to your keyspace when you are inside your script
  - I.e. do not execute a USE command to change the active keyspace from within the script
  - Specify the intended keyspace outside your script using command line options (see below)
- Note that a different dedicated keyspace will be used when assessing your homework
  - You can assume that this keyspace will be empty at the beginning
Comments:
- The following error messages can be ignored:
  - Error from server: code=1300 [Replica(s) failed to execute read]…
Submission:
- script.cql: text file with CQL statements
Execution:
- Execute the following shell command to evaluate the whole CQL script
  - cqlsh -k $KeyspaceName -f $ScriptFile
  - $KeyspaceName is a name of keyspace that should be used (must already exist), e.g. f241_login
  - $ScriptFile is a file with CQL queries to be executed, i.e. script.cql
Tools:
- Apache Cassandra 4.1.6 (installed on the NoSQL server)
References:
- The Cassandra Query Language (CQL)
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 1. 12. 2024 until 23:59

HW5: Neo4j

Points: 3
Assignment: Implement the project block with Neo4j
- Insert realistic nodes and relationships into your embedded Neo4j database
  - Use a single CREATE statement for this purpose
  - Insert altogether at least 10 nodes for entities of at least 2 different types (i.e. different labels)
  - Insert altogether at least 15 relationships of at least 2 different types
  - Include properties (both for nodes and relationships)
  - Associate all your nodes with user-defined identifiers
- Express 5 Cypher read query expressions
  - Use at least once MATCH, OPTIONAL MATCH, RETURN, WITH, WHERE, and ORDER BY (sub)clauses (all of them)
  - Aggregation in at least one query
- Express 5 Cypher write or read/write query expressions
  - Use at least once CREATE (with MATCH), DELETE, SET, REMOVE, DETACH (sub)clauses (all of them)
Requirements:
- Describe the meaning of your Cypher expressions in natural language (via // comment)
Submission: BRUTE and NoSQL server
- queries.cypher: text file with a sequence of Cypher statements (including CREATE) and screenshots of execution
Execution:
- Execute the following shell command to evaluate the whole Neo4j script
  - cypher-shell -f $ScriptFile
  - $ScriptFile is a file with Cypher queries to be executed, i.e. queries.cypher
Tools:
- Neo4j 5.23.0 (Java 21 *) (installed on the NoSQL server)
References:
- Cypher query language
- Cypher Reference Card
Deadline: Sunday 15. 12. 2024 until 23:59

HW6: MapReduce

Points: 3
Assignment:
- Create an input text file. This file must be large and can contain data from your project or third-party information.
  - Put each entity on a separate line, i.e. assume that each line of the input file yields one input record
  - Organize the actual entity attributes in whatever way you can easily parse
- Implement a non-trivial MapReduce job
  - Choose from aggregation, grouping, filtering or any other general MapReduce usage pattern
  - Use WordCount.java source file as a basis for your own implementation
  - Both the Map and Reduce functions should be non-trivial, each about 10 lines of code
  - It is not necessary to implement the Combine function
- Comment the source file and also provide a description of the problem you are solving
- You may also create a shell script that allows for the execution of your entire MapReduce job
  - I.e. compile source files, deploy input file, execute the actual job, retrieve its result, …
  - However, this script is not supposed to be submitted and serves just for your own convenience
  - Even if you do so, it will not be used for the purpose of homework assessment in any way
Requirements:
- You may split your MapReduce job implementation into multiple Java source files
  - They all must be located in the submission root directory
  - At least MapReduce.java source file with its public MapReduce class is required
  - This class is expected to represent the main class of the entire MapReduce job
- Do not change the way how command line arguments are processed
  - I.e. the only two arguments represent the input and output HDFS locations respectively
- Do not use packages in order to organize your Java source files
- Assume that only hadoop-common-3.1.1.jar and hadoop-mapreduce-client-core-3.1.1.jar libraries will be linked with your project
- Do not submit your Netbeans (or any other) project directory, do not submit Hadoop (or any other) libraries
- Use Java Standard Edition version 7 or newer
- You are free to use your /user/f231_login/ HDFS home directory for debugging
  - Homework assessment will take place in a different dedicated HDFS directory
Submission (NoSQL server):
- readme.txt: description of the input data structure and objective of the MapReduce job
- input.txt: text file with your sample input data (i.e. only one input file is permitted)
- MapReduce.java and possibly additional *.java: Java source files with your MapReduce implementation
- output.txt: expected output of your MapReduce job
- Upload to BRUTE: readme.txt and screenshot (or screenshots) of the execution of your homework on the NoSQL server.
Tools:
- Apache Hadoop 3.4.0 (installed on the NoSQL server)
References:
- HDFS: Hadoop File System Shell commands
- MapReduce: MapReduce Tutorial
- MapReduce: MapReduce Commands Guide
- Hadoop: Hadoop JavaDoc API Documentation
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script and make a screenshot!
Deadline: Sunday 22. 12. 2024 until 23:59

Bonus assignment #1 (BA1)

Analysis of Sharding and Replication Strategies in NoSQL Databases (+3 p.)

For each of the following sharding strategies (lecture 3):
- Hash sharding
- Range-based sharding
- Directory-based sharding
- Consistent hash sharding
- Entity/relationship-based sharding
- Geography-based sharding
- Time-based sharding
- Functional sharding
- Combined sharding.
Complete the following tasks:
- Create a table listing the advantages and potential issues of the strategy.
- Formulate the conditions under which this strategy would be most effective.
- Provide a specific example of data distribution using this strategy.
- Create a graphical illustration of this example (diagram, chart, or drawing).
- Propose a suitable replication strategy to complement this sharding strategy and briefly justify your choice.
Prepare a brief conclusion comparing the effectiveness of the examined strategies in various NoSQL database usage scenarios.
Deadline: Sunday 22. 12. 2024 until 23:59.

Individual Topics

Please enter the team number, username, and topic in the DS2 topics table.
The following topics are not allowed this semester
- Movies, actors, furniture e-shop

Table of Contents