Homework Assignments

Preliminaries:
- NoSQL server: nosql.felk.cvut.cz
- Login and password: sent by e-mail
Tools:
- PuTTY
- WinSCP

VSCode and JetBrains IDE remote connections that do not work over SSHFS are prohibited because of the heavy load they put on the server. There is no documented way to install one shared instance of the required tools on the server for all users.
If you still want to use such an IDE, mount the NoSQL server file system on your PC and open your projects as if they were on your local file system.
- For VSCode there is an SSHFS extension:
  - https://marketplace.visualstudio.com/items?itemName=Kelvin.vscode-sshfs
- For any other IDE you can mount the remote file system manually on Linux/macOS:
  - Install sshfs on your device
  - Mount filesystem:
```
sshfs $USERNAME@nosql.felk.cvut.cz:/home/$USERNAME /mnt/dbs2-server
```
  - Open your project in your IDE under local path
```
/mnt/dbs2-server/$PATH_TO_PROJECT
```
  - When you are finished, you can unmount it
```
sudo umount /mnt/dbs2-server
```
- For Windows, use any tool like SSHFS-Win (follow tool documentation):
  - https://github.com/winfsp/sshfs-win
- Note: on Linux with GNOME/Ubuntu you can do this directly from the Files app:
  - For the latest GNOME 47, use the “Networks” section at the top of the sidebar to add a connection
  - For older versions, use the “Other” section at the bottom of the sidebar
  - The address should be
```
sftp://$USERNAME@nosql.felk.cvut.cz/
```

Submissions

Submissions:
- Use sftp or WinSCP to upload your submission files to the NoSQL server
- Put these files into the directory ~/assignments/name/, where name is the name of the given homework
  - I.e. postgresql, mapreduce, redis, cassandra, mongodb, neo4j (case sensitive)
- Use ssh or PuTTY to open a remote shell connection to the NoSQL server
- Based on the instructions provided for a given homework assignment, verify that everything is working as expected
- Go to the ~/assignments/ directory and execute
```
sudo submit_execute name
```
  where name is the name of the homework
- Wait for the confirmation of success. Otherwise, your homework is not considered to be submitted
- Should any complications appear, send your solution by e-mail to prokoyul@fel.cvut.cz
- Just for your convenience, you can check the submitted files in the ~/submissions/ directory
- Upload to BRUTE the same script + a screenshot of its execution on the NoSQL server.
  - Use the filename: username_number.pdf, where the number is the homework number.
- Once the homework is assessed, you will find points and comments in BRUTE
Requirements:
- Respect the prescribed names of individual files to be submitted (case sensitive)
- Place all the files in the root directory of your submission
- Do not include shared libraries or files that are not requested
  - I.e. do not submit files that were not explicitly requested
- Do not redirect or suppress the standard output or the error output in your shell scripts
- All your files must be syntactically correct and executable without errors

General homework assignment

Develop an application that uses multiple databases for different tasks within a project (polyglot persistence).

The project aims to demonstrate how different database systems can work together effectively to provide high performance, scalability, and usability.

Requirements:
- The project should make reasonable use of the following database systems:
  - PostgreSQL
  - Redis
  - Cassandra
  - MongoDB
  - Neo4j
- Describe the subject area and write the functional requirements for the project (1 point).
- Describe each project block, select an appropriate system, and justify your choice (2 points).
- Implement basic queries for each of the five databases according to the specific requirements below (14 points).
- Integrate these queries into a working web application.
  - Successful integration of each database system (5 points, 1 point per database)
  - Simple graphical interface (2 points).
  - Demonstrate user interaction with data from each database through the web interface (1 point).
  - Write brief explanations of how each database is used in the project (1 point)
- Total base points: 26.

Optional extensions:
- Design each project block as a microservice (+3 points)
- Implement data synchronization between different blocks
  - Basic synchronization (e.g., clear cart in Redis at checkout) (+1 point)
  - Advanced synchronization across multiple databases (+3 points)
- Implement error handling and data validation (+2 points)

The minimum passing grade will be awarded for implementing basic native queries for each of the five database systems.
Important Notice for Homework and Final Project:
- Homework Assignments:
  - Each homework assignment must include console-based native queries for the specific database system.
  - These queries will be checked and evaluated separately on an empty database.
  - Your script should work independently, creating necessary data and executing the queries specified in the assignment.
  - Include all required commands, even if some seem less practical in a real-world application context.
- Final Project Integration:
  - For the final project, integrate the queries from your homework into a functioning application.
  - If specific queries from the homework seem impractical in the context of your application, you may omit them in the final version.
  - However, be prepared to explain your decision.
  - Ensure that your application demonstrates the practical use of each database system.
- Data Initialization:
  - Both homework scripts and the final project should include data initialization commands (e.g., INSERT statements, data creation methods).
  - This ensures that your work can be evaluated in a clean database environment.
- Remember: The goal is to demonstrate proficiency with each database system through both isolated queries (homework) and practical application (final project).

You can form teams of up to 3 students to work on the project.
In that case, either the required number of queries is multiplied by the number of students in the team, or each student implements different queries in the homework assignment.

HW0: Topic selection

Points: 3

Assignment:

1. Choose your distinct topic.

Example: The online store will specialize in custom furniture, allowing users to select designs, materials, and dimensions. The platform will have various features such as product inventory, user sessions, shopping carts, purchase history, activity logs, and personalized recommendations.

2. Describe the subject area - tell about your project, its participants, and the tasks inside the project.

3. Write the functional requirements for the project.

4. Separate the parts of your project so that each of the listed database systems (PostgreSQL, Redis, MongoDB, Cassandra, Neo4j) will be used at least once.

Example:

Product inventory – PostgreSQL,
User accounts – PostgreSQL,
User sessions – Redis,
Shopping carts – Redis,
Purchase history – MongoDB,
Activity logs – Cassandra,
Personalized recommendations – Neo4j,
Caching frequently accessed data – Redis.

5. Write the arguments for each choice.

Submission:
- Submit the topic to BRUTE in the form of a text document (.txt)
- Insert your topic into the DS2 topics table
- Deadline: Sunday 6. 10. 2024 until 23:59

HW1: Relational database (PostgreSQL)

Points: 2 (+1 bonus: 0.5 for triggers, 0.5 bonus for JSON types)
Assignment:

1. Create the first block of the project. It must be a relational database. Create an ER model and tables.

2. Fill these tables with data. The data must be realistic.

3. Create indexes to speed up common queries.

4. Extra: use triggers for insert, update, etc.

5. Extra: use JSON types and write queries to process data stored in a JSON object (use @> and ->).

Review the subsequent homework assignments and prepare data suitable for all of them. The main table must have over 50 rows.

Submission:
- Upload to BRUTE: .sql file with SQL-queries, image with ER-model, image with a screenshot.
- Execute the queries on the nosql.felk.cvut.cz server. Do not forget to copy the .sql file to ~/assignments/postgresql/ folder and execute the homework submission script!
Tools:
- PostgreSQL 16.4 (installed on the NoSQL server)
Remote access to PostgreSQL:
- Address: nosql.felk.cvut.cz.
- Default port (5432)
- login = database = f24_username
- password: initial password from email
References:
- PostgreSQL Documentation
- JSON Types in PostgreSQL
Deadline: Sunday 20. 10. 2024 until 23:59

HW2: Redis

Points: 3 (+0.5 p. for RediSearch +0.5 p. for RedisJSON)
Assignment:

Caching using Redis – 1 point
- Prepare at least 3 SQL queries to the database (PostgreSQL). It is assumed that these will be queries frequently asked by users (e.g., search for goods by some criterion).
- Use caching with Redis to speed up the execution of these queries. The database is queried if there are no query results in the cache (with an appropriate notification). If the query is already in the cache, the result is issued directly from the cache.
- Each query should be called three times, and each call should display a notification stating whether the result came from the database or from the cache.
- When testing, I use an empty Redis database and your PostgreSQL database with its existing data.
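
The caching flow described above is commonly called the cache-aside pattern. The following is only a minimal sketch of it, not a reference solution. It assumes a cache client exposing `get` and `setex` (redis-py's `redis.Redis` has methods with these names) and a hypothetical `run_query` callable that executes SQL against PostgreSQL. The `DictCache` stand-in, the `cache:sql:` key prefix, the TTL and the sample rows are illustrative assumptions.

```python
import hashlib
import json


class DictCache:
    """In-memory stand-in exposing the two calls used below (get, setex).

    Assumption: in a real script this would be a redis.Redis client,
    which offers methods with the same names. Expiry is ignored here.
    """

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def setex(self, key, ttl_seconds, value):
        self._data[key] = value


def cached_query(cache, run_query, sql, ttl_seconds=60):
    """Return (rows, source), where source is 'cache' or 'database'."""
    # Hypothetical key scheme: a fixed prefix plus a hash of the SQL text.
    key = "cache:sql:" + hashlib.sha256(sql.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit), "cache"
    rows = run_query(sql)  # cache miss: ask the database
    cache.setex(key, ttl_seconds, json.dumps(rows))
    return rows, "database"


def demo():
    """Call the same query three times and report where each result came from."""
    cache = DictCache()
    # Placeholder for a real PostgreSQL call; rows are made-up sample data.
    fake_db = lambda sql: [["oak table", 120], ["pine chair", 45]]
    sources = []
    for _ in range(3):
        rows, source = cached_query(cache, fake_db, "SELECT name, price FROM product")
        print(f"result from {source}: {rows}")
        sources.append(source)
    return sources
```

Calling `demo()` queries the placeholder database once and serves the two following calls from the cache. Passing the cache and the query function as arguments keeps the logic independent of any particular driver.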

Redis data types – 2 points (+1 bonus)
Use Redis to organize a user's shopping cart, user sessions, etc., and perform all the following operations:
- Strings: 5 insertions (SET), 1 read (GET), 1 update (APPEND, SETRANGE, INCR, …), 1 removal (DEL).
- Lists: 5 insertions (LPUSH, RPUSH, …), 2 different reads (LPOP, RPOP, LINDEX, LRANGE), 1 removal (LREM).
- Sets: 5 insertions (SADD), 2 different reads (SISMEMBER, SUNION, SINTER, SDIFF), 1 removal (SREM), 1 SCARD.
- Sorted sets: 5 insertions (ZADD), 1 read (ZRANGE, ZRANGEBYSCORE), 1 update (ZINCRBY), 1 removal (ZREM, ZREMRANGEBYSCORE), 1 ZCARD or ZCOUNT.
- Hashes: 5 insertions (HSET), 2 different reads (HGET, HMGET, HKEYS, HVALS, …), 1 removal (HDEL).
- Geographic coordinates: 5 insertions, 1 GEOSEARCH, 1 GEODIST.
- Extra: Add the necessary data, create an index, and implement at least one nontrivial search and one nontrivial aggregation query using RediSearch. Describe in natural language the inserted data and queries for it.
- Extra: Add the necessary data, create an index, and implement at least three different nontrivial search queries using RedisJSON. Describe in natural language the inserted data and queries for it.

Add comments to your script using the ECHO command
Describe at least the intended structure of your keys and values in natural language

Submission:
- script_sql.txt: text file with SQL queries
- script_pg.py: Python script
- script.txt: text file with Redis database commands
- Submit to BRUTE these three files and a screenshot of the execution on the NoSQL server

Execution:

Execute the following shell command to evaluate the whole Redis script
```
cat $ScriptFile | redis-cli -p $Port -a $Password
```
- $ScriptFile is a file with Redis commands to be executed, i.e. script.txt
- $Password is the initial password (see the first e-mail)

If you cannot connect to your Redis instance remotely, create an SSH tunnel first:
```
ssh -L $Port:127.0.0.1:$Port $Username@nosql.felk.cvut.cz
```
- Then from your local device you can connect to Redis on localhost (127.0.0.1) using your port and password, e.g.:
```
redis-cli -h 127.0.0.1 -p $Port -a $Password
```
Your script will be tested on the server. Before submitting, you should run it and check if it works.

Tools:
- Redis 7.4.0 (installed on the NoSQL server)
References:
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 3. 11. 2024 until 23:59.

HW3: MongoDB

Points: 2.5 (one collection) or 3 (two collections)
Assignment:
- Implement the project block with MongoDB
- Explicitly create one or two collections (for example, one collection for orders or two collections: orders and order_items)
  - I.e., create it using createCollection method
- Insert 10 documents into each collection
  - These documents must be realistic, non-trivial, and with both embedded objects and arrays
  - Interlink the documents using references
  - Use both insertOne and insertMany
- Express update operations:
  - Replace the value of the document
  - Update an individual field of one document
  - Update an individual field in documents by criterion; use a formula or increase/multiply the value
  - Add and remove a field in the document(s)
  - Modify a nested document
  - Replace, add, and delete a single array element
  - Use the upsert mode at least once
- Express at least 5 find queries (with non-trivial selections, i.e. the first parameter cannot be empty)
  - Use at least one logical operator ($and, $or, $not)
  - Use $elemMatch operator on array fields at least once
  - Use both positive and negative projections (each at least once)
  - Use sort modifier
  - The search criteria should contain (each item at least once):
    - matching the value of one or more fields
    - condition on the value of one or more fields (e.g., containing greater than or less than, etc.)
    - comparing dates or parts of dates (for example, all orders for a given date or all orders for a given year)
    - matching the whole array
    - searching for a value in the array
  - Describe the real-world meaning of all your queries in comments
- Express 5 aggregate operations. Use 2 created collections for entities of different types. If necessary, insert more documents into each one of them
  - Use at least once each of $match, $group, $sort, $project (or $addFields), $skip and $limit stages
  - Use at least once each of $sum (or $avg), $count, $min (or $max), $first (or $last) aggregators
  - Use $lookup (if you have two collections)
  - Describe the real-world meaning of all your queries in comments
Requirements:
- All queries must be realistic in the context of your project
- Call export LC_ALL=C in case you have difficulties in launching the mongo shell
- Only use your own database when working on the assignment
  - The name of this database must be identical to your login name (f24_login)
- Do not switch to your database when you are inside your script
  - I.e. do not execute the use database command or db.getSiblingDB('database')
  - Specify the intended database outside your script using command line options (see below)
- Note that a different dedicated database will be used when assessing your homework
  - You can assume that this database will be empty at the beginning
- Print the output of your queries (find and aggregate operations)
  - Use db.collection.find().forEach(printjson); approach for this purpose
- Always end queries with a semicolon ;.
- find and aggregate queries should always return a non-empty result.
Submission:
- script.js: JavaScript script with MongoDB database commands
Execution:
- Execute the following shell command to evaluate the whole MongoDB script

mongosh --port 42222 -u $login -p $password $database $file

- $login is your username, e.g. f24_login
- $database is the database to connect to (same as the login)
- $password is your password (the same one you received for your account at the beginning of the semester)
- $file is a file with MongoDB queries to be executed, i.e. script.js
- Note the double dash before port (--port)
Tools:
- MongoDB 7.0.14 (installed on the NoSQL server)
References:
- https://docs.mongodb.com
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 24. 11. 2024 until 23:59

HW4: Cassandra

Points: 3
Assignment:
- Implement the project block with Cassandra
- Define a schema for a table
  - Define at least one column for each of the following data types: tuple, list, set and map
- Insert 10 rows into your table
- Express at least 3 update statements
  - You must perform replace, add and remove primitive operations (all of them) on columns of all collection types (all of them)
  - I.e. you must involve at least altogether 9 different primitive operations on such columns
- Express 3 select statements to retrieve data
  - Use WHERE and ORDER BY clauses at least once (both of them)
  - Use ALLOW FILTERING in a query exactly once
- Create and use at least 1 secondary index
- Create a materialized view
- Write 3 queries to obtain statistical information. Use the materialized view at least once
- Add comments with descriptions to all queries
Requirements:
- Only use your own keyspace when working on the assignment
  - The name of this keyspace must be identical to your login name (f24_login)
  - Do not create this keyspace in your script (assume it already exists)
- Do not switch to your keyspace when you are inside your script
  - I.e. do not execute a USE command to change the active keyspace from within the script
  - Specify the intended keyspace outside your script using command line options (see below)
- Note that a different dedicated keyspace will be used when assessing your homework
  - You can assume that this keyspace will be empty at the beginning
Comments:
- The following error messages can be ignored:

Error from server: code=1300 [Replica(s) failed to execute read]...

Submission:
- script.cql: text file with CQL statements
Execution:
- Execute the following shell command to evaluate the whole CQL script

cqlsh -u $username -p $password -k $KeyspaceName -f $ScriptFile

$KeyspaceName is the name of the keyspace that should be used (must already exist), e.g. f24_login
$ScriptFile is a file with CQL queries to be executed, i.e. script.cql
Tools:
- Apache Cassandra 4.1.6 (installed on the NoSQL server)
References:
- The Cassandra Query Language (CQL)
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script!
Deadline: Sunday 1. 12. 2024 until 23:59

HW5: Neo4j

Points: 3
Assignment:
- Implement the project block with Neo4j
- Insert realistic nodes and relationships into your embedded Neo4j database
  - Use a single CREATE statement for this purpose
  - Insert altogether at least 10 nodes for entities of at least 2 different types (i.e. different labels)
  - Insert altogether at least 15 relationships of at least 2 different types
  - Include properties (both for nodes and relationships)
  - Associate all your nodes with user-defined identifiers
- Create index and constraint, demonstrate and describe their usage
- Express 5 Cypher read query expressions
  - Use at least once MATCH, OPTIONAL MATCH, RETURN, WITH, WHERE, and ORDER BY (sub)clauses (all of them)
  - Use comparison and logical operators
  - Use pattern conditions (EXISTS, IN)
  - Use string operations (CONTAINS, STARTS WITH)
  - Use regular expressions
  - Use different aggregations at least twice
  - Use size(), collect() at least once
  - Use variable length paths at least once
  - Perform date/time interval queries
  - Use list operations, e.g. use functions like head(), last(), tail(), and reduce()
  - Use CASE expressions
  - Use subqueries
- Find the shortest path between two nodes
- Express 5 Cypher write or read/write query expressions
  - Use at least once CREATE (with MATCH), DELETE, SET, REMOVE, DETACH (sub)clauses (all of them)
    - Update single node
    - Update single relationship
    - Add or remove properties
    - Update multiple nodes on criteria
Requirements:
- Describe the meaning of your Cypher expressions in natural language (via // comment)
Submission: BRUTE and NoSQL server
- queries.cypher: text file with a sequence of Cypher statements (including CREATE) and screenshots of execution
Execution:
- Execute the following shell command to evaluate the whole Neo4j script
```
cypher-shell -f $ScriptFile
```
- $ScriptFile is a file with Cypher queries to be executed, i.e. queries.cypher
Tools:
- Neo4j 5.23.0 (Java 21 *) (installed on the NoSQL server)
References:
- Cypher query language
- Cypher Reference Card
Deadline: Sunday 15. 12. 2024 until 23:59

HW6: MapReduce

Points: 3 (this assignment is not a part of the project; all students must submit individual files)
Assignment:
- Create an input text file. This file must be large and can contain data from your project or third-party information.
  - Put each entity on a separate line, i.e. assume that each line of the input file yields one input record
  - Organize the actual entity attributes in whatever way you can easily parse
- Implement a non-trivial MapReduce job
  - Choose from aggregation, grouping, filtering or any other general MapReduce usage pattern
  - Use WordCount.java source file as a basis for your own implementation
  - Both the Map and Reduce functions should be non-trivial, each about 10 lines of code
  - It is not necessary to implement the Combine function
- Comment the source file and also provide a description of the problem you are solving
- You may also create a shell script that allows for the execution of your entire MapReduce job
  - I.e. compile source files, deploy input file, execute the actual job, retrieve its result, …
  - However, this script is not supposed to be submitted and serves just for your own convenience
  - Even if you do so, it will not be used for the purpose of homework assessment in any way
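
To make the expected data flow concrete, here is a plain-Python simulation of the map, shuffle and reduce steps over line-based records. It is only a sketch of the pattern and not Hadoop code: the actual submission must be written in Java on the basis of WordCount.java. The record layout `order_id;category;amount` and all function names are hypothetical.

```python
from collections import defaultdict


def map_record(line):
    """Map: parse one input line and emit (key, value) pairs.

    Hypothetical record layout: order_id;category;amount
    Malformed lines are skipped instead of failing the whole job.
    """
    parts = line.strip().split(";")
    if len(parts) != 3:
        return []
    _, category, amount = parts
    try:
        return [(category, float(amount))]
    except ValueError:
        return []


def reduce_group(key, values):
    """Reduce: aggregate all values that were emitted for one key."""
    values = list(values)
    return key, {"count": len(values), "total": sum(values), "max": max(values)}


def run_job(lines):
    """Simulate the map -> shuffle (group by key) -> reduce pipeline locally."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return dict(reduce_group(k, v) for k, v in sorted(groups.items()))
```

A local simulation like this can also be used to produce the expected output for a small input sample before running the real job on the cluster.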
Requirements:
- You may split your MapReduce job implementation into multiple Java source files
  - They all must be located in the submission root directory
  - At least MapReduce.java source file with its public MapReduce class is required
  - This class is expected to represent the main class of the entire MapReduce job
- Do not change how command-line arguments are processed
  - I.e. the only two arguments represent the input and output HDFS locations respectively
- Do not use packages in order to organize your Java source files
- Assume that only hadoop-common-3.1.1.jar and hadoop-mapreduce-client-core-3.1.1.jar libraries will be linked with your project
- Do not submit your NetBeans (or any other) project directory, and do not submit Hadoop (or any other) libraries
- Use Java Standard Edition version 7 or newer
- You are free to use your /user/f24_login/ HDFS home directory for debugging
  - Homework assessment will take place in a different dedicated HDFS directory
Submission (NoSQL server):
- readme.txt: description of the input data structure and objective of the MapReduce job
- input.txt: text file with your sample input data (i.e. only one input file is permitted)
- MapReduce.java and possibly additional *.java: Java source files with your MapReduce implementation
- output.txt: expected output of your MapReduce job
- Upload to BRUTE: readme.txt and screenshot (or screenshots) of the execution of your homework on the NoSQL server.
Tools:
- Apache Hadoop 3.4.0 (installed on the NoSQL server)
References:
- HDFS: Hadoop File System Shell commands
- MapReduce: MapReduce Tutorial
- MapReduce: MapReduce Commands Guide
- Hadoop: Hadoop JavaDoc API Documentation
Server: nosql.felk.cvut.cz
- Do not forget to execute the homework submission script and make a screenshot!
Deadline: Sunday 22. 12. 2024 until 23:59

Bonus assignment #1 (BA1)

Analysis of Sharding and Replication Strategies in NoSQL Databases (+3 p.)

For each of the following sharding strategies (lecture 3):
- Hash sharding
- Range-based sharding
- Directory-based sharding
- Consistent hash sharding
- Entity/relationship-based sharding
- Geography-based sharding
- Time-based sharding
- Functional sharding
- Combined sharding.
Complete the following tasks:
- Create a table listing the advantages and potential issues of the strategy.
- Formulate the conditions under which this strategy would be most effective.
- Provide a specific example of data distribution using this strategy.
- Create a graphical illustration of this example (diagram, chart, or drawing).
- Propose a suitable replication strategy to complement this sharding strategy and briefly justify your choice.
Prepare a brief conclusion comparing the effectiveness of the examined strategies in various NoSQL database usage scenarios.
Deadline: Sunday 22. 12. 2024 until 23:59.
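
As an illustration of what a specific example of data distribution could look like, the sketch below contrasts two of the listed strategies, hash sharding and consistent hash sharding, and measures how many keys change shard when a shard is added. It is a simplified model under stated assumptions: the shard names, the number of virtual nodes and the use of SHA-256 as the hash function are arbitrary choices and are not taken from the lecture.

```python
import bisect
import hashlib


def stable_hash(value):
    """Deterministic integer hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def hash_shard(key, shard_count):
    """Hash sharding: shard index = hash(key) mod number of shards."""
    return stable_hash(key) % shard_count


class ConsistentHashRing:
    """Consistent hash sharding with virtual nodes (several ring points per shard)."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(keys, before, after):
    """Share of keys whose shard differs between two placement functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)
```

For example, comparing `moved_fraction` for `hash_shard` with 3 versus 4 shards against two `ConsistentHashRing` instances with 3 versus 4 shards shows the trade-off: with modulo placement most keys are relocated, while on the ring only the keys taken over by the new shard move.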

Bonus assignment #2 (BA2)

Multi-Database Integration with Redis (+2 p.)

Implement a practical data synchronization system between your primary database (PostgreSQL/MongoDB) and Redis, focusing on performance optimization for frequently accessed data.

Data Migration & Storage (1 point)
Create a Python script that:
- Extracts relevant data from your primary database (PostgreSQL/MongoDB)
- Transforms the data into appropriate Redis data structures (Hash tables or RedisJSON)
- Loads the data into Redis with proper key naming conventions
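
A minimal sketch of the transform and load steps, assuming each extracted row is available as a Python dict and that the target structure is a Redis hash. The `entity:id` key scheme, the `id` column name and the helper names are illustrative assumptions. `client` is assumed to be a redis-py client, whose `hset` accepts a `mapping` argument.

```python
def redis_key(entity, entity_id):
    """Build a colon-separated key such as 'product:42' (a common Redis naming convention)."""
    return f"{entity}:{entity_id}"


def row_to_hash(entity, row, id_column="id"):
    """Transform one database row (dict) into (key, field mapping) for a Redis hash.

    Redis hash values are flat strings, so None values are dropped and all
    other values are converted with str(). Nested data would need RedisJSON.
    """
    key = redis_key(entity, row[id_column])
    mapping = {
        col: str(val)
        for col, val in row.items()
        if col != id_column and val is not None
    }
    return key, mapping


def load_rows(client, entity, rows):
    """Write every row with HSET; rows without any non-null field are skipped."""
    for row in rows:
        key, mapping = row_to_hash(entity, row)
        if mapping:  # redis-py rejects HSET calls without any field/value pair
            client.hset(key, mapping=mapping)
```

Keeping the transformation in a pure function such as `row_to_hash` makes it easy to check the key naming without a running Redis instance.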

Search & Query Implementation (0.5 points)
- Implement three high-performance queries using RediSearch and/or RedisJSON features
- Document the performance comparison between Redis and primary database queries

Data Synchronization (0.5 points):
Implement a synchronization mechanism that:
- Detects changes in the primary database
- Updates corresponding Redis data structures
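
One simple way to detect changes is to poll the primary database using a watermark. The sketch below assumes every row carries an `updated_at` value that the primary database maintains (for example through a trigger). That column name and both callables are hypothetical. Push-based mechanisms such as PostgreSQL LISTEN/NOTIFY or MongoDB change streams are alternatives that this sketch does not cover.

```python
def sync_changes(fetch_changed_since, apply_to_cache, last_seen):
    """Poll-based synchronization using an 'updated_at' watermark.

    fetch_changed_since(ts) -> iterable of row dicts with row['updated_at'] > ts
    apply_to_cache(row)     -> writes the row into Redis (e.g. with HSET)
    Returns the new watermark, to be passed to the next call.
    """
    newest = last_seen
    for row in fetch_changed_since(last_seen):
        apply_to_cache(row)
        newest = max(newest, row["updated_at"])
    return newest
```

Calling `sync_changes` periodically and feeding each returned watermark into the next call propagates only rows modified since the previous run. Deleted rows are not detected by this approach and would need separate handling.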

Submission:
- script_bonus.py with Python script.
- Submit to BRUTE script_bonus.py file and a document with a description.

Deadline: Sunday 29. 12. 2024 until 23:59.

Individual Topics

Please enter the team number, username, and topic in the DS2 topics table.
The following topics are not allowed this semester:
- Movies, actors, furniture e-shop
