====== T4a-rl - Reinforcement Learning ======
|**Due date** | January 03, 2026, 23:59 PST |
|**Deadline** | January 11, 2026, 23:59 PST |
|**Points** | 5 |
|**Label in BRUTE** | t4a-rl |
|**Files to submit** | archive with ''evaluator.py'' (mandatory), ''agent.msh'' (mandatory), and ''report.pdf'' (optional) |
|**Resources** | {{ :courses:uir:hw:b4m36uir_24_t4_resource_pack.zip |}}|
----
==== Introduction ====
The inchworm robot considered for this assignment is inspired by the morphology of snakes and caterpillars.
It consists of four servomotors acting in the same plane to produce 1D motion using flexible scale-like contact friction pads with directional frictional anisotropy: there is a preferred direction in which the friction coefficient is lower than in the opposite direction.
This frictional anisotropy enables the robot to anchor down one end while the other reaches a new pose.
{{ :courses:uir:hw:t4-inchworms-comparison.jpg?600 |}}
The inchworm locomotion is organized as a repeated motion pattern called a //gait//.
There are several hand-designed gaits, such as the //sliding gait//, where frictional anisotropy is utilized for basic locomotion consisting of two gait phases (contraction and extension).
The sliding gait can be further extended by weight manipulation in the //balancing gait// via additional balancing phases (balance forward and balance backward).
Finally, the contact friction pads can be turned off and on using the rigid robot frame instead of the flexible scales when moving forward, as exploited in the //stepping gait// that adds numerous new gait phases (front up, front down, back up and back down).
A well-performing gait should integrate the locomotion approaches of these gaits.
| {{ :courses:uir:hw:t4-sliding-gait.gif?600 |}} |
| Sliding gait |
| {{ :courses:uir:hw:t4-balancing-gait.gif?600 |}} |
| Balancing Gait example 1 |
| {{ :courses:uir:hw:t4-balancing-gait-tuned.gif?600 |}} |
| Balancing Gait example 2 |
| {{ :courses:uir:hw:t4-stepping-gait.gif?600 |}} |
| Stepping Gait example 1 |
| {{ :courses:uir:hw:t4-stepping-gait-tuned.gif?600 |}} |
| Stepping Gait example 2 |
==== Assignment ====
In this task, you aim to design a suitable gait for an inchworm robot using reinforcement learning.
To do so, you will implement two functions: a reward function and an absorbing state classification.
Both functions use the inchworm robot state interface, outlined in ''inchworm_interface.py'' and described further [[courses:uir:hw:t4a-rl#Inchworm Interface Description|below]].
The reward function computation (''compute_reward(self, inchworm_state: InchwormInterface) -> float'' in ''evaluator.py'') uses the inchworm robot state to score it with a float value, i.e., desirable robot states should receive a higher value than undesirable ones.
Desirable states enable the robot to move forward as fast as possible while touching the ground only with the scales and bumpers; hence, for example, the reward may be based on the forward velocity in centimetres per second, with a fixed penalty for touching the ground with parts other than the scales and bumpers. For additional insight into possible reward function components, see [[courses:uir:hw:t4a-rl#Observations and Hints|Observations and Hints]] in the Appendix.
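A minimal sketch of such a reward function is shown below; the velocity indexing (angular components first, then linear, per the interface description) and the penalty magnitude are illustrative assumptions, not a reference solution.
<code python>
def compute_reward(self, inchworm_state: InchwormInterface) -> float:
    # Forward (x-axis) linear velocity of the front and back servomotors;
    # get_part_velocity returns angular velocities followed by linear velocities,
    # so index 3 is assumed to be the linear velocity along the x-axis.
    v_front = inchworm_state.get_part_velocity("servo-0")[3]
    v_back = inchworm_state.get_part_velocity("servo-3")[3]
    reward = 100.0 * 0.5 * (v_front + v_back)  # average forward velocity in cm/s

    # Fixed penalty when any non-contacting part touches the ground.
    if inchworm_state.is_touching("ground", "no-touch"):
        reward -= 1.0  # arbitrary starting value for the penalty

    return reward
</code>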
The absorbing state classification (''is_absorbing(self, inchworm_state: InchwormInterface) -> bool'', also in ''evaluator.py'') uses the inchworm robot state to decide whether the state should be considered absorbing, i.e., hard or impossible for the robot to recover from, or otherwise undesirable.
For additional insight into possible absorbing state classification, see [[courses:uir:hw:t4a-rl#Observations and Hints|Observations and Hints]] in the Appendix.
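As a minimal illustration, an absorbing state classifier based solely on the joint angle range suggested in [[courses:uir:hw:t4a-rl#Observations and Hints|Observations and Hints]] could look as follows; the 70-degree threshold is only an example starting point.
<code python>
def is_absorbing(self, inchworm_state: InchwormInterface) -> bool:
    # Treat the state as absorbing when any joint leaves a reasonable angle range
    # (about 70 degrees, see Observations and Hints); purely an illustrative sketch.
    for joint in ("joint-0", "joint-1", "joint-2", "joint-3"):
        angle = inchworm_state.get_joint_position(joint)  # degrees by default
        if angle is not None and abs(angle) > 70.0:
            return True
    return False
</code>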
The designed functions are used during policy learning: the Soft Actor-Critic algorithm implemented in the Mushroom-RL framework evaluates the inchworm states during the simulation runs executed by the MuJoCo simulator.
After each simulation step, the robot state is evaluated by the reward function followed by the absorbing state detection function to steer the training algorithm.
=== Inchworm Robot Parts ===
The robot frame consists of 4 servomotors, referred to as ''servo-0'' to ''servo-3'' (numbering starts from the robot front), two stiff bumpers called ''bumper-front'' and ''bumper-back'' and depicted in blue, two contact scales called ''scales-front'' and ''scales-back'' and depicted in grey, two short brackets called ''bracket-front'' and ''bracket-back'' and depicted in yellow, and finally the double-sided bracket called ''bracket-middle'' depicted in green.
{{ :courses:uir:hw:t4-inchworm-render.png?800 |}}
=== Inchworm Interface Description ===
The state of the inchworm robot can be accessed via six functions: three focus on the robot parts, two focus on the joint states, and one focuses on the collisions.
The robot is located on the infinite xy plane with a forward direction identical to the x-axis and joint axes parallel to the y-axis.
The simulation is set to operate in meters.
The ''get_part_position(self, part_name : str) -> "np.array | None"'' takes a part name (defined in the Inchworm Robot Parts section using monospace text) and returns the part position vector [x,y,z] in the world coordinates.
inchworm_state.get_part_position("bracket-front") # gets the position of the front bracket
inchworm_state.get_part_position("servo-0") # returns the position of the front servomotor
The ''get_part_rotation(self, part_name : str, degrees : bool = True) -> "np.array | None"'' takes a part name and returns the part XYZ Euler angles in the [[https://en.wikipedia.org/wiki/Euler_angles#Conventions_by_intrinsic_rotations|intrinsic convention]] within the part frame.
Note that all part coordinate frames follow the [[https://en.wikipedia.org/wiki/Denavit%E2%80%93Hartenberg_parameters|D-H notation]], i.e. all part coordinate frames are aligned with the associated joints such that the x-axis always points towards the closest joint axis and the joint axes coincide with the z-axes.
The returned values are in degrees but can be changed to radians by setting the optional argument ''degrees=False''.
inchworm_state.get_part_position("bracket-middle") # gets the rotation of the middle bracket
inchworm_state.get_part_position("servo-1") # returns the rotation of the second servomotor
The ''get_part_velocity(self, part_name : str) -> "np.array | None"'' takes a part name and returns the vector of angular velocities around respective axes in radians per second, followed by the linear velocity in meters per second.
inchworm_state.get_part_position("bumper-back") # gets the velocity of the middle bracket
inchworm_state.get_part_position("servo-3") # returns the velocity of the back-most servomotor
The ''get_joint_position(self, joint_name : str, degrees : bool = True) -> "float | None"'' takes a joint name (''joint-0'', ''joint-1'', ''joint-2'', ''joint-3'') and returns the joint position (rotation around its shaft). The joint rotation is in degrees but can be changed to radians by setting the optional argument ''degrees=False''.
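inchworm_state.get_joint_position("joint-0") # gets the rotation of the front joint in degrees
inchworm_state.get_joint_position("joint-2", degrees=False) # returns the rotation of the third joint in radians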
The ''get_joint_velocity(self, joint_name : str) -> "float | None"'' takes a joint name and returns joint rotation speed in radians per second.
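inchworm_state.get_joint_velocity("joint-1") # gets the rotation speed of the second joint in radians per second
inchworm_state.get_joint_velocity("joint-3") # returns the rotation speed of the back-most joint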
Finally, the ''is_touching(self, part_name_1 : str, part_name_2 : str) -> "bool | None"'' takes two parts' names and returns whether they are touching each other (colliding). Moreover, the ''ground'' can be used to check whether a part is touching the ground, and the ''no-touch'' can be used to decide whether any non-contacting part is touching another provided part.
inchworm_state.is_touching("ground", "bumper-front") # check for collision between ground and front bumper
inchworm_state.is_touching("ground", "no-touch") # check for collision between ground and non-contacting parts
Note that any of the abovementioned functions returns ''None'' whenever the given name is invalid.
{{ :courses:uir:hw:t4-inchworm-mujoco.png?500 |}}
=== Training ===
Assuming that the (virtual) environment is set up as described in the Appendix below, the ''main.py'' enables interaction with the reinforcement learning simulation setup using the program arguments.
To enumerate available program arguments, run:
python3 main.py --help
For initial training with visualization enabled, run:
python3 main.py
The default arguments run 200 training epochs consisting of a training process of 5 minutes (300 seconds) followed by 30 seconds of performance evaluation.
A new directory labelled based on the current time is created for each run with a copy of relevant sources.
After each epoch, the current agent's policy is saved as ''checkpoint-N.msh'' with the additional metadata stored in ''checkpoint-N.metadata'' so the training can be interrupted and continued later.
The checkpoint metadata contains information about performance during the training epoch: the (discounted) //reward//, total distance, the real-time epoch //duration// in seconds, real-time factor, and approximate evaluation in BRUTE (see [[courses:uir:hw:t4a-rl#Evaluation|Evaluation]]).
To run the training headless (without visualization during evaluation), run:
python3 main.py --render=""
To observe an already trained agent policy located in ''path/to/agent.msh'', run:
python3 main.py --load_agent="true" --agent_path_to_load="path/to/agent.msh" --render="true" --n_epochs=0
Note that on an average-performing CPU, each training epoch runs with a real-time factor of about 2, meaning that one second of simulation time takes about two seconds of real time; a policy usually takes about 6 to 8 hours of simulation time (roughly 12 to 16 hours of real time) to train.
To further train a selected agent policy located in ''path/to/agent.msh'', run:
python3 main.py --load_agent="true" --agent_path_to_load="path/to/agent.msh" --render=""
=== Evaluation ===
The proposed reward function and absorbing state classification (''evaluator.py'') and the trained agent's policy (''agent.msh'') must be submitted.
The trained agent's policy is run for 30 seconds by the evaluation system.
The distance travelled by the backmost servomotor (''servo-3''), averaged over the last ten simulation steps, determines the distance travelled by the robot.
* A non-negative distance is awarded a single point (denoted ''brute_points_for_forward'' in the metadata files).
* Moreover, each 5 cm travelled is awarded an additional point (denoted ''brute_points_for_distance'' in the metadata files).
* On top of that, if only the scales and bumpers touched the ground, an additional point is awarded (denoted ''brute_points_for_touching'' in the metadata files).
The above points are summed, and up to 5 points are assigned to the simulation run (denoted ''brute_total_points'' in the metadata files).
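For illustration only, the per-run scoring described above roughly corresponds to the following sketch (an approximation of the rules, not the official evaluation code); for example, a run travelling 12 cm while touching the ground only with the scales and bumpers would yield 1 + 2 + 1 = 4 points.
<code python>
def approximate_run_points(distance_m: float, only_scales_and_bumpers_touched: bool) -> int:
    # Illustrative approximation of the per-run scoring, not the official BRUTE evaluator.
    points = 0
    if distance_m >= 0.0:
        points += 1                        # brute_points_for_forward
        points += int(distance_m / 0.05)   # brute_points_for_distance: one point per 5 cm
    if only_scales_and_bumpers_touched:
        points += 1                        # brute_points_for_touching
    return min(points, 5)                  # brute_total_points, capped at 5
</code>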
Ten simulation runs are executed, and the median of the points achieved is assigned as the final score.
Aside from the policy evaluation, the student can provide ''report.pdf'' to summarize the suggested approach and report on its performance. The report is awarded up to a single point.
As the task was not prepared before the beginning of the semester and the lecture covering reinforcement learning was cancelled, only **1 point is required** for this task to be considered finished.
==== Appendix ====
=== Installation (Ubuntu LTS >= 20.04) ===
* The assignment setup was designed for Python 3.10 or Python 3.11, so verify the installed version first.
python3 --version
* Install the required version, if necessary.
sudo apt update
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt install python3.10-full -y
* To separate the assignment from other Python packages managed by pip, install the virtual environment package and create a new virtual environment.
sudo apt install python3-pip -y
pip3 install virtualenv --upgrade
virtualenv inchworm_rl_venv --python=python3.10
* Enter the newly created virtual environment and install the required dependencies.
source inchworm_rl_venv/bin/activate && pip3 install -r requirements.txt
The previously mentioned steps are summarized in the provided ''install-venv.sh''.
Venv may cause compatibility issues on machines already running Conda; hence, either use Conda instead of Venv or deactivate Conda completely when using Venv.
=== Familiarizing with Assignment Setup ===
To familiarize yourself with the simulator setup, it is recommended that you use the MuJoCo simulator outside the reinforcement learning pipeline by following these steps.
- Download MuJoCo from the [[https://github.com/google-deepmind/mujoco/releases|GitHub releases]] page and unpack it.
- Open the MuJoCo simulator (run ''bin/simulate.sh'' in the MuJoCo unpacked archive root directory).
- Add ''inchworm.xml'' from the ''model'' directory by dragging and dropping it into the MuJoCo window.
Then you are free to
* Explore the joint positions in the second //Control// card in the right column,
* Show visual elements by the ''4'' key,
* Hide the collision elements by the ''1'' key.
Note that the visual elements are for display only and play no role during training.
Examine the robot part names under the //Rendering// tab in the left column:
* Show the part names used by ''is_touching'' by selecting //Label -> Geom//,
* Show the part coordinate frames used to get position, rotation, and velocity by selecting //Frame -> Body//,
* Show the part names used to get position, rotation, and velocity by selecting //Label -> Body//.
=== Observations and Hints ===
Firstly, it is advised to design the reward function structure by defining its components and assigning them similar weights. The components' weights can be modified in later stages of the reward function design if the observed behaviour is undesirable.
The inchworm locomotion aims to be as fast as possible. Hence, forward motion should be promoted, for example, by using the average forward (x-axis) velocity of the first and last servomotors (''servo-0'' and ''servo-3'') as a part of the reward function.
A desired locomotion uses only the scales and bumpers when moving forward. Hence, any other robot parts in contact with the ground should be penalized, for example, using a fixed penalty and the ''is_touching'' interface function. Moreover, when a state approaches an undesired state, an additional proportional penalty can be applied, for example, when ''joint-1'' or ''joint-2'' gets closer to the zero angle from the respective (positive or negative) direction.
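A possible sketch of such a proportional penalty is given below; the 15-degree threshold and the scaling are arbitrary starting points, not prescribed values.
<code python>
def joint_proximity_penalty(self, inchworm_state: InchwormInterface) -> float:
    # Illustrative proportional penalty that grows as joint-1 or joint-2 approaches
    # the zero angle; the threshold and scaling are arbitrary starting points.
    penalty = 0.0
    for joint in ("joint-1", "joint-2"):
        angle = inchworm_state.get_joint_position(joint)  # degrees by default
        if angle is not None and abs(angle) < 15.0:
            penalty -= (15.0 - abs(angle)) / 15.0  # approaches -1 as the angle reaches zero
    return penalty
</code>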
It can be observed in the hand-tuned gaits that scales and bumpers in an //approximately horizontal// configuration do not move forward, as they are expected to anchor the robot in place. Hence, their movement in the approximately horizontal position should be penalized, for example, proportionally to the forward speed.
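A sketch of such an anchoring penalty for the front end is shown below; it assumes, purely for illustration, that the front scales are approximately horizontal when ''joint-0'' is close to zero degrees, and that index 3 of ''get_part_velocity'' is the forward linear velocity.
<code python>
def front_anchoring_penalty(self, inchworm_state: InchwormInterface) -> float:
    # Illustrative assumption: the front scales are approximately horizontal when
    # joint-0 is close to zero degrees (the 10-degree threshold is arbitrary).
    joint_0 = inchworm_state.get_joint_position("joint-0")  # degrees by default
    forward_speed = inchworm_state.get_part_velocity("scales-front")[3]  # assumed linear x velocity
    if abs(joint_0) < 10.0 and forward_speed > 0.0:
        return -forward_speed  # penalty proportional to the forward speed while anchoring
    return 0.0
</code>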
It can also be observed that the bumper, scales, and respective servo move forward whenever they are //significantly rotated// (in other words, when only the bumper is touching the ground). Hence, this forward movement, when rotated, should be rewarded.
Finally, in a well-performing gait, the transition between //approximately horizontal// and //significantly rotated// scales and bumpers orientations should be as fast as possible to avoid wasting time transitioning between them. Hence, the reward (or penalty) should reflect the ''joint-0'' and ''joint-3'' speed when the respective bumper/scales are outside of either state to promote prompt transition.
Remember that the list of hints mentioned above does not cover all possible reward function components, and you are encouraged to make your own observations when designing a reward function.
The inchworm robot states expected to be absorbing can be selected based on a significant deviation from the expected robot configuration during motion, such as:
* The middle bracket touches the ground while the front or back bumper/scales are significantly elevated,
* The middle bracket or the front or back servomotor is significantly twisted,
* Servomotors reaching angles outside of the reasonable range of about 70 degrees,
* Or a combination of the above.
When a specific inchworm configuration is observed to cause undesired behaviour during the training process, consider generalizing it and adding it as an absorption state.
----
/*
==== Changelog ====
| 16-12-2024 15:20 | minor | JK: adding link to MuJoco releases into Familiarizing with Assignment Setup, adding notes on Conda and Venv |
| 16-12-2024 17:10 | minor | JK: adding MuJoCo installation instruction into Familiarizing with Assignment Setup |
*/