Warning
This page is located in the archive.

T4a-rl - Reinforcement Learning

Due date: January 11, 2025, 23:59 PST
Deadline: January 11, 2025, 23:59 PST
Points: 5
Label in BRUTE: t4a-rl
Files to submit: archive with evaluator.py (mandatory), agent.msh (mandatory), and report.pdf (optional)
Resources: b4m36uir_24_t4_resource_pack.zip

Introduction

The inchworm robot considered in this assignment is inspired by the morphology of snakes and caterpillars. It consists of four servomotors acting in the same plane to produce 1D motion using flexible scale-like contact friction pads with directional frictional anisotropy, meaning there is a preferred direction in which the friction coefficient is lower than in the opposite direction. This frictional anisotropy enables the robot to anchor one end down while the other reaches a new pose.

The inchworm locomotion is organized as a repeated motion pattern called a gait. There are several hand-designed gaits, such as the sliding gait, where frictional anisotropy is utilized for basic locomotion consisting of two gait phases (contraction and extension). The sliding gait can be further extended by weight manipulation in the balancing gait via additional balancing phases (balance forward and balance backward). Finally, the contact friction pads can be disengaged and engaged by moving forward on the rigid robot frame instead of the flexible scales, as exploited in the stepping gait that adds several new gait phases (front up, front down, back up, and back down). A well-performing gait should integrate the locomotion approaches of these gaits.

Sliding gait
Balancing Gait example 1
Balancing Gait example 2
Stepping Gait example 1
Stepping Gait example 2

Assignment

In this task, you aim to design a suitable gait for an inchworm robot using reinforcement learning. To do so, you will implement two functions: a reward function and an absorbing state classification. Both functions use the inchworm robot state interface, outlined in inchworm_interface.py and described further below.

The reward function (compute_reward(self, inchworm_state: InchwormInterface) → float in evaluator.py) uses the inchworm robot state to assign it a float reward, i.e. desirable robot states should be rewarded by a higher value than undesirable ones. Desirable states enable the robot to move forward as fast as possible while touching the ground only with the scales and bumpers; hence, for example, reward the inchworm robot state based on the forward velocity in centimetres per second and penalize it by some fixed value for touching the ground with parts other than the scales and bumpers. For additional insight into possible reward function components, see Observations and Hints in the Appendix.
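A minimal sketch of such a reward function, assuming the state interface described below and that the velocity vector is ordered as three angular components followed by the linear velocity; the 100× m-to-cm scaling and the fixed penalty of 1.0 are illustrative choices, not prescribed values:

def compute_reward(self, inchworm_state: InchwormInterface) -> float:
    # Forward (x-axis) velocity of the front servomotor in cm/s;
    # index 3 of get_part_velocity is the linear x-velocity.
    reward = 100.0 * inchworm_state.get_part_velocity("servo-0")[3]
    # Fixed penalty when any non-contacting part touches the ground.
    if inchworm_state.is_touching("ground", "no-touch"):
        reward -= 1.0
    return reward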

The absorbing state classification (is_absorbing(self, inchworm_state: InchwormInterface) → bool, also in evaluator.py) uses the inchworm robot state to decide whether the state should be considered absorbing, i.e. hard or impossible for the robot to recover from, or otherwise undesirable. For additional insight into possible absorbing state classifications, see Observations and Hints in the Appendix.
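A minimal sketch of such a classification; the 70-degree limit follows the Observations and Hints below and is only an illustrative threshold:

def is_absorbing(self, inchworm_state: InchwormInterface) -> bool:
    # Treat states with any joint far outside a reasonable working range as absorbing.
    for joint in ("joint-0", "joint-1", "joint-2", "joint-3"):
        if abs(inchworm_state.get_joint_position(joint)) > 70.0:
            return True
    return False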

The designed functions are used during the policy learning that uses the Soft Actor-Critic algorithm implemented by the Mushroom-RL framework to evaluate the inchworm states during the simulation execution realised by the MuJoCo simulator. After each simulation step, the robot state is evaluated by the reward function followed by the absorbing state detection function to steer the training algorithm.

Inchworm Robot Parts

The robot frame consists of four servomotors, referred to as servo-0 to servo-3 (numbering starts from the robot front), two stiff bumpers called bumper-front and bumper-back (depicted in blue), two contact scales called scales-front and scales-back (depicted in grey), two short brackets called bracket-front and bracket-back (depicted in yellow), and finally the double-sided bracket called bracket-middle (depicted in green).

Inchworm Interface Description

The state of the inchworm robot can be accessed via six functions: three focus on the robot parts, two focus on the joint states, and one focuses on the collisions. The robot is located on the infinite xy plane with the forward direction identical to the x-axis and the joint axes parallel to the y-axis. The simulation is set to operate in meters.

The get_part_position(self, part_name : str) → “np.array | None” takes a part name (defined in the Inchworm Robot Parts section using monospace text) and returns the part position vector [x,y,z] in the world coordinates.

inchworm_state.get_part_position("bracket-front") # gets the position of the front bracket
inchworm_state.get_part_position("servo-0") # returns the position of the front servomotor

The get_part_rotation(self, part_name : str, degrees : bool = True) → “np.array | None” takes a part name and returns the part's XYZ Euler angles in the intrinsic convention within the part frame. Note that all part coordinate frames follow the D-H (Denavit-Hartenberg) notation, i.e. all part coordinate frames are aligned with the associated joints such that the x-axis always points to the closest joint axis and the joint axes are identical to the z-axis. The returned values are in degrees but can be changed to radians by setting the optional argument degrees=False.

inchworm_state.get_part_position("bracket-middle") # gets the rotation of the middle bracket
inchworm_state.get_part_position("servo-1") # returns the rotation of the second servomotor

The get_part_velocity(self, part_name : str) → “np.array | None” takes a part name and returns the vector of angular velocities around respective axes in radians per second, followed by the linear velocity in meters per second.

inchworm_state.get_part_position("bumper-back") # gets the velocity of the middle bracket
inchworm_state.get_part_position("servo-3") # returns the velocity of the back-most servomotor

The get_joint_position(self, joint_name : str, degrees : bool = True) → “float | None” takes a joint name (joint-0, joint-1, joint-2, joint-3) and returns the joint position (rotation around its shaft). The joint rotation is in degrees but can be changed to radians by setting the optional argument degrees=False.

The get_joint_velocity(self, joint_name : str) → “float | None” takes a joint name and returns joint rotation speed in radians per second.
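Analogous usage of the joint functions (joint names as listed above):

inchworm_state.get_joint_position("joint-0") # gets the angle of joint-0 in degrees
inchworm_state.get_joint_velocity("joint-2") # returns the rotation speed of joint-2 in radians per second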

Finally, the is_touching(self, part_name_1 : str, part_name_2 : str) → “bool | None” takes two part names and returns whether they are touching each other (colliding). Moreover, ground can be used to check whether a part is touching the ground, and no-touch can be used to decide whether any non-contacting part is touching the other provided part.

inchworm_state.is_touching("ground", "bumper-front") # check for collision between ground and front bumper
inchworm_state.is_touching("ground", "no-touch") # check for collision between ground and non-contacting parts

Note that any of the abovementioned functions returns None whenever the given name is invalid.
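For instance (the part name below is intentionally invalid, as there are only four servomotors):

inchworm_state.get_part_position("servo-4") # returns None, since there is no such part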

Training

Assuming that the (virtual) environment is set up as described in the Appendix below, the main.py enables interaction with the reinforcement learning simulation setup using the program arguments.

To enumerate available program arguments, run:

python3 main.py --help

For initial training with visualization enabled, run:

python3 main.py

The default arguments run 200 training epochs, each consisting of 5 minutes (300 seconds) of training followed by 30 seconds of performance evaluation. A new directory labelled by the current time is created for each run with a copy of the relevant sources. After each epoch, the current agent's policy is saved as checkpoint-N.msh with additional metadata stored in checkpoint-N.metadata, so the training can be interrupted and continued later. The checkpoint metadata contains information about performance during the training epoch: the (discounted) reward, total distance, the real-time epoch duration in seconds, the real-time factor, and the approximate evaluation in BRUTE (see Evaluation).

For running the training headless (without visualization during evaluation), run:

python3 main.py --render=""

For observing already trained agent policy located in path/to/agent.msh, run:

python3 main.py --load_agent="true" --agent_path_to_load="path/to/agent.msh" --render="true" --n_epochs=0

Note that on an average-performing CPU, each training epoch runs with a real-time factor of about 2, meaning that each second of simulation time takes about 2 seconds of real time, and the policy usually takes about 6 to 8 hours of simulation time to train.

To further train selected agent policy located in path/to/agent.msh, run:

python3 main.py --load_agent="true" --agent_path_to_load="path/to/agent.msh" --render=""

Evaluation

The proposed reward function and absorbing state classification (evaluator.py) and the trained agent's policy (agent.msh) must be submitted. The trained agent's policy is run for 30 seconds by the evaluation system. The distance travelled by the backmost servomotor (servo-3), averaged over the last ten simulation steps, determines the distance travelled by the robot.

  • A non-negative distance is awarded a single point (denoted brute_points_for_forward in the metadata files).
  • Moreover, each 5 cm travelled is awarded an additional point (denoted brute_points_for_distance in the metadata files).
  • On top of that, if only the scales and bumpers touched the ground, an additional point is awarded (denoted brute_points_for_touching in the metadata files).

The abovementioned points are summed, and up to 5 points are assigned to the simulation run (denoted brute_total_points in the metadata files). Ten simulation runs are executed, and the median of the points achieved is assigned as the final score.
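For example, assuming each completed 5 cm counts, a run that ends 12 cm ahead of its start while touching the ground only with the scales and bumpers scores 1 (forward) + 2 (distance) + 1 (touching) = 4 points.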

Aside from the policy evaluation, the student can provide report.pdf to summarize the suggested approach and report on its performance. The report is awarded up to a single point.

As the task was not prepared before the beginning of the semester and the lecture covering reinforcement learning was cancelled, only 1 point is required for this task to be considered finished.

Appendix

Installation (Ubuntu LTS >= 20.04)

  • The assignment setup was designed for Python 3.10 or Python 3.11, so verify the installed version first.

python3 --version

  • Install the required version, if necessary.

sudo apt update
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt install python3.10-full -y

  • To separate the assignment from other Python packages managed by pip, install the virtual environment package and create a new virtual environment.

sudo apt install python3-pip -y
pip3 install virtualenv --upgrade
virtualenv inchworm_rl_venv --python=python3.10

  • Enter the newly created virtual environment and install the required dependencies.

source inchworm_rl_venv/bin/activate && pip3 install -r requirements.txt

The previously mentioned steps are summarized in the provided install-venv.sh.

Venv causes compatibility issues on machines already running Conda; hence, either use Conda instead of Venv or deactivate Conda completely when using Venv.

Familiarizing with Assignment Setup

To familiarize yourself with the simulator setup, it is recommended that you use the MuJoCo simulator outside the reinforcement learning pipeline by following these steps.

  1. Download a MuJoCo release from the MuJoCo GitHub releases page and unpack it.
  2. Open the MuJoCo simulator (run bin/simulate.sh in the MuJoCo unpacked archive root directory).
  3. Add inchworm.xml from the model directory by dragging and dropping it into the MuJoCo window.

Then you are free to

  • Explore the joints' positions in the second Control card in the right column,
  • Show the visual elements by pressing the 4 key,
  • Hide the collision elements by pressing the 1 key.

Note that visual elements are purely visual and play no role during training.

Examine the robot part names under the Rendering tab in the left column.

  • Show the geom names by selecting Label → Geom; these names are used by is_touching,
  • Show the part coordinate frames by selecting Frame → Body; these frames are used to get the position, rotation, and velocity,
  • Show the body names by selecting Label → Body; these names are used to get the position, rotation, and velocity.

Observations and Hints

Firstly, it is advised to design the reward function structure by defining its components and assigning them similar weights. The components' weights can be modified in the later stages of the reward function design if the observed behaviour is undesirable.

The inchworm locomotion aims to be as fast as possible. Hence, the forward motion should be promoted, for example, by using the average forward velocity along the x-axis of the first and last servomotors (servo-0 and servo-3) as a part of the reward function.

The desired locomotion uses only the scales and bumpers when moving forward. Hence, any other robot part in contact with the ground should be penalized, for example, using a fixed penalty and the is_touching interface function. Moreover, when a state approaches the undesired states, an additional proportional penalty can be applied, for example, when joint-1 or joint-2 gets closer to the zero angle from the respective (positive or negative) direction.

It can be observed in the hand-tuned gaits that the scales and bumpers in an approximately horizontal configuration do not move forward, as they are expected to anchor the robot in place. Hence, their movement in the approximately horizontal position should be penalized, for example, proportionally to their forward speed.

It can also be observed that the bumper, scales, and the respective servo move forward whenever they are significantly rotated (in other words, when only the bumper is touching the ground). Hence, this forward movement while rotated should be rewarded.

Finally, in a well-performing gait, the transition between the approximately horizontal and the significantly rotated orientations of the scales and bumpers should be as fast as possible to avoid wasting time. Hence, the reward (or penalty) should reflect the joint-0 and joint-3 speed whenever the respective bumper/scales are in neither of these states, to promote a prompt transition.

Remember that the list of hints mentioned above does not cover all possible reward function components, and you are encouraged to make your own observations when designing a reward function.
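The hints above could be combined into a reward function of roughly the following shape (a sketch only; the weights, the 100× m-to-cm scaling, and the use of is_touching with the scales as a proxy for the anchoring configuration are illustrative assumptions, not a reference solution):

def compute_reward(self, inchworm_state: InchwormInterface) -> float:
    # Average forward (x-axis) velocity of the first and last servomotors in cm/s;
    # index 3 of get_part_velocity is the linear x-velocity.
    v_front = 100.0 * inchworm_state.get_part_velocity("servo-0")[3]
    v_back = 100.0 * inchworm_state.get_part_velocity("servo-3")[3]
    reward = 0.5 * (v_front + v_back)

    # Fixed penalty when any non-contacting part touches the ground.
    if inchworm_state.is_touching("ground", "no-touch"):
        reward -= 1.0

    # An end whose scales touch the ground is expected to anchor the robot;
    # penalize its forward sliding proportionally to its speed.
    for scales, servo in (("scales-front", "servo-0"), ("scales-back", "servo-3")):
        if inchworm_state.is_touching("ground", scales):
            slide = max(0.0, 100.0 * inchworm_state.get_part_velocity(servo)[3])
            reward -= 0.5 * slide

    return reward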

The inchworm robot states expected to be absorbing can be selected based on a significant deviation from the expected robot configuration during its motion, such as

  • The middle bracket touches the ground while the front or back bumper/scales are significantly elevated,
  • The middle bracket or the front or back servomotor is significantly twisted,
  • The servomotors reach angles outside of the reasonable range of about 70 degrees,
  • Or a combination of the above.

When a specific inchworm configuration is observed to cause undesired behaviour during the training process, consider generalizing it and adding it as an absorbing state.
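A hedged sketch of the first condition above (the 5 cm elevation threshold is an illustrative assumption):

def is_absorbing(self, inchworm_state: InchwormInterface) -> bool:
    # Middle bracket resting on the ground while either end is significantly elevated.
    if inchworm_state.is_touching("ground", "bracket-middle"):
        front_z = inchworm_state.get_part_position("scales-front")[2]
        back_z = inchworm_state.get_part_position("scales-back")[2]
        if front_z > 0.05 or back_z > 0.05:
            return True
    return False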

Changelog

16-12-2024 15:20 minor JK: adding link to MuJoCo releases into Familiarizing with Assignment Setup, adding notes on Conda and Venv
16-12-2024 17:10 minor JK: adding MuJoCo installation instruction into Familiarizing with Assignment Setup