MDP Components#

This section documents the Markov Decision Process (MDP) building blocks in AGILE. These components are defined in agile/rl_env/mdp/ and are composed by task configurations to define complete training environments.

AGILE builds on Isaac Lab’s manager-based architecture: each MDP component (rewards, observations, actions, etc.) is a function or class registered with the corresponding manager. The top-level agile.rl_env.mdp module re-exports both Isaac Lab’s built-in MDP terms and AGILE’s custom additions, so task configs can import everything from a single namespace:

from agile.rl_env import mdp

Rewards#

Reward functions are the core training signal. AGILE organizes rewards into four modules based on their purpose.

Task Rewards (rewards/task_rewards.py)#

Primary rewards that define the training objective for each task.

Velocity tracking – the main locomotion rewards:

| Function | Description |
| --- | --- |
| `track_lin_vel_xy_yaw_frame_exp_weighted_simplified` | Track linear velocity (x, y) in the yaw-aligned frame with an exponential kernel. Higher commanded velocities receive higher weight. |
| `track_ang_vel_z_world_exp_weighted_simplified` | Track angular velocity (yaw) in the world frame with magnitude-based weighting. |
| `track_base_height_exp_smooth` | Track the commanded base height using an exponential kernel on a smoothed height signal. |
| `base_height_exp` | Track a fixed target base height with optional terrain-sensor adjustment. |
| `vel_xy_in_threshold` | Binary reward: 1.0 if the velocity tracking error is within a threshold. |

Height tracking:

| Function | Description |
| --- | --- |
| `track_base_height` | Track the commanded height with an exponential kernel, active only during stance. |
| `base_height_in_threshold` | Binary reward for height within a threshold of the command. |
| `height_reached` | Bonus reward when the target height is reached within tolerance. |

Stand-up specific:

| Function | Description |
| --- | --- |
| `standing_at_timeout` | Bonus reward when the episode times out AND the robot is standing above a minimum height. Encourages both standing up and staying standing. |

Trajectory tracking (pick-and-place):

| Function | Description |
| --- | --- |
| `static_at_goal_exp` | Reward being static (low velocities) during the final portion of a trajectory. Uses progress-based gating. |
| `nominal_posture_at_end_exp` | Reward matching the final frame’s joint posture at trajectory end. |

Gait rewards:

| Function | Description |
| --- | --- |
| `stand_still` | Penalize foot lift when velocity commands are near zero (stance mode). |
| `phase_contact` | Reward correct foot contact timing using the gait phase from the command term (XNOR logic). |

Most tracking rewards use the exponential kernel pattern:

reward = torch.exp(-error / std**2)

This provides a smooth gradient that is 1.0 at zero error and decays toward 0.0. The std parameter controls the tolerance width.
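
As a concrete illustration, here is a minimal standalone sketch of the kernel (plain Python, not the batched torch implementation used in AGILE):

```python
import math

def exp_tracking_reward(error_sq: float, std: float) -> float:
    """Exponential tracking kernel: 1.0 at zero error, decaying toward 0.0."""
    return math.exp(-error_sq / std**2)

# Zero error gives the maximum reward of 1.0.
assert exp_tracking_reward(0.0, std=0.25) == 1.0
# A wider std tolerates the same squared error with less penalty.
assert exp_tracking_reward(0.1, std=0.5) > exp_tracking_reward(0.1, std=0.25)
```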

Tracking Rewards (rewards/tracking_rewards.py)#

Rewards for the pick-and-place trajectory tracking task. These work with the TrackingCommand term.

| Function | Description |
| --- | --- |
| `motion_global_anchor_position_error_exp` | Track the reference anchor (base) position in the world frame. |
| `motion_global_anchor_orientation_error_exp` | Track the reference anchor orientation using quaternion error. |
| `motion_tracked_joint_pos_error_exp` | Track reference joint positions for the tracked joints. |
| `motion_object_position_error_exp` | Track the reference object position in the world frame. |
| `motion_object_orientation_error_exp` | Track the reference object orientation. |
| `hand_object_distance_tracking_exp` | Reward hand-object proximity with automatic phase detection. Detects the lift peak in the reference trajectory and decays the reward after placement. |

Aesthetic Rewards (rewards/aestetic_rewards.py)#

Style and quality-of-motion rewards that shape how the robot moves, not just whether it achieves the objective.

Body stability:

| Class/Function | Description |
| --- | --- |
| `body_acc_l2` | Penalize body linear and angular accelerations using velocity history. Can target the root or any specified link. |
| `body_ang_vel_l2` | Penalize the angular velocity of a body/link (reduces shaking). |
| `flat_body_orientation_exp` | Reward flat body orientation via small xy-components of projected gravity. |

Foot quality:

| Function | Description |
| --- | --- |
| `feet_roll_l2` | Penalize non-flat foot roll angles. |
| `feet_yaw_diff_l2` | Penalize the yaw difference between the left and right feet (reduced during turns). |
| `feet_yaw_mean_vs_base` | Penalize foot yaw relative to the base frame. |
| `feet_distance_from_ref` | Penalize lateral distance deviation from a reference spacing. |
| `feet_stumble` | Penalize high horizontal contact forces on the feet. |
| `feet_slip` | Penalize horizontal foot velocity while in ground contact. |
| `foot_orientation_l1` | L1 penalty on foot roll, pitch, and yaw with configurable weights. |
| `impact_velocity_l1` | Penalize large impact velocities at foot contact. |
| `jumping` | Penalize both feet leaving the ground simultaneously. |
| `equal_foot_force` | Reward even force distribution across both feet (1.0 = perfectly balanced). |

Stance-mode rewards (active when velocity command is zero):

| Function | Description |
| --- | --- |
| `joint_deviation_exp_if_standing` | Penalize joint deviation from defaults, only when standing. |
| `moving_if_standing` | Penalize body motion when standing above a height threshold. |
| `equal_foot_force_if_null_cmd` | Reward balanced foot forces during stance. |

Regularization Rewards (rewards/regularization_rewards.py)#

Penalties that encourage smooth, efficient actuation.

Action smoothness:

| Function | Description |
| --- | --- |
| `action_rate_l2` | L2 penalty on the action change between timesteps. |
| `action_rate_rate_l2` | L2 penalty on the second derivative of actions (jerk in action space). |
| `joint_deviation_l2` | L2 penalty on joint position deviation from defaults. |

Energy and torque efficiency:

| Function | Description |
| --- | --- |
| `torque_limits` | Penalize torques exceeding a soft limit (a configurable fraction of the hardware limit). |
| `incoming_forces_penalty` | Penalize large internal joint wrench forces above a threshold. |
| `contact_forces_l2` | Penalize contact forces exceeding a threshold (L2 squared). |
| `relax_if_null_cmd` | Penalize torque magnitude during stance (zero-velocity commands). |
| `relax_if_null_cmd_exp` | Reward low torques during stance using an exponential kernel with cached torque limits. |

Reward Visualizer (rewards/reward_visualizer.py)#

A real-time visualization tool for monitoring individual reward terms during evaluation. Used by the debug and pick-and-place debug environments. Each reward term is displayed as a bar chart that updates every simulation step.

Actions#

Action terms define the policy’s output space and how it maps to joint commands. Located in agile/rl_env/mdp/actions/.

Joint Position Actions#

JointPositionActionCfg (Isaac Lab built-in)

Standard joint position action with configurable scale and offset. The policy outputs deltas around default joint positions.

DeltaJointPositionAction / DeltaJointPositionActionCfg

Outputs delta joint positions that accumulate over time. Supports per-joint scaling, separate “steady” joints held at defaults, and optional joint limits. Used for manipulation tasks where incremental motion is more natural.

SmoothJointPositionAction / SmoothJointPositionActionCfg

Wraps joint position actions with exponential moving average (EMA) smoothing. Configurable ema_smoothing_param (1.0 = no smoothing).
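
The smoothing step itself reduces to an exponential moving average. A standalone sketch (the real action term operates on batched torch tensors):

```python
def ema_smooth(raw_action, prev_smoothed, ema_smoothing_param):
    """Blend the newest action with the running average.

    ema_smoothing_param = 1.0 passes actions through unchanged (no smoothing);
    smaller values weight history more heavily, trading responsiveness for
    smoothness.
    """
    a = ema_smoothing_param
    return [a * new + (1.0 - a) * old for new, old in zip(raw_action, prev_smoothed)]

assert ema_smooth([1.0, -1.0], [0.0, 0.0], ema_smoothing_param=1.0) == [1.0, -1.0]
assert ema_smooth([1.0, -1.0], [0.0, 0.0], ema_smoothing_param=0.5) == [0.5, -0.5]
```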

Random Actions#

RandomPositionAction / RandomActionCfg

Generates random joint positions for upper-body joints during locomotion training. Supports configurable velocity profiles (EMA, linear, trapezoidal) for smooth transitions between random targets, and optional stance-mode behavior.

RandomJointPositionAction / RandomJointPositionActionCfg

Alternative random action with curriculum support. Can gradually increase randomization range during training for progressive difficulty.

Policy Actions#

AgileBasedLowerBodyAction / AgileLowerBodyActionCfg

Runs a pre-trained, frozen RL policy as an action term. Used in the pick-and-place task to provide stable locomotion while training the upper-body policy. Takes the path to a JIT-exported policy model and an observation group name.

GUI Actions#

JointPositionGUIAction / JointPositionGUIActionCfg

Interactive GUI slider control for all joints. Supports mirroring between left/right sides and adjustable PD gains. Used in debug environments.

ObjectPoseGUIAction / ObjectPoseGUIActionCfg

Interactive GUI control for object position and rotation. Used in object debug environments.

Assistance Actions#

HarnessAction / HarnessActionCfg

Simulates a simplified harness by applying external forces and torques to prevent falling. Configurable stiffness, damping, and force/torque limits. Supports height commands for dynamic target height.

LiftAction / LiftActionCfg

Applies upward forces to lift the robot, with configurable ramp-up timing. Used in stand-up training with a curriculum that gradually reduces the assistance. Supports delayed start (start_lifting_time_s) and ramped lifting (lifting_duration_s).
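
The delayed, ramped lifting schedule can be sketched as follows (the linear ramp shape is an assumption for illustration; the parameter names follow the config):

```python
def lift_force_scale(t, start_lifting_time_s, lifting_duration_s):
    """Fraction of the full lift force applied at episode time t.

    Zero before the delayed start, then a linear ramp up to 1.0 over
    lifting_duration_s seconds (ramp shape assumed for illustration).
    """
    ramp = (t - start_lifting_time_s) / lifting_duration_s
    return min(max(ramp, 0.0), 1.0)

assert lift_force_scale(0.5, start_lifting_time_s=1.0, lifting_duration_s=2.0) == 0.0
assert lift_force_scale(2.0, start_lifting_time_s=1.0, lifting_duration_s=2.0) == 0.5
assert lift_force_scale(5.0, start_lifting_time_s=1.0, lifting_duration_s=2.0) == 1.0
```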

Velocity Profiles#

Located in actions/velocity_profiles/, these define how random upper-body actions transition between targets. All profiles use fully vectorized batch operations and support synchronized joint motion.

| Profile | Description | Key Parameters |
| --- | --- | --- |
| `EMAVelocityProfileCfg` | Exponential moving average: `pos = alpha * target + (1 - alpha) * current`. Smooth convergence to targets. | `ema_coefficient_range` |
| `LinearVelocityProfileCfg` | Constant-velocity motion: `pos = initial + velocity * time`. Predictable, time-based control. | `velocity_range` (rad/s) |
| `TrapezoidalVelocityProfileCfg` | Three-phase motion (acceleration, cruise, deceleration). Physically realistic. | `acceleration_range`, `max_velocity_range`, `deceleration_range` |

Usage example:

from agile.rl_env.mdp.actions.velocity_profiles import TrapezoidalVelocityProfileCfg
from agile.rl_env.mdp.actions import RandomActionCfg

action_cfg = RandomActionCfg(
    asset_name="robot",
    joint_names=["joint1", "joint2"],
    sample_range=(0.1, 1.5),
    velocity_profile_cfg=TrapezoidalVelocityProfileCfg(
        acceleration_range=(1.0, 3.0),
        max_velocity_range=(0.5, 2.0),
        synchronize_joints=True,
    ),
    no_random_when_walking=True,
)

To visualize and compare all profiles:

python agile/rl_env/mdp/actions/velocity_profiles/test_profile_comparison.py
python agile/rl_env/mdp/actions/velocity_profiles/test_profile_comparison.py --save-figure

Commands#

Command generators produce the reference signals that the policy must track. Located in agile/rl_env/mdp/commands/.

Velocity Commands#

UniformNullVelocityCommand / UniformNullVelocityCommandCfg

Generates random velocity commands (linear x, y and angular yaw) with a configurable fraction of environments receiving zero-velocity (“stance”) commands. Extends Isaac Lab’s UniformVelocityCommand with:

  • EMA smoothing: Smooth velocity measurement for reward computation.

  • Minimum velocity norm: Commands below this threshold are zeroed out.

  • Bias sampling: Option to sample more low-speed commands for better stance training.

  • Command filtering: Per-axis low-pass filtering for smooth command transitions.
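
A simplified sketch of this sampling logic for a single environment (parameter names and the exact zeroing behavior here are illustrative, not AGILE's):

```python
import math
import random

def sample_velocity_command(rng, null_fraction=0.2, min_norm=0.1,
                            lin_range=(-1.0, 1.0), ang_range=(-1.0, 1.0)):
    """Sample a (vx, vy, wz) command.

    A configurable fraction of samples are forced to stance (all zeros), and a
    command whose linear-velocity norm falls below min_norm is zeroed as well.
    """
    if rng.random() < null_fraction:
        return (0.0, 0.0, 0.0)  # stance command
    vx, vy = rng.uniform(*lin_range), rng.uniform(*lin_range)
    wz = rng.uniform(*ang_range)
    if math.hypot(vx, vy) < min_norm:
        return (0.0, 0.0, 0.0)  # too small to track meaningfully
    return (vx, vy, wz)

# With null_fraction=1.0 every command is a stance command.
assert sample_velocity_command(random.Random(0), null_fraction=1.0) == (0.0, 0.0, 0.0)
```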

Velocity + Height Commands#

UniformVelocityBaseHeightCommand / UniformVelocityBaseHeightCommandCfg

Extends velocity commands with a base height command. Includes:

  • A minimum walk height: Below this, velocity commands are scaled down to prevent walking while crouched.

  • Squatting threshold: Zeroes velocities when transitioning to low heights.

  • Height sensor integration: Uses a ray caster to measure height above terrain.

Velocity + Height + Gait Commands#

UniformVelocityGaitBaseHeightCommand / UniformVelocityGaitBaseHeightCommandCfg

Adds gait phase information to velocity + height commands. Provides gait_process (current phase in [0, 1]) and gait_frequency signals used by gait-cycle rewards to enforce proper foot timing.
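
A sketch of how such a phase signal can drive a contact reward (the 50% duty cycle and half-cycle leg offset here are assumptions for illustration):

```python
def advance_gait_phase(gait_process, gait_frequency, dt):
    """Advance the normalized gait phase, wrapping within [0, 1)."""
    return (gait_process + gait_frequency * dt) % 1.0

def phase_contact_reward(gait_process, left_contact, right_contact):
    """XNOR between expected and actual contact for each foot.

    Assumes the left foot is in stance for phase < 0.5 and the right foot
    for phase >= 0.5 (legs offset by half a cycle).
    """
    left_expected = gait_process < 0.5
    right_expected = not left_expected
    # XNOR: one reward unit per foot whose contact state matches expectation.
    return float(left_expected == left_contact) + float(right_expected == right_contact)

phase = advance_gait_phase(0.9, gait_frequency=1.0, dt=0.2)  # wraps past 1.0
assert abs(phase - 0.1) < 1e-9
assert phase_contact_reward(0.25, left_contact=True, right_contact=False) == 2.0
```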

Trajectory Tracking Commands#

TrackingCommand / TrackingCommandCfg

Generates reference poses from pre-recorded YAML trajectory files. Tracks:

  • Anchor body position and orientation (global reference)

  • Joint positions for specified tracked joints

  • Object position and orientation (if object tracking is configured)

  • Automatic peak detection for pick-and-place phase gating

Observations#

Observation terms define what the policy sees. Located in agile/rl_env/mdp/observations/.

Observation Groups#

Task configs define observation groups as nested ObsGroup dataclasses. Common groups:

  • policy: Observations available to the deployed policy (proprioceptive only).

  • critic: Additional observations for the critic during training (can include privileged info).

  • teacher: Privileged observations for teacher policies (e.g., terrain height scans).

Standard Observations#

Most observations come from Isaac Lab’s built-in terms:

| Term | Description |
| --- | --- |
| `base_ang_vel` | Base angular velocity in the body frame. |
| `projected_gravity` | Gravity vector projected into the body frame (orientation indicator). |
| `joint_pos_rel` | Joint positions relative to defaults. |
| `joint_vel_rel` | Joint velocities relative to defaults. |
| `last_action` | Previous policy action. |
| `generated_commands` | Current command vector. |
| `height_scan` | Terrain height scan from a ray caster (privileged). |

Custom Observations#

Defined in observations/observations_io.py:

| Term | Description |
| --- | --- |
| `velocity_height_command` | Velocity + height command vector for evaluation logging. |
| `joint_acc` | Joint accelerations. |

Tracking Observations#

Defined in observations/tracking_observations.py for the pick-and-place task. These provide the current and target states for trajectory tracking.

History Stacking#

Observation groups support history_length to stack multiple timesteps. For example, history_length=5 concatenates the last 5 observation vectors, giving the policy temporal context without recurrence.
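
A minimal sketch of such a stacking buffer for a single environment (AGILE's manager-based implementation is batched and vectorized):

```python
from collections import deque

class ObsHistory:
    """Fixed-length rolling buffer that concatenates the last N observations."""

    def __init__(self, history_length, obs_dim):
        # Pre-fill with zeros so the stacked vector has a fixed size from step 0.
        self.buffer = deque(([0.0] * obs_dim for _ in range(history_length)),
                            maxlen=history_length)

    def push(self, obs):
        self.buffer.append(list(obs))  # oldest frame is evicted automatically

    def stacked(self):
        # Oldest-to-newest concatenation: history_length * obs_dim values.
        return [x for frame in self.buffer for x in frame]

hist = ObsHistory(history_length=5, obs_dim=3)
hist.push([1.0, 2.0, 3.0])
assert len(hist.stacked()) == 15          # 5 timesteps * 3 dims
assert hist.stacked()[-3:] == [1.0, 2.0, 3.0]  # newest frame is last
```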

Terminations#

Termination conditions end episodes early. Located in agile/rl_env/mdp/terminations.py.

Standard Terminations#

| Term | Description |
| --- | --- |
| `time_out` | Episode exceeds the maximum length (marked as a timeout, not a failure). |
| `illegal_ground_contact` | A non-foot body contacts the ground above a force threshold while below a minimum height. |
| `illegal_base_height` | Base height drops below a threshold (adjusted for terrain). |

Adaptive Terminations#

| Class | Description |
| --- | --- |
| `fall_from_max_height` | Terminates when the robot falls a configurable distance below its peak achieved height. More adaptive than fixed thresholds since it is relative to progress. Clamps the maximum trackable height to ignore jumping. |
| `no_height_progress` | Terminates if no upward progress is made within a time window. More forgiving than `fall_from_max_height`: it does not punish falling after reaching standing height, only a lack of any progress. |
| `standing` | Terminates when the robot stands above a height for a specified duration (used as a success condition in stand-up). |
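
The `fall_from_max_height` logic can be sketched per step like this (a scalar, single-environment illustration; the numeric thresholds are hypothetical):

```python
def fall_from_max_height_step(height, peak_height, fall_distance, max_trackable_height):
    """One step of the 'fall from peak' check.

    Returns (terminated, updated_peak). The peak is clamped at
    max_trackable_height so a jump does not raise the bar the robot
    must later defend.
    """
    peak_height = min(max(peak_height, height), max_trackable_height)
    terminated = height < peak_height - fall_distance
    return terminated, peak_height

# Rising to 0.7 m and then dropping to 0.3 m with a 0.3 m allowance terminates.
_, peak = fall_from_max_height_step(0.7, 0.0, fall_distance=0.3, max_trackable_height=0.75)
done, _ = fall_from_max_height_step(0.3, peak, fall_distance=0.3, max_trackable_height=0.75)
assert done
```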

Trajectory Terminations#

| Function | Description |
| --- | --- |
| `bad_base_pose` | Base position error from the reference trajectory exceeds a threshold. |
| `bad_base_rotation` | Base orientation error from the reference exceeds a threshold. |
| `bad_joint_pos` | Joint position error from the reference exceeds a threshold. |
| `out_of_bound` | The object leaves a defined bounding box (supports reference-frame transforms). |
| `link_distance` | The distance between two specified links is outside the allowed range. |

Events#

Event terms handle environment resets and domain randomization. Located in agile/rl_env/mdp/events/.

Reset Events#

| Term | Description |
| --- | --- |
| `reset_joints_around_default` | Reset joint positions and velocities to random values around defaults, clipped to soft limits. |
| `reset_root_state_uniform` | Reset the robot base pose (Isaac Lab built-in). |

Randomization Events#

| Class | Description |
| --- | --- |
| `disable_joints` | Temporarily disable specified joints for a random duration during an episode. Simulates actuator failures for robustness. |

Fallen State Management#

For the stand-up task, specialized event infrastructure manages pre-collected fallen states:

  • FallenStateDataset (events/fallen_state_dataset.py): Manages collection and storage of fallen robot states. Spawns robots, lets them fall, and records the resulting joint positions and velocities.

  • FallenStateCache (events/fallen_state_cache.py): Disk caching with automatic invalidation when terrain configuration changes.

  • reset_from_fallen_dataset (events/reset_from_fallen_dataset.py): Episode reset event that samples from the fallen state dataset instead of simulating falls in real time.

Curriculum#

Curriculum terms adjust training difficulty over time. Located in agile/rl_env/mdp/curriculums/.

Terrain Curriculum (task_curriculum.py)#

initial_pose_curriculum

Progresses robots through terrain difficulty levels based on distance walked. Robots that walk far enough move to harder terrains; robots that underperform move to easier ones.

Effort Limit Curriculum (task_curriculum.py)#

effort_limit_curriculum

Gradually decreases actuator effort limits over training. Starts with inflated limits for easier initial exploration, then decays geometrically toward the real hardware limits. Adjusts both effort limits and saturation effort for DC motors.

effort_limit_curriculum_traveled_distance

Same concept, but triggered by traveled distance rather than another curriculum’s state.
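
The decay schedule can be sketched as follows (the raw step-count trigger and parameter names here are illustrative; the actual terms key off training progress or traveled distance):

```python
def effort_limit_at(step, initial_limit, hardware_limit, decay_rate):
    """Geometric decay of the inflated effort limit toward the hardware limit.

    The excess above the hardware limit shrinks by decay_rate each step,
    so the limit approaches, but never undercuts, the real hardware value.
    """
    excess = initial_limit - hardware_limit
    return hardware_limit + excess * decay_rate**step

assert effort_limit_at(0, initial_limit=120.0, hardware_limit=80.0, decay_rate=0.9) == 120.0
# After many steps the limit is close to the hardware value.
assert 80.0 < effort_limit_at(50, 120.0, 80.0, 0.9) < 81.0
```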

Harness/Lift Curriculum (task_curriculum.py)#

Various curriculum terms reduce assistance forces over training:

  • remove_harness: Gradually reduces harness stiffness and damping to zero.

  • adaptive_lift_curriculum: Reduces lift force based on standing success rate.

Randomization Curriculum (randomization_curriculum.py)#

increase_event_randomization

Increases the range of domain randomization parameters over training. Scales event parameter ranges from an initial fraction to a terminal fraction, based on another curriculum’s progress.

Terrains#

Terrain configurations for rough terrain training. Located in agile/rl_env/mdp/terrains/.

Pre-configured Terrain Sets#

Defined in terrains.py:

| Config | Description |
| --- | --- |
| `ROUGH_TERRAIN_CFG` | Full difficulty range with random grid boxes, random rough surfaces, and slope ramps. |
| `LESS_ROUGH_TERRAIN_CFG` | Reduced-difficulty variant for initial training stages. |

Custom Terrain Types#

Defined in hf_terrains.py / hf_terrains_cfg.py:

HfRandomUniformTerrainDifficultyCfg

Height-field based random uniform terrain where noise range scales with difficulty level. Provides smooth progression from flat to highly irregular surfaces.

Actuators#

Custom actuator models that simulate real hardware behavior. Located in agile/rl_env/mdp/actuators/.

Delayed DC Motor (DelayedDCMotor / DelayedDCMotorCfg)#

Extends Isaac Lab’s DCMotor with communication delay simulation:

  • Random delay: Each environment gets a random delay between min_delay and max_delay timesteps, sampled at reset.

  • Delay buffers: Separate buffers for position, velocity, and effort signals.

  • Torque-speed curve: Models the DC motor characteristic where available torque decreases with joint velocity.

Delayed Implicit Actuator (DelayedImplicitActuator / DelayedImplicitActuatorCfg)#

Extends Isaac Lab’s ImplicitActuator with the same delay mechanism. Used for locomotion-height tasks where the actuator model is simpler but delay is still important for sim-to-real transfer.

Tip

Actuator delay is critical for sim-to-real transfer. Real robots have non-negligible communication latency between the policy computer and joint controllers. Training with randomized delays makes the policy robust to this variation.
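
The delay mechanism amounts to a per-environment FIFO whose length is resampled at reset. A scalar, single-signal sketch (the actual actuators keep batched buffers per signal type):

```python
import random
from collections import deque

class DelayedSignal:
    """Delays a command stream by a per-reset random number of timesteps."""

    def __init__(self, min_delay, max_delay, rng=None):
        self.min_delay, self.max_delay = min_delay, max_delay
        self.rng = rng or random.Random()
        self.reset(initial=0.0)

    def reset(self, initial):
        # Sample a new delay at reset, as the actuator does per environment.
        self.delay = self.rng.randint(self.min_delay, self.max_delay)
        # Pre-fill so the first `delay` reads return the initial value.
        self.buffer = deque([initial] * (self.delay + 1), maxlen=self.delay + 1)

    def step(self, command):
        self.buffer.append(command)
        return self.buffer[0]  # the command issued `delay` steps ago

sig = DelayedSignal(min_delay=2, max_delay=2)
outputs = [sig.step(t) for t in [1.0, 2.0, 3.0, 4.0]]
assert outputs == [0.0, 0.0, 1.0, 2.0]  # everything arrives two steps late
```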

Symmetry#

Morphological symmetry augmentation for data-efficient training. Located in agile/rl_env/mdp/symmetry/.

Purpose#

Bipedal robots have left-right symmetry: a mirrored observation should produce a mirrored action. AGILE leverages this by augmenting the training data with symmetry-transformed samples, effectively doubling the amount of data each environment step contributes.

Robot-Specific Implementations#

lr_mirror_G1 (symmetry_g1.py)

Left-right mirror function for the Unitree G1 robot. Transforms observations and actions by swapping left/right joint indices and negating lateral quantities.

lr_mirror_T1 (symmetry_t1.py)

Left-right mirror function for the Booster T1 robot.

Observation Mirror Primitives (observations.py)#

Building blocks used by robot-specific mirror functions:

| Function | Description |
| --- | --- |
| `lr_mirror_base_lin_vel` | Negate the y-component of the base linear velocity. |
| `lr_mirror_base_ang_vel` | Negate the x and z components of the base angular velocity. |
| `lr_mirror_projected_gravity` | Negate the y-component of projected gravity. |
| `mirror_velocity_commands` | Negate the y and yaw velocity commands. |
| `mirror_height_scan_left_right` | Swap the left and right height scan regions. |
| `mirror_gait_cycle_commands` | Swap the gait phase for the left and right legs. |
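
Composed together, these primitives yield a full mirror function. A toy sketch over a hypothetical 6-D observation layout (the real functions operate on robot-specific Isaac Lab observation slices):

```python
def lr_mirror_obs(obs):
    """Mirror a hypothetical observation [vx, vy, gx, gy, q_left, q_right]."""
    vx, vy, gx, gy, q_left, q_right = obs
    return [
        vx, -vy,         # negate the lateral linear velocity
        gx, -gy,         # negate the lateral projected-gravity component
        q_right, q_left, # swap left/right joint readings
    ]

obs = [1.0, 0.2, 0.0, -0.1, 0.3, -0.4]
mirrored = lr_mirror_obs(obs)
# Mirroring is an involution: applying it twice restores the original sample.
assert lr_mirror_obs(mirrored) == obs
```

During training, each rollout batch can be extended with `lr_mirror_obs(obs)` paired with the correspondingly mirrored actions, which is what doubles the effective data.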

Stability Terms (stability_terms.py)#

Utility functions used by multiple MDP components for computing stability-related quantities (e.g., center of mass, support polygon membership).

Utility Functions (utils.py)#

Shared helper functions used across MDP components:

| Function | Description |
| --- | --- |
| `get_robot_cfg` | Get the robot asset with proper configuration, creating defaults if needed. |
| `get_contact_sensor_cfg` | Get the contact sensor with proper body ID resolution. |
| `get_body_velocities_and_forces` | Extract body velocities and contact forces for specified bodies. |
| `transform_to_asset_frame` | Transform world positions to asset-local coordinates. |
| `transform_to_body_frame` | Transform world positions to a body’s local frame. |
| `compute_asset_aabb` | Compute the axis-aligned bounding box for an asset. |
| `get_joint_indices` | Get joint indices for a specified body part (lower, upper, or whole body). |