# MDP Components
This section documents the Markov Decision Process (MDP) building blocks in AGILE. These components are defined in `agile/rl_env/mdp/` and are composed by task configurations to define complete training environments.
AGILE builds on Isaac Lab's manager-based architecture: each MDP component (rewards, observations, actions, etc.) is a function or class registered with the corresponding manager. The top-level `agile.rl_env.mdp` module re-exports both Isaac Lab's built-in MDP terms and AGILE's custom additions, so task configs can import everything from a single namespace:

```python
from agile.rl_env import mdp
```
## Rewards
Reward functions are the core training signal. AGILE organizes rewards into four modules based on their purpose.
### Task Rewards (`rewards/task_rewards.py`)
Primary rewards that define the training objective for each task.
Velocity tracking – the main locomotion rewards:
- Track linear velocity (x, y) in the yaw-aligned frame with an exponential kernel. Higher commanded velocities receive higher weight.
- Track angular velocity (yaw) in the world frame with magnitude-based weighting.
- Track the commanded base height using an exponential kernel on a smoothed height signal.
- Track a fixed target base height with optional terrain-sensor adjustment.
- Binary reward: 1.0 if the velocity-tracking error is within a threshold.
Height tracking:
- Track the commanded height with an exponential kernel, active only during stance.
- Binary reward for height within a threshold of the command.
- Bonus reward when the target height is reached within tolerance.
Stand-up specific:
- Bonus reward when the episode times out AND the robot is standing above a minimum height. Encourages both standing up and staying standing.
Trajectory tracking (pick-and-place):
- Reward for being static (low velocities) during the final portion of a trajectory. Uses progress-based gating.
- Reward for matching the final frame's joint posture at the trajectory end.
Gait rewards:
- Penalize foot lift when velocity commands are near zero (stance mode).
- Reward correct foot contact timing using the gait phase from the command term (XNOR logic).
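The XNOR contact-timing logic can be sketched as follows. This is an illustrative reduction, not AGILE's implementation: the stance convention (left foot in stance during the first half of the gait cycle) and the scalar per-environment inputs are assumptions; the real term operates on batched tensors and the gait phase supplied by the command term.

```python
def gait_contact_reward(gait_process: float, left_contact: bool, right_contact: bool) -> float:
    """Reward feet whose measured contact matches the stance expected at this phase."""
    # Assumed convention: left foot in stance for phase in [0, 0.5), right foot otherwise.
    left_stance = gait_process < 0.5
    right_stance = not left_stance
    # XNOR: 1 when contact state and expected stance agree, 0 when they disagree.
    left_ok = left_contact == left_stance
    right_ok = right_contact == right_stance
    return (float(left_ok) + float(right_ok)) / 2.0
```

Feet that touch down exactly when their phase window says they should score 1.0; swapped timing scores 0.0.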
Most tracking rewards use the exponential kernel pattern:

```python
reward = torch.exp(-error / std**2)
```

This provides a smooth gradient that is 1.0 at zero error and decays toward 0.0. The `std` parameter controls the tolerance width.
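As a worked instance of this pattern (plain Python standing in for the batched torch code):

```python
import math

def exp_kernel(error: float, std: float) -> float:
    # Same shape as torch.exp(-error / std**2), for a scalar error.
    return math.exp(-error / std**2)

# Squared tracking error between a commanded and measured planar velocity.
cmd, vel = (1.0, 0.0), (0.8, 0.1)
error = sum((c - v) ** 2 for c, v in zip(cmd, vel))  # 0.05
reward = exp_kernel(error, std=0.25)                 # ~0.45
```

A larger `std` widens the tolerance: the same error of 0.05 scores about 0.88 with `std=0.63` but only about 0.45 with `std=0.25`.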
### Tracking Rewards (`rewards/tracking_rewards.py`)
Rewards for the pick-and-place trajectory tracking task. These work with the `TrackingCommand` term.
- Track the reference anchor (base) position in the world frame.
- Track the reference anchor orientation using quaternion error.
- Track the reference joint positions for tracked joints.
- Track the reference object position in the world frame.
- Track the reference object orientation.
- Reward hand-object proximity with automatic phase detection. Detects the lift peak in the reference trajectory and decays the reward after placement.
### Aesthetic Rewards (`rewards/aestetic_rewards.py`)
Style and quality-of-motion rewards that shape how the robot moves, not just whether it achieves the objective.
Body stability:
- Penalize body linear and angular accelerations using velocity history. Can target the root or any specified link.
- Penalize angular velocity of a body/link (reduces shaking).
- Reward flat body orientation via small xy-components of projected gravity.
Foot quality:
- Penalize non-flat foot roll angles.
- Penalize yaw difference between the left and right feet (reduced during turns).
- Penalize foot yaw relative to the base frame.
- Penalize lateral distance deviation from a reference spacing.
- Penalize high horizontal contact forces on the feet.
- Penalize horizontal foot velocity when in ground contact.
- L1 penalty on foot roll, pitch, and yaw with configurable weights.
- Penalize large impact velocities at foot contact.
- Penalize both feet leaving the ground simultaneously.
- Reward even force distribution across both feet (1.0 = perfectly balanced).
Stance-mode rewards (active when velocity command is zero):
- Penalize joint deviation from defaults, only when standing.
- Penalize body motion when standing above a height threshold.
- Reward balanced foot forces during stance.
### Regularization Rewards (`rewards/regularization_rewards.py`)
Penalties that encourage smooth, efficient actuation.
Action smoothness:
- L2 penalty on action change between timesteps.
- L2 penalty on the second derivative of actions (jerk in action space).
- L2 penalty on joint position deviation from defaults.
Energy and torque efficiency:
- Penalize torques exceeding a soft limit (a configurable fraction of the hardware limit).
- Penalize large internal joint wrench forces above a threshold.
- Penalize contact forces exceeding a threshold (squared L2).
- Penalize torque magnitude during stance (zero-velocity commands).
- Reward low torques during stance using an exponential kernel with cached torque limits.
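The soft-limit pattern behind the torque penalty can be sketched as follows. The function name and the quadratic shape above the soft limit are illustrative assumptions; the real term operates on batched joint-torque tensors.

```python
def soft_limit_penalty(torques, hard_limit: float, soft_factor: float = 0.9) -> float:
    """Quadratic penalty on the portion of |torque| above soft_factor * hard_limit."""
    soft_limit = soft_factor * hard_limit
    return sum(max(abs(tau) - soft_limit, 0.0) ** 2 for tau in torques)
```

Torques below the soft limit incur no penalty, so the policy is only discouraged from operating close to the hardware limit.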
### Reward Visualizer (`rewards/reward_visualizer.py`)
A real-time visualization tool for monitoring individual reward terms during evaluation. Used by the debug and pick-and-place debug environments. Each reward term is displayed as a bar chart that updates every simulation step.
## Actions
Action terms define the policy's output space and how it maps to joint commands. Located in `agile/rl_env/mdp/actions/`.
### Joint Position Actions

- `JointPositionActionCfg` (Isaac Lab built-in): Standard joint position action with configurable scale and offset. The policy outputs deltas around default joint positions.
- `DeltaJointPositionAction` / `DeltaJointPositionActionCfg`: Outputs delta joint positions that accumulate over time. Supports per-joint scaling, separate "steady" joints held at defaults, and optional joint limits. Used for manipulation tasks where incremental motion is more natural.
- `SmoothJointPositionAction` / `SmoothJointPositionActionCfg`: Wraps joint position actions with exponential moving average (EMA) smoothing. Configurable `ema_smoothing_param` (1.0 = no smoothing).
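The EMA smoothing step can be sketched as below. The exact update rule (`alpha` applied to the new target, so `alpha = ema_smoothing_param = 1.0` passes targets through unchanged) is an assumption consistent with the "1.0 = no smoothing" note; the real action term smooths batched joint targets.

```python
class EmaSmoother:
    """Exponential moving average over successive action targets."""

    def __init__(self, ema_smoothing_param: float):
        self.alpha = ema_smoothing_param  # 1.0 disables smoothing
        self.state = None

    def __call__(self, target):
        if self.state is None:
            self.state = list(target)  # first target passes through
        else:
            self.state = [
                self.alpha * t + (1.0 - self.alpha) * s
                for t, s in zip(target, self.state)
            ]
        return self.state
```

With `alpha < 1.0`, abrupt changes in the raw target are spread over several timesteps, which reduces jerk at the joints.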
### Random Actions

- `RandomPositionAction` / `RandomActionCfg`: Generates random joint positions for upper-body joints during locomotion training. Supports configurable velocity profiles (EMA, linear, trapezoidal) for smooth transitions between random targets, and optional stance-mode behavior.
- `RandomJointPositionAction` / `RandomJointPositionActionCfg`: Alternative random action with curriculum support. Can gradually increase the randomization range during training for progressive difficulty.
### Policy Actions

- `AgileBasedLowerBodyAction` / `AgileLowerBodyActionCfg`: Runs a pre-trained, frozen RL policy as an action term. Used in the pick-and-place task to provide stable locomotion while training the upper-body policy. Takes the path to a JIT-exported policy model and an observation group name.
### GUI Actions

- `JointPositionGUIAction` / `JointPositionGUIActionCfg`: Interactive GUI slider control for all joints. Supports mirroring between the left and right sides and adjustable PD gains. Used in debug environments.
- `ObjectPoseGUIAction` / `ObjectPoseGUIActionCfg`: Interactive GUI control for object position and rotation. Used in object debug environments.
### Assistance Actions

- `HarnessAction` / `HarnessActionCfg`: Simulates a simplified harness by applying external forces and torques to prevent falling. Configurable stiffness, damping, and force/torque limits. Supports height commands for a dynamic target height.
- `LiftAction` / `LiftActionCfg`: Applies upward forces to lift the robot, with configurable ramp-up timing. Used in stand-up training with a curriculum that gradually reduces the assistance. Supports a delayed start (`start_lifting_time_s`) and ramped lifting (`lifting_duration_s`).
### Velocity Profiles

Located in `actions/velocity_profiles/`, these define how random upper-body actions transition between targets. All profiles use fully vectorized batch operations and support synchronized joint motion.
- EMA: exponential moving average toward the target.
- Linear: constant-velocity motion toward the target.
- Trapezoidal: three-phase motion (acceleration, cruise, deceleration). Physically realistic; configured via parameters such as `acceleration_range`, `max_velocity_range`, and `synchronize_joints`.
Usage example:

```python
from agile.rl_env.mdp.actions.velocity_profiles import TrapezoidalVelocityProfileCfg
from agile.rl_env.mdp.actions import RandomActionCfg

action_cfg = RandomActionCfg(
    asset_name="robot",
    joint_names=["joint1", "joint2"],
    sample_range=(0.1, 1.5),
    velocity_profile_cfg=TrapezoidalVelocityProfileCfg(
        acceleration_range=(1.0, 3.0),
        max_velocity_range=(0.5, 2.0),
        synchronize_joints=True,
    ),
    no_random_when_walking=True,
)
```
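For intuition, the position trajectory a trapezoidal profile produces can be sketched for a single 1-D move. The function name and scalar form are illustrative, not the library's API, and the sketch assumes the move is long enough to reach cruise speed (no triangular case).

```python
def trapezoidal_position(t: float, distance: float, accel: float, v_max: float) -> float:
    """Position along a 1-D move with accelerate / cruise / decelerate phases."""
    t_acc = v_max / accel                  # time to reach cruise speed
    d_acc = 0.5 * accel * t_acc**2         # distance covered while accelerating
    d_cruise = distance - 2.0 * d_acc      # distance traveled at constant velocity
    t_cruise = d_cruise / v_max
    if t < t_acc:                          # acceleration phase
        return 0.5 * accel * t**2
    if t < t_acc + t_cruise:               # cruise phase
        return d_acc + v_max * (t - t_acc)
    t_dec = t - t_acc - t_cruise           # deceleration phase
    return d_acc + d_cruise + v_max * t_dec - 0.5 * accel * t_dec**2
```

Velocity ramps up linearly, holds at `v_max`, then ramps back down, which is why the velocity-time curve is a trapezoid.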
To visualize and compare all profiles:

```shell
python agile/rl_env/mdp/actions/velocity_profiles/test_profile_comparison.py
python agile/rl_env/mdp/actions/velocity_profiles/test_profile_comparison.py --save-figure
```
## Commands

Command generators produce the reference signals that the policy must track. Located in `agile/rl_env/mdp/commands/`.
### Velocity Commands

`UniformNullVelocityCommand` / `UniformNullVelocityCommandCfg`: Generates random velocity commands (linear x, y and angular yaw) with a configurable fraction of environments receiving zero-velocity ("stance") commands. Extends Isaac Lab's `UniformVelocityCommand` with:

- EMA smoothing: smooth velocity measurement for reward computation.
- Minimum velocity norm: commands below this threshold are zeroed out.
- Bias sampling: an option to sample more low-speed commands for better stance training.
- Command filtering: per-axis low-pass filtering for smooth command transitions.
### Velocity + Height Commands

`UniformVelocityBaseHeightCommand` / `UniformVelocityBaseHeightCommandCfg`: Extends velocity commands with a base height command. Includes:

- Minimum walk height: below this, velocity commands are scaled down to prevent walking while crouched.
- Squatting threshold: zeroes velocities when transitioning to low heights.
- Height sensor integration: uses a ray caster to measure the height above the terrain.
### Velocity + Height + Gait Commands

`UniformVelocityGaitBaseHeightCommand` / `UniformVelocityGaitBaseHeightCommandCfg`: Adds gait phase information to velocity + height commands. Provides `gait_process` (current phase in [0, 1]) and `gait_frequency` signals used by gait-cycle rewards to enforce proper foot timing.
### Trajectory Tracking Commands

`TrackingCommand` / `TrackingCommandCfg`: Generates reference poses from pre-recorded YAML trajectory files. Tracks:

- Anchor body position and orientation (global reference)
- Joint positions for specified tracked joints
- Object position and orientation (if object tracking is configured)

It also performs automatic peak detection for pick-and-place phase gating.
## Observations

Observation terms define what the policy sees. Located in `agile/rl_env/mdp/observations/`.
### Observation Groups

Task configs define observation groups as nested `ObsGroup` dataclasses. Common groups:

- `policy`: Observations available to the deployed policy (proprioceptive only).
- `critic`: Additional observations for the critic during training (can include privileged info).
- `teacher`: Privileged observations for teacher policies (e.g., terrain height scans).
### Standard Observations
Most observations come from Isaac Lab’s built-in terms:
- Base angular velocity in the body frame
- Gravity vector projected into the body frame (orientation indicator)
- Joint positions relative to defaults
- Joint velocities relative to defaults
- Previous policy action
- Current command vector
- Terrain height scan from a ray caster (privileged)
### Custom Observations

Defined in `observations/observations_io.py`:

- Velocity + height command vector for evaluation logging
- Joint accelerations
### Tracking Observations

Defined in `observations/tracking_observations.py` for the pick-and-place task. These provide the current and target states for trajectory tracking.
### History Stacking

Observation groups support `history_length` to stack multiple timesteps. For example, `history_length=5` concatenates the last 5 observation vectors, giving the policy temporal context without recurrence.
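Independent of Isaac Lab's implementation, the stacking itself can be sketched as a fixed-length buffer. Zero-padding until enough steps have elapsed is an assumption here, not a documented behavior.

```python
from collections import deque

class ObsHistory:
    """Concatenate the last `history_length` observation vectors, oldest first."""

    def __init__(self, history_length: int, obs_dim: int):
        zero = [0.0] * obs_dim
        self.buf = deque([zero] * history_length, maxlen=history_length)

    def push(self, obs):
        self.buf.append(list(obs))
        # Flatten the stacked frames into one policy input vector.
        return [x for frame in self.buf for x in frame]
```

The policy input dimension becomes `history_length * obs_dim`, which is the price paid for temporal context without a recurrent network.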
## Terminations

Termination conditions end episodes early. Located in `agile/rl_env/mdp/terminations.py`.
### Standard Terminations

- Episode exceeds the maximum length (marked as a timeout, not a failure).
- A non-foot body contacts the ground above a force threshold while below the minimum height.
- Base height drops below a threshold (adjusted for terrain).
### Adaptive Terminations

- Terminates when the robot falls a configurable distance below its peak achieved height. More adaptive than fixed thresholds since it is relative to progress. Clamps the maximum trackable height to ignore jumping.
- Terminates if no upward progress is made within a time window.
- Terminates when the robot stands above a given height for a specified duration (used as a success condition in stand-up).
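The peak-relative fall check can be sketched per environment as follows; the class name and the exact clamping rule are illustrative assumptions based on the description above.

```python
class FallFromPeakCheck:
    """Flag termination once height drops `max_drop` below the best height so far."""

    def __init__(self, max_drop: float, height_clamp: float):
        self.max_drop = max_drop          # allowed drop below the peak
        self.height_clamp = height_clamp  # ignore heights above this (e.g. jumps)
        self.peak = 0.0

    def step(self, height: float) -> bool:
        # Track the best (clamped) height achieved so far.
        self.peak = max(self.peak, min(height, self.height_clamp))
        return height < self.peak - self.max_drop
```

Because the threshold follows the robot's own best height, a robot that has only half-risen is not punished by a fixed standing-height criterion, but losing hard-won progress still ends the episode.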
### Trajectory Terminations

- Base position error exceeds a threshold relative to the reference trajectory.
- Base orientation error exceeds a threshold relative to the reference.
- Joint position error exceeds a threshold relative to the reference.
- The object leaves a defined bounding box (supports reference-frame transforms).
- The distance between two specified links falls outside the allowed range.
## Events

Event terms handle environment resets and domain randomization. Located in `agile/rl_env/mdp/events/`.
### Reset Events

- Reset joint positions and velocities to random values around defaults, clipped to the soft limits.
- Reset the robot base pose (Isaac Lab built-in).
### Randomization Events

- Temporarily disable specified joints for a random duration during an episode. Simulates actuator failures for robustness.
### Fallen State Management

For the stand-up task, specialized event infrastructure manages pre-collected fallen states:

- `FallenStateDataset` (`events/fallen_state_dataset.py`): Manages collection and storage of fallen robot states. Spawns robots, lets them fall, and records the resulting joint positions and velocities.
- `FallenStateCache` (`events/fallen_state_cache.py`): Disk caching with automatic invalidation when the terrain configuration changes.
- `reset_from_fallen_dataset` (`events/reset_from_fallen_dataset.py`): Episode reset event that samples from the fallen state dataset instead of simulating falls in real time.
## Curriculum

Curriculum terms adjust training difficulty over time. Located in `agile/rl_env/mdp/curriculums/`.
### Terrain Curriculum (`task_curriculum.py`)

`initial_pose_curriculum`: Progresses robots through terrain difficulty levels based on the distance walked. Robots that walk far enough move to harder terrains; robots that underperform move to easier ones.
### Effort Limit Curriculum (`task_curriculum.py`)

`effort_limit_curriculum`: Gradually decreases actuator effort limits over training. Starts with inflated limits for easier initial exploration, then decays geometrically toward the real hardware limits. Adjusts both the effort limits and the saturation effort for DC motors.

`effort_limit_curriculum_traveled_distance`: The same concept, but triggered by traveled distance rather than another curriculum's state.
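A minimal sketch of the geometric decay toward the hardware limit; the schedule variable and decay rate are illustrative assumptions, not AGILE's parameters.

```python
def curriculum_effort_limit(update: int, inflated: float, hardware: float,
                            decay: float = 0.99) -> float:
    """Geometrically shrink the extra headroom above the hardware limit."""
    return hardware + (inflated - hardware) * decay**update
```

The limit starts at the inflated value and approaches, but never undershoots, the real hardware limit, so early exploration is easy while the final policy respects the actuators.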
### Harness/Lift Curriculum (`task_curriculum.py`)

Various curriculum terms reduce assistance forces over training:

- `remove_harness`: Gradually reduces harness stiffness and damping to zero.
- `adaptive_lift_curriculum`: Reduces the lift force based on the standing success rate.
### Randomization Curriculum (`randomization_curriculum.py`)

`increase_event_randomization`: Increases the range of domain randomization parameters over training. Scales event parameter ranges from an initial fraction to a terminal fraction, based on another curriculum's progress.
## Terrains

Terrain configurations for rough-terrain training. Located in `agile/rl_env/mdp/terrains/`.
### Pre-configured Terrain Sets

Defined in `terrains.py`:

- Full difficulty range with random grid boxes, random rough surfaces, and slope ramps.
- A reduced-difficulty variant for initial training stages.
### Custom Terrain Types

Defined in `hf_terrains.py` / `hf_terrains_cfg.py`:

`HfRandomUniformTerrainDifficultyCfg`: Height-field-based random uniform terrain where the noise range scales with the difficulty level. Provides smooth progression from flat to highly irregular surfaces.
## Actuators

Custom actuator models that simulate real hardware behavior. Located in `agile/rl_env/mdp/actuators/`.
### Delayed DC Motor (`DelayedDCMotor` / `DelayedDCMotorCfg`)

Extends Isaac Lab's `DCMotor` with communication delay simulation:

- Random delay: each environment gets a random delay between `min_delay` and `max_delay` timesteps, sampled at reset.
- Delay buffers: separate buffers for position, velocity, and effort signals.
- Torque-speed curve: models the DC motor characteristic where the available torque decreases with joint velocity.
### Delayed Implicit Actuator (`DelayedImplicitActuator` / `DelayedImplicitActuatorCfg`)

Extends Isaac Lab's `ImplicitActuator` with the same delay mechanism. Used for locomotion-height tasks where the actuator model is simpler but delay is still important for sim-to-real transfer.
> **Tip:** Actuator delay is critical for sim-to-real transfer. Real robots have non-negligible communication latency between the policy computer and joint controllers. Training with randomized delays makes the policy robust to this variation.
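The delay mechanism can be sketched as a per-environment buffer. Resampling the delay at reset follows the description above; holding the first value until the buffer fills is an assumption of this sketch.

```python
import random
from collections import deque

class DelayLine:
    """Return the value submitted `delay` steps ago; delay resampled at reset."""

    def __init__(self, min_delay: int, max_delay: int):
        self.min_delay, self.max_delay = min_delay, max_delay
        self.reset()

    def reset(self):
        self.delay = random.randint(self.min_delay, self.max_delay)
        self.buf = deque(maxlen=self.delay + 1)

    def step(self, value):
        self.buf.append(value)
        return self.buf[0]  # oldest buffered value (held until the buffer fills)
```

Because each environment draws its own delay, a single training run exposes the policy to the whole latency range at once.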
## Symmetry

Morphological symmetry augmentation for data-efficient training. Located in `agile/rl_env/mdp/symmetry/`.
### Purpose
Bipedal robots have left-right symmetry: a mirrored observation should produce a mirrored action. AGILE leverages this by augmenting the training data with symmetry-transformed samples, effectively doubling the data efficiency.
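The augmentation step can be sketched as follows. The three-component observation and the sign conventions are purely illustrative stand-ins for a real robot's mirror functions (see `lr_mirror_G1` below).

```python
def mirror_obs(obs):
    # Illustrative: negate lateral linear velocity and yaw rate.
    vx, vy, wz = obs
    return [vx, -vy, -wz]

def mirror_act(act):
    # Illustrative: a single lateral action component flips sign.
    return [-act[0]]

def symmetry_augment(obs_batch, act_batch):
    """Double the batch by appending mirrored copies of every sample."""
    return (obs_batch + [mirror_obs(o) for o in obs_batch],
            act_batch + [mirror_act(a) for a in act_batch])
```

Every collected transition yields a second, physically valid transition for free, which is where the doubled data efficiency comes from.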
### Robot-Specific Implementations

- `lr_mirror_G1` (`symmetry_g1.py`): Left-right mirror function for the Unitree G1 robot. Transforms observations and actions by swapping left/right joint indices and negating lateral quantities.
- `lr_mirror_T1` (`symmetry_t1.py`): Left-right mirror function for the Booster T1 robot.
### Observation Mirror Primitives (`observations.py`)
Building blocks used by robot-specific mirror functions:
- Negate the y-component of base linear velocity.
- Negate the x and z components of base angular velocity.
- Negate the y-component of projected gravity.
- Negate the y and yaw velocity commands.
- Swap the left and right height-scan regions.
- Swap the gait phase for the left and right legs.
## Stability Terms (`stability_terms.py`)
Utility functions used by multiple MDP components for computing stability-related quantities (e.g., center of mass, support polygon membership).
## Utility Functions (`utils.py`)
Shared helper functions used across MDP components:
- Get the robot asset with proper configuration, creating defaults if needed.
- Get a contact sensor with proper body-ID resolution.
- Extract body velocities and contact forces for specified bodies.
- Transform world positions to asset-local coordinates.
- Transform world positions to a body's local frame.
- Compute the axis-aligned bounding box for an asset.
- Get joint indices for a specified body part (lower, upper, or whole body).
Get joint indices for a specified body part (lower, upper, or whole body). |