Evaluation#

AGILE provides two evaluation paths, Isaac Lab and Sim2MuJoCo, that share a single workflow: load a trained policy, apply commands (deterministic schedules, sweeps, or random), roll out the policy in simulation, and save trajectory data for analysis. Both use the same YAML eval config format, produce the same Parquet output schema, and work with the same plotting and analysis tools.

The two paths differ in simulator backend and feature set:

| Aspect | Isaac Lab | Sim2MuJoCo |
| --- | --- | --- |
| Script | scripts/eval.py | scripts/sim2mujoco_eval.py |
| Simulator | Isaac Sim (GPU) | MuJoCo (CPU) |
| Parallel envs | Yes (N envs) | Single env |
| Eval config | Shared YAML format | Shared YAML format |
| Output format | Parquet + metadata.json | Parquet + metadata.json |
| Metrics (metrics.json) | Yes | No |
| HTML reports | Yes | No |
| Interactive control | No | Keyboard teleop |
| Random commands | Yes | Yes |
| Observation noise | Yes | Yes |

Isaac Lab Evaluation#

# Evaluate a trained policy
python scripts/eval.py \
    --task Velocity-T1-v0 \
    --num_envs 32 \
    --checkpoint /path/to/model.pt \
    --run_evaluation

# With trajectory saving and HTML report
python scripts/eval.py \
    --task Velocity-T1-v0 \
    --num_envs 32 \
    --checkpoint /path/to/model.pt \
    --run_evaluation \
    --save_trajectories \
    --generate_report

# With a deterministic evaluation scenario
python scripts/eval.py \
    --task Velocity-Height-G1-v0 \
    --num_envs 16 \
    --checkpoint /path/to/model.pt \
    --run_evaluation \
    --eval_config agile/algorithms/evaluation/configs/examples/x_velocity_sweep.yaml

CLI Options#

| Option | Description |
| --- | --- |
| --run_evaluation | Enable PolicyEvaluator |
| --save_trajectories | Save trajectory data to Parquet files |
| --trajectory_fields | Specific fields to save (default: all) |
| --generate_report | Generate HTML report (requires --save_trajectories) |
| --eval_config | Path to YAML scenario config for deterministic testing |

Output Structure#

logs/rsl_rl/<experiment_name>/
  trajectories/
    episode_000.parquet
    episode_001.parquet
    ...
  metrics.json
  reports/          # if --generate_report
    index.html
    episodes/
      episode_000.html
      ...

Sim2MuJoCo Evaluation#

The Sim2MuJoCo path runs policies in MuJoCo for cross-simulator validation. See Sim-to-MuJoCo Transfer for setup instructions (policy export, MJCF acquisition).

# Interactive keyboard control
python scripts/sim2mujoco_eval.py \
    --checkpoint /path/to/policy.pt \
    --config /path/to/config.yaml \
    --mjcf /path/to/scene.xml \
    --duration 30.0

# Deterministic evaluation (same YAML config format as Isaac Lab)
python scripts/sim2mujoco_eval.py \
    --checkpoint /path/to/policy.pt \
    --config /path/to/config.yaml \
    --mjcf /path/to/scene.xml \
    --eval-config agile/sim2mujoco/configs/x_velocity_sweep.yaml \
    --save-data --no-viewer

# Random commands (reproducible with seed)
python scripts/sim2mujoco_eval.py \
    --checkpoint /path/to/policy.pt \
    --config /path/to/config.yaml \
    --mjcf /path/to/scene.xml \
    --random-commands all --random-interval 2.0 --random-seed 42 \
    --duration 50.0 --save-data --no-viewer

CLI Options#

| Option | Description |
| --- | --- |
| --checkpoint | Path to policy checkpoint (.pt or .onnx) |
| --config | Path to exported I/O descriptor YAML |
| --mjcf | Path to MuJoCo MJCF file (overrides config default) |
| --duration | Simulation duration in seconds |
| --eval-config | Path to YAML eval config (deterministic command schedule) |
| --save-data | Save trajectory data to Parquet files |
| --output-dir | Custom output directory for saved data |
| --random-commands | Randomize commands: field names (vx, vy, wz, height) or all |
| --random-interval | Seconds between random resamples (default: 2.0) |
| --random-seed | RNG seed for reproducible random commands |
| --noise-scale | Observation noise scale (0=off, 1=match training, >1=stress test) |
| --pd-scale | Scale factor for PD gains (use 0.3–0.5 for stability) |
| --no-viewer | Disable MuJoCo viewer (headless mode) |
| --no-real-time | Disable real-time pacing (runs as fast as possible) |

Command Modes#

Three mutually exclusive command modes are available:

  • Keyboard control (default): Interactive teleoperation via the MuJoCo viewer. Arrow keys for movement, U/O for turning, Page Up/Down for height.

  • Eval config (--eval-config): Deterministic command schedules from YAML files, using the same format as Isaac Lab evaluation. Duration is set from the config’s episode_length_s.

  • Random commands (--random-commands): Uniform random resampling at a fixed interval. Specify individual fields (vx, vy, wz, height) or all. Use --random-seed for reproducibility.

Note

--eval-config and --random-commands are mutually exclusive. Keyboard control is automatically disabled when either is active or when --no-viewer is set.
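The random-command mode described above can be sketched as a seeded resampling loop. This is an illustrative reconstruction, not the script's actual implementation; the field ranges below are made up for the example, and the real limits come from the task config.

```python
import random

# Hypothetical field ranges for illustration only; the real limits
# come from the task config.
FIELD_RANGES = {
    "vx": (-1.0, 1.0),
    "vy": (-0.5, 0.5),
    "wz": (-1.0, 1.0),
    "height": (0.6, 0.8),
}

def resample_commands(fields, rng):
    """Draw a new uniform random value for each selected field."""
    return {f: rng.uniform(*FIELD_RANGES[f]) for f in fields}

def command_schedule(fields, duration, interval, seed):
    """Resample at a fixed interval; the same seed yields the same sequence."""
    rng = random.Random(seed)
    schedule = []
    t = 0.0
    while t < duration:
        schedule.append((t, resample_commands(fields, rng)))
        t += interval
    return schedule

sched = command_schedule(["vx", "wz"], duration=6.0, interval=2.0, seed=42)
```

The seeded `random.Random` instance is what makes `--random-seed` reproducible: rerunning with the same seed replays the identical command sequence.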

Output Structure#

logs/sim2mujoco/<task>/<eval_config>_<timestamp>/
  trajectories/
    metadata.json
    episode_000.parquet

The Parquet schema matches the Isaac Lab output: per-index signal columns joint_pos_{i}, joint_vel_{i}, joint_acc_{i}, root_pos_{i}, root_lin_vel_robot_{i}, commands_{i}, and actions_{i}, plus metadata columns (episode_id, env_id, frame_idx, timestep).
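Because the columns follow a `<signal>_<index>` pattern plus a fixed set of metadata columns, they can be grouped mechanically. A minimal sketch (the helper name and sample column list are illustrative):

```python
from collections import defaultdict

# Metadata columns from the shared schema; everything else follows the
# <signal>_<index> naming pattern (e.g. joint_pos_0).
METADATA_COLS = {"episode_id", "env_id", "frame_idx", "timestep"}

def group_columns(columns):
    """Partition Parquet column names into metadata and per-signal groups."""
    groups = defaultdict(list)
    for col in columns:
        if col in METADATA_COLS:
            groups["metadata"].append(col)
        else:
            prefix, _, _idx = col.rpartition("_")
            groups[prefix].append(col)
    return dict(groups)

cols = ["joint_pos_0", "joint_pos_1", "actions_0", "episode_id", "timestep"]
grouped = group_columns(cols)
```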

Deterministic Scenario Configs#

Both evaluation paths use the same YAML config format for deterministic testing. Configs define controlled command sequences instead of random commands, enabling systematic and reproducible evaluation.

Isaac Lab configs live in agile/algorithms/evaluation/configs/examples/; Sim2MuJoCo configs live in agile/sim2mujoco/configs/. The format is identical – only task-specific values (sweep ranges, durations) differ.

Two specification modes are available:

Sweep Mode#

Uniform time intervals cycling through a list of values:

evaluation:
  task_name: "Velocity-Height-G1-Dev-v0"
  num_envs: 4
  episode_length_s: 50.0
  num_episodes: 1

  environments:
    - env_ids: [0]
      name: "x_velocity_test"
      sweep:
        interval: 5.0
        commands:
          base_velocity:
            lin_vel_x: [-1.0, 0.0, 1.0]
            lin_vel_y: 0.0
            ang_vel_z: 0.0
            base_height: 0.75
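The sweep semantics above reduce to an index computation: at time t, the active value is the (t // interval)-th entry of the list. A small sketch, assuming the sweep wraps around after the last value (the wrap-around behavior is an inference, not confirmed by the source):

```python
# Sweep mode: every `interval` seconds the command advances to the
# next value in the list, wrapping around (assumed behavior).
def sweep_value(t, interval, values):
    return values[int(t // interval) % len(values)]

lin_vel_x = [-1.0, 0.0, 1.0]
v0 = sweep_value(0.0, 5.0, lin_vel_x)    # first interval  -> -1.0
v7 = sweep_value(7.0, 5.0, lin_vel_x)    # second interval ->  0.0
v16 = sweep_value(16.0, 5.0, lin_vel_x)  # wraps after 15 s -> -1.0
```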

Schedule Mode#

Explicit time-based command sequences for complex maneuvers:

environments:
  - env_ids: [0]
    name: "complex_maneuver"
    schedule:
      - time: 0.0
        commands:
          base_velocity:
            lin_vel_x: 0.5
            lin_vel_y: 0.0
            ang_vel_z: 0.0
            base_height: 0.75
      - time: 10.0
        commands:
          base_velocity:
            lin_vel_x: 1.0
            lin_vel_y: 0.0
            ang_vel_z: 0.0
            base_height: 0.75
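Schedule mode amounts to a "latest entry not after t" lookup: the active command set is the last entry whose time is at or before the current simulation time. A sketch under that assumption, with entries assumed sorted by time:

```python
from bisect import bisect_right

# Schedule entries as (time, commands) pairs, mirroring the YAML above.
schedule = [
    (0.0,  {"lin_vel_x": 0.5}),
    (10.0, {"lin_vel_x": 1.0}),
]
times = [t for t, _ in schedule]

def active_commands(t):
    """Return the commands of the latest entry with time <= t."""
    return schedule[bisect_right(times, t) - 1][1]
```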

Multi-Environment Testing#

Assign different tests to different environments (Isaac Lab only – Sim2MuJoCo runs a single env):

environments:
  - env_ids: [0, 1]
    name: "test_a"
    sweep: ...

  - env_ids: [2]
    name: "test_b"
    schedule: ...

Unassigned environments use random commands (training behavior).
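The assignment logic can be sketched as a dictionary with a random-command fallback (helper name and `"random"` sentinel are illustrative, not the framework's API):

```python
# Resolve each env id to its named test, defaulting to random commands
# (training behavior) for unassigned environments.
def assign_envs(num_envs, environments):
    assignment = {i: "random" for i in range(num_envs)}
    for spec in environments:
        for env_id in spec["env_ids"]:
            assignment[env_id] = spec["name"]
    return assignment

envs = [
    {"env_ids": [0, 1], "name": "test_a"},
    {"env_ids": [2], "name": "test_b"},
]
mapping = assign_envs(4, envs)  # env 3 falls back to "random"
```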

Pre-built Scenarios#

Isaac Lab examples in agile/algorithms/evaluation/configs/examples/:

| Config | Description |
| --- | --- |
| x_velocity_sweep.yaml | Forward/backward walking |
| y_velocity_sweep.yaml | Lateral movement |
| yaw_rate_sweep.yaml | Turning |
| height_sweep.yaml | Height control |
| multi_env_capability_test.yaml | All capabilities in parallel (one per env) |
| explicit_schedule_example.yaml | Complex maneuver sequence |

Sim2MuJoCo examples in agile/sim2mujoco/configs/:

| Config | Description |
| --- | --- |
| x_velocity_sweep.yaml | Forward/backward velocity sweep |
| y_velocity_sweep.yaml | Lateral velocity sweep |
| yaw_rate_sweep.yaml | Turning rate sweep |
| height_sweep.yaml | Base height sweep (velocity+height tasks) |

All base_velocity commands must specify all 4 fields (lin_vel_x, lin_vel_y, ang_vel_z, base_height). Commands are automatically clamped to valid ranges defined in the task config.
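Clamping is a per-field min/max against the task's command ranges. A minimal sketch with illustrative ranges (the real limits live in the task config):

```python
# Illustrative command ranges; the actual limits come from the task config.
RANGES = {
    "lin_vel_x": (-1.0, 1.0),
    "lin_vel_y": (-0.5, 0.5),
    "ang_vel_z": (-1.0, 1.0),
    "base_height": (0.6, 0.8),
}

def clamp_commands(cmd):
    """Clamp each of the four base_velocity fields to its valid range."""
    return {k: min(max(v, RANGES[k][0]), RANGES[k][1]) for k, v in cmd.items()}

clamped = clamp_commands(
    {"lin_vel_x": 2.0, "lin_vel_y": 0.0, "ang_vel_z": -3.0, "base_height": 0.75}
)
```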

Tip

Start with num_envs: 1 to validate configs. Use longer episodes than training (e.g., 50s vs 30s) for thorough testing.

HTML Reports#

The Isaac Lab evaluation path generates interactive HTML reports with tracking analysis and per-joint plots. The Sim2MuJoCo path does not generate reports directly, but its Parquet output is compatible with the plotting API for custom analysis (see Analyzing Trajectories below).

Generation#

# Automatic (during evaluation)
python scripts/eval.py --task <task_name> --checkpoint path/to/model.pt \
    --run_evaluation --save_trajectories --generate_report

# Manual (after evaluation)
python agile/algorithms/evaluation/generate_report.py \
    --log_dir logs/evaluation/task_datetime

# Specific or failed episodes only
python agile/algorithms/evaluation/generate_report.py \
    --log_dir logs/evaluation/task_datetime \
    --episodes failed

# Specific episode IDs
python agile/algorithms/evaluation/generate_report.py \
    --log_dir logs/evaluation/task_datetime \
    --episodes 0,3,5

Report Contents#

  • Summary Dashboard (index.html): Success rate, sortable episode table with search/filter, tracking error summary plots

  • Episode Pages (episodes/episode_XXX.html): Tracking performance (lin_vel_x, lin_vel_y, ang_vel_z, height), all joints organized by body part (upper/lower) with collapsible sections, joint position and velocity limits shown, interactive Plotly plots (zoom, pan, hover)

Analyzing Trajectories (Python/Jupyter)#

The plotting API works with trajectory data from both evaluation paths, since they share the same Parquet schema and metadata format.

import sys
sys.path.insert(0, "agile/algorithms/evaluation")
from plotting import load_episode, load_metadata, plot_joint_trajectories
import matplotlib.pyplot as plt

# Works with either Isaac Lab or Sim2MuJoCo output directories
metadata = load_metadata("logs/rsl_rl/experiment")
df = load_episode("logs/rsl_rl/experiment", episode_id=0)

fig, axes = plot_joint_trajectories(
    df,
    joint_names=['left_hip_yaw_joint', 'right_knee_joint'],
    metadata=metadata,
    show_limits=True,
)
plt.show()

Evaluation Framework Internals#

The evaluation framework lives in agile/algorithms/evaluation/.

PolicyEvaluator#

The main evaluation class (evaluator.py) that collects trajectory data from policy rollouts and computes metrics:

  • Requires an eval observation group providing joint positions, velocities, accelerations, root state, commands, and actions

  • Handles terminal state observations correctly by using previous-frame data for terminated environments

  • Supports configurable joint groups for per-body-part metrics

  • Optionally saves trajectory data to Parquet files for offline analysis
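The terminal-state handling can be illustrated with a small stand-in: for environments that terminated this step, the logged observation is taken from the previous frame rather than the post-reset state. This is a pure-Python sketch of the idea, not the evaluator's batched tensor code:

```python
# For terminated envs, log the previous frame's data instead of the
# post-reset observation, which belongs to the next episode.
def select_logged_obs(obs, prev_obs, terminated):
    return [p if done else o for o, p, done in zip(obs, prev_obs, terminated)]

obs      = [0.0, 9.9, 2.0]   # env 1 was reset; 9.9 is post-reset data
prev_obs = [0.1, 1.1, 2.1]
logged = select_logged_obs(obs, prev_obs, terminated=[False, True, False])
# -> [0.0, 1.1, 2.0]
```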

MotionMetricsAnalyzer#

Computes and aggregates motion quality metrics (motion_metrics_analyzer.py):

  • Mean/max joint acceleration: Smoothness indicator (lower is better)

  • Mean/max acceleration rate (jerk): Jerkiness indicator

  • Mean/max joint velocity: Activity level

  • All metrics computed for whole body and per joint group

  • Separate statistics for all episodes vs. successful episodes only

  • Results saved as JSON with grouped metrics
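The smoothness metrics above follow from finite differences: acceleration as the difference of velocities over dt, jerk as the difference of accelerations, then mean/max of absolute values. A sketch with illustrative numbers (dt and the velocity series are made up for the example):

```python
# Finite-difference chain: velocity -> acceleration -> jerk.
def finite_diff(series, dt):
    return [(b - a) / dt for a, b in zip(series, series[1:])]

def mean_max_abs(series):
    """Mean and max of absolute values, as in the aggregated metrics."""
    mags = [abs(x) for x in series]
    return sum(mags) / len(mags), max(mags)

dt = 1.0  # illustrative step; real data uses the sim timestep
joint_vel = [0.0, 1.0, 3.0, 6.0]
joint_acc = finite_diff(joint_vel, dt)   # [1.0, 2.0, 3.0]
joint_jerk = finite_diff(joint_acc, dt)  # [1.0, 1.0]
acc_mean, acc_max = mean_max_abs(joint_acc)
```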

TrajectoryReportGenerator#

Generates interactive HTML reports from saved trajectory data (report_generator.py):

  • Uses Plotly for interactive, zoomable plots

  • Supports filtering by success/failure status

  • Works standalone without Isaac Sim (only requires pandas, plotly, jinja2)