
UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under UnifoLM Family

Project Page | Models | Dataset

🌎English | 🇨🇳中文

UnifoLM-WMA-0 is Unitree's open-source world-model-action architecture spanning multiple types of robotic embodiments, designed for general-purpose robot learning. Its core component is a world model capable of understanding the physical interactions between robots and their environments. The world model provides two key functions: (a) Simulation Engine: operates as an interactive simulator to generate synthetic data for robot learning; (b) Policy Enhancement: connects with an action head and, by predicting future interaction processes with the world model, further optimizes decision-making performance.

🦾 Real Robot Deployment

Note: the top-right window shows the world model's prediction of future environmental changes.

📑 Open-source Plan

  • Training
  • Inference
  • Checkpoints
  • Deployment

⚙️ Installation

conda create -n unifolm-wma python==3.10.18
conda activate unifolm-wma

conda install pinocchio=3.2.0 -c conda-forge -y
conda install ffmpeg=7.1.1 -c conda-forge

git clone --recurse-submodules https://github.com/unitreerobotics/unifolm-world-model-action.git

# If you already downloaded the repo:
cd unifolm-world-model-action
git submodule update --init --recursive

pip install -e .

cd external/dlimp
pip install -e .
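
After installation, a quick sanity check can confirm that both editable installs are visible to Python. This is a sketch; the package names are inferred from `src/unitree_worldmodel` and `external/dlimp` in this repository.

```python
import importlib.util

# Report whether the two editable installs are importable.
for pkg in ("unitree_worldmodel", "dlimp"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'missing'}")
```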

🧰 Model Checkpoints

Model Description Link
\text{UnifoLM-WMA-0}_{Base} Fine-tuned on the Open-X dataset. HuggingFace
\text{UnifoLM-WMA-0}_{Dual} Fine-tuned on five Unitree open-source datasets in both decision-making and simulation modes. HuggingFace

🛢️ Dataset

In our experiments, we consider the following five open-source datasets:

Dataset Robot Link
Z1_StackBox Unitree Z1 Huggingface
Z1_DualArm_StackBox Unitree Z1 Huggingface
Z1_DualArm_StackBox_V2 Unitree Z1 Huggingface
Z1_DualArm_Cleanup_Pencils Unitree Z1 Huggingface
G1_Pack_Camera Unitree G1 Huggingface

To train on your own dataset, first make sure the data follows the Hugging Face LeRobot dataset format. Assume the dataset's source directory structure is as follows:

source_dir/
    ├── dataset1_name
    ├── dataset2_name
    ├── dataset3_name
    └── ...

Then, convert a dataset to the required format using the command below:

cd prepare_data
python prepare_training_data.py \
    --source_dir /path/to/your/source_dir \
    --target_dir /path/to/save/the/converted/data \
    --dataset_name "dataset1_name" \
    --robot_name "a tag of the robot in the dataset" # e.g., Unitree Z1 Robot Arm or Unitree G1 Robot with Gripper.

The resulting data structure is shown below. (Note: model training only supports input from the main-view camera. If the dataset includes multiple views, remove the corresponding values from the data_dir column in the CSV file.)

target_dir/
    ├── videos
    │     ├──dataset1_name
    │     │   ├──camera_view_dir
    │     │       ├── 0.mp4
    │     │       ├── 1.mp4
    │     │       └── ...
    │     └── ...
    ├── transitions
    │    ├── dataset1_name
    │        ├── meta_data
    │        ├── 0.h5
    │        ├── 1.h5
    │        └── ...
    └──  dataset1_name.csv
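
Since training only reads the main-view camera, multi-view datasets need the extra views pruned from the CSV. A minimal sketch of filtering rows by the `data_dir` column (the column name comes from the note above; the main-view directory name and any other columns are assumptions about your dataset):

```python
import csv

def keep_main_view(csv_in: str, csv_out: str, main_view: str = "main_camera") -> int:
    """Keep only rows whose data_dir points at the main-view camera.

    'data_dir' is the column named in the README; 'main_camera' is a
    placeholder for your dataset's main-view directory name.
    Returns the number of rows kept.
    """
    with open(csv_in, newline="") as f:
        rows = list(csv.DictReader(f))
    kept = [r for r in rows if main_view in r["data_dir"]]
    with open(csv_out, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)
```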

🚴‍♂️ Training

A. Our training strategy is outlined as follows:

  • Step 1: Fine-tune a video generation model as the world model using the Open-X dataset;
  • Step 2: Post-train \text{UnifoLM-WMA} in decision-making mode on the downstream task dataset;
  • Step 3: Post-train \text{UnifoLM-WMA} in simulation mode on the downstream task dataset.

Note: If you only require \text{UnifoLM-WMA} to operate in a single mode, you may skip the corresponding step.

B. To conduct training on a single or multiple datasets, please follow the steps below:

  • Step 1: The maximum DoF is assumed to be 16; if you have more than 16 DoF, update agent_state_dim and agent_action_dim in configs/train/config.yaml;
  • Step 2: Set up the input shapes for each modality in configs/train/meta.json;
  • Step 3: Configure the training parameters in configs/train/config.yaml. For the pretrained_checkpoint, we recommend using the checkpoint " \text{UnifoLM-WMA-0}_{Base} " fine-tuned on the Open-X dataset;
    model:
        pretrained_checkpoint: /path/to/pretrained/checkpoint;
        ...
        dicision_making_only: True # Train the world model only in decision-making mode. If False, jointly train it in both decision-making and simulation modes.
        ...
    data:
        ...
        train:
            ...
            data_dir: /path/to/training/dataset/directory
        dataset_and_weights: # list the name of each dataset below and make sure the summation of weights is 1.0
            dataset1_name: 0.2
            dataset2_name: 0.2
            dataset3_name: 0.2
            dataset4_name: 0.2
            dataset5_name: 0.2
    
  • Step 4: Setup experiment_name, save_root variables in scripts/train.sh;
  • Step 5: Launch the training with the command:
bash scripts/train.sh
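
The `dataset_and_weights` entries in `configs/train/config.yaml` must sum to 1.0. A small sketch for deriving them proportionally from per-dataset episode counts (the counts and dataset names here are purely illustrative):

```python
def dataset_weights(episode_counts: dict[str, int]) -> dict[str, float]:
    """Return per-dataset weights, proportional to episode count, summing to 1.0."""
    total = sum(episode_counts.values())
    return {name: count / total for name, count in episode_counts.items()}

# Illustrative counts only; substitute your own datasets.
weights = dataset_weights({"dataset1_name": 200, "dataset2_name": 300})
```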

🌏 Inference under the Interactive Simulation Mode

To run the world model in an interactive simulation mode, follow these steps:

  • Step 1: (Skip this step if you just want to test with the provided examples.) Prepare your own prompt following the format used in examples/world_model_interaction_prompts:
    world_model_interaction_prompts/
      ├── images
      │    ├── dataset1_name
      │    │       ├── 0.png     # Image prompt
      │    │       └── ...
      │    └── ...
      ├── transitions
      │    ├── dataset1_name
      │    │       ├── meta_data # Used for normalization
      │    │       ├── 0.h5      # Robot state and action data; in interaction mode,
      │    │       │             # only used to retrieve the robot state corresponding 
      │    │       │             # to the image prompt
      │    │       └── ...
      │    └── ...
      ├──  dataset1_name.csv     # File for loading image prompts, text instruction and corresponding robot states
      └── ...
    
  • Step 2: Specify the correct paths for pretrained_checkpoint (e.g., \text{UnifoLM-WMA-0}_{Dual}) and data_dir in configs/inference/world_model_interaction.yaml;
  • Step 3: Set the paths for checkpoint, res_dir and prompt_dir in scripts/run_world_model_interaction.sh, and specify all dataset names in datasets=(...). Then launch the inference with the command:
    bash scripts/run_world_model_interaction.sh
    

📝 Codebase Architecture

Here's a high-level overview of the project's code structure and core components:

unitree-world-model/
    ├── assets                      # Media assets such as GIFs, images, and demo videos
    ├── configs                     # Configuration files for training and inference
    │    ├── inference
    │    └──  train
    ├── examples                    # Example inputs and prompts for running inference
    ├── external                    # External packages
    ├── prepare_data                # Scripts for dataset preprocessing and format conversion
    ├── scripts                     # Main scripts for training, evaluation, and deployment
    ├── src
    │    ├──unitree_worldmodel      # Core Python package for the Unitree world model
    │    │      ├── data            # Dataset loading, transformations, and dataloaders
    │    │      ├── models          # Model architectures and backbone definitions
    │    │      ├── modules         # Custom model modules and components
    │    │      └──  utils          # Utility functions and common helpers

🙏 Acknowledgement

Much of the code is inherited from DynamiCrafter, Diffusion Policy, ACT and HPT.
