Zhuoyang Liu | 刘卓洋

Hi there! I'm a third-year undergraduate student at Peking University, majoring in Computer Science. I am currently a visiting student at BAIR, UC Berkeley, advised by Prof. Trevor Darrell. Before that, I was a research intern at the HMI Lab at Peking University, advised by Prof. Shanghang Zhang. My research interests lie in applying large multimodal models, particularly vision-language-action (VLA) models, to robot manipulation.

Email  /  Scholar  /  Twitter  /  GitHub  /  WeChat

profile photo

Research

I'm interested in Computer Vision, Robot Learning, and Embodied Large Multimodal Models. My research focuses on enhancing large multimodal models to reason about the physical world and to plan tasks effectively. Some papers are highlighted.

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
Qinwen Xu*, Jiaming Liu*, Rui Zhou*, Shaojun Shi*, Nuowei Han*, Zhuoyang Liu,
Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia,
Shanghang Zhang
arXiv, 2026
project page / arXiv

TwinRL is a collaborative reinforcement learning framework that couples a digital twin with the real world to scale and guide exploration for VLA models.

LaST0: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model
Zhuoyang Liu*, Jiaming Liu*, Hao Chen*, Jiale Yu, Ziyu Guo, Chengkai Hou,
Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, Zhengping Che, Jian Tang,
Pheng-Ann Heng, Shanghang Zhang
arXiv, 2026
project page / arXiv

A VLA model that performs efficient reasoning before acting via a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize.

ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Chenyang Gu*, Jiaming Liu*, Hao Chen*, Runzhong Huang*, Qingpo Wuwu, Zhuoyang Liu,
Xiaoqi Li, Ying Li, Renrui Zhang, Peng Jia, Pheng-Ann Heng, Shanghang Zhang
CVPR, 2026
project page / arXiv / video

A unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution.

DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Zhen Fang*, Zhuoyang Liu*, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang,
Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao
arXiv, 2025
project page / arXiv

DualVLA improves action performance through carefully designed post-training while preserving the model's reasoning ability.

MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
Zhuoyang Liu*, Jiaming Liu*, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen,
Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang,
Shanghang Zhang
ICRA, 2026
project page / arXiv / code

A multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling.

RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence
Chengkai Hou*, Kun Wu*, Jiaming Liu*, Zhengping Che*, Di Wu*, Fei Liao*, Guangrun Li*,
Jingyang He*, Qiuxuan Feng*, Zhao Jin*, Chenyang Gu, Zhuoyang Liu, Nuowei Han,
Xiangju Mi, Yaoxu Lyu, Yankai Fu, Gaole Dai, Langzhe Gu, Tao Li, Yuheng Zhang,
Yixue Zhang, Xinhua Wang, Shichao Fan, Meng Li, Zhen Zhao, Ning Liu, Zhiyuan Xu,
Pei Ren, Junjie Ji, Haonan Liu, Kuan Cheng, Shanghang Zhang, Jian Tang
arXiv, 2025
project page / arXiv / dataset

RoboMIND 2.0 is a comprehensive real-world dataset comprising over 310K dual-arm manipulation trajectories collected across six distinct robot embodiments and 739 complex tasks.

AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
Sixiang Chen*, Jiaming Liu*, Siyuan Qian*, Han Jiang, Xiaoqi Li, Renrui Zhang,
Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
NeurIPS, 2025
project page / arXiv / code

Adaptive Coordination Diffusion Transformer (AC-DiT) enhances mobile base and manipulator coordination for end-to-end mobile manipulation.

Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning
Hao Chen*, Jiaming Liu*, Chenyang Gu*, Zhuoyang Liu*, Renrui Zhang, Xiaoqi Li,
Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng
NeurIPS, 2025
project page / arXiv / code

Unlike previous dual-system VLA methods that attach a separate policy head as System 1, FiS-VLA repurposes the final transformer blocks of an intact VLM as System 1, while retaining the full model for System 2 reasoning.

3DS-VLA: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation
Xiaoqi Li*, Liang Heng*, Jiaming Liu, Yan Shen, Chenyang Gu, Zhuoyang Liu,
Hao Chen, Nuowei Han, Renrui Zhang, Hao Tang, Shanghang Zhang, Hao Dong
CoRL, 2025
paper

3DS-VLA enhances pretrained 2D vision-language models (VLMs) with comprehensive 3D awareness, enabling the prediction of robust end-effector poses.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Jiaming Liu*, Hao Chen*, Zhuoyang Liu*, Pengju An*, Renrui Zhang, Chenyang Gu,
Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao,
Kaichen Zhou, Pheng-Ann Heng, Shanghang Zhang
ICLR, 2026
project page / arXiv / code

HybridVLA integrates diffusion and autoregressive action prediction within a single LLM, fully leveraging the continuity and probabilistic nature of diffusion alongside the reasoning capabilities of autoregressive modeling.

H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos
Guangrun Li*, Yaoxu Lyu*, Zhuoyang Liu*, Chengkai Hou*, Jieyu Zhang, Shanghang Zhang
CVPR workshop, 2025
project page / arXiv / dataset

H2R is a simple data augmentation technique for robot pre-training: it extracts human hands from first-person videos and replaces them with those of different robots, generating new video data for pre-training.

Education

University of California, Berkeley
Visiting Student, Berkeley Global Access (BGA) Program
2026.01 - Present

Peking University
B.S. in Computer Science, Yuanpei College
2023.08 - Present

Research Experience

Beijing Innovation Center of Humanoid Robotics (X-Humanoid)
Research Intern
2025.08 - 2026.01

Research on Embodied AI and Robot Manipulation

AI2Robotics
Research Intern, X-Lab
2025.06 - 2026.01

Focused on Vision-Language-Action (VLA) models

Peking University
Research Intern, HMI (Human Machine Intelligence) Lab
2024.07 - Present

Research on Embodied AI
Research Advisor: Prof. Shanghang Zhang

Honors & Awards

2025  Academic Rising Star Award Nomination (Undergraduate Program), Peking University
2025  Third Prize, Round 1 of the RoboTwin Dual-Arm Collaboration Challenge, CVPR 2025
2024  Top 16, Mahjong AI Competition Finals, IJCAI 2024
2023  Qin-Jin Scholarship, Peking University

This homepage is based on Jon Barron's website and deployed on GitHub Pages.