Causal video-action world model for generalist robot control
We introduce LingBot-VA, an autoregressive framework that unifies video world modeling and policy learning through joint prediction of future frames and actions. Our method achieves state-of-the-art results on both real-world and simulation benchmarks, highlighting clear promise in long-horizon manipulation, data-efficient post-training, and strong generalization to novel scenes and object configurations.

It's showtime!

LingBot-VA is a versatile all-rounder, excelling across a wide range of settings—from long-horizon tasks and high-precision control to deformable and articulated object manipulation.

[Interactive chart: per-task progress rate and success rate in real-world evaluation, Ours vs. π0.5.]

Simulation

We evaluate LingBot-VA on two simulation benchmarks, RoboTwin 2.0 and LIBERO, which encompass a variety of manipulation tasks across diverse robot embodiments. Our results demonstrate consistent improvements over state-of-the-art methods.

[Interactive charts: success rates on RoboTwin 2.0 under the Easy and Hard settings (Ours vs. π0 and π0.5), and per-suite success rates on the LIBERO benchmark (Ours vs. π0, π0+FAST, and X-VLA).]

How it works

LingBot-VA is an autoregressive diffusion framework that architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence. It enables robots to simultaneously reason about future states and execute precise closed-loop control.
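
To make the interleaving concrete, the sketch below shows one way frame tokens and action tokens could alternate within a single autoregressive sequence. This is our own minimal illustration rather than the released code; the function and token names are hypothetical.

```python
# Illustrative sketch of an interleaved video-action sequence.
# Not the released implementation; tokenizer outputs are stand-ins.

def interleave(frame_tokens_per_step, action_tokens_per_step):
    """Merge per-step visual tokens and action tokens into one sequence:
    [frame_0 tokens, action_0 tokens, frame_1 tokens, action_1 tokens, ...]."""
    sequence = []
    for frame_tokens, action_tokens in zip(frame_tokens_per_step,
                                           action_tokens_per_step):
        sequence.extend(frame_tokens)   # tokens describing the observation at step t
        sequence.extend(action_tokens)  # tokens for the action chunk taken after step t
    return sequence

# Example: two steps, each with three visual tokens and two action tokens.
seq = interleave([["v00", "v01", "v02"], ["v10", "v11", "v12"]],
                 [["a00", "a01"], ["a10", "a11"]])
print(seq)  # ['v00', 'v01', 'v02', 'a00', 'a01', 'v10', ...]
```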

Large-scale Pretraining

We pretrain LingBot-VA on large-scale robotics video-action datasets to learn rich visual dynamics, establishing a strong foundation for understanding how the physical world evolves and operating within it.

Framework

Our framework operates in three stages: (1) Autoregressive video generation predicts future frames conditioned on current observations and language instructions; (2) An inverse dynamics model (IDM) decodes actions from the predicted video; (3) After execution, real observations replace the predicted frames in the video KV-cache, grounding our video-action model in actual outcomes and enabling closed-loop control.
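
As a rough sketch of one iteration of this loop, the Python below outlines prediction, action decoding, execution, and cache replacement. It is a minimal illustration under our reading of the three stages; the names `predict_next_frames`, `idm.decode`, `robot.execute`, and `replace_video_cache` are hypothetical placeholders, not the actual interfaces.

```python
# Minimal sketch of the three-stage closed loop (illustrative only; all
# function and object names are hypothetical stand-ins).

def closed_loop_step(model, idm, robot, kv_cache, instruction):
    # (1) Autoregressively predict future frames from the current context.
    predicted_frames, kv_cache = model.predict_next_frames(kv_cache, instruction)

    # (2) Decode an action chunk from the predicted frames with the IDM.
    actions = idm.decode(predicted_frames)

    # (3) Execute, then overwrite the predicted frames in the KV-cache with
    #     encodings of the real observations, grounding the model in outcomes.
    real_frames = robot.execute(actions)
    kv_cache = model.replace_video_cache(kv_cache, real_frames)
    return kv_cache
```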

Our inverse dynamics model (IDM) accurately decodes actions from predicted videos, generalizing well across diverse environments and embodiments.

[Video comparison: ground-truth vs. prediction.]

Prediction vs Reality

The predicted video (left) closely matches the real observation (right) after executing the decoded actions, demonstrating our model's accurate world modeling capability.

[Side-by-side video: prediction (left) vs. reality (right).]

Why choose a video model?

We observe distinctive strengths of autoregressive video models—most notably their long-term memory and sample efficiency. That's why we believe video models could establish a fresh and independent foundation for robot learning.

We use a simple setup that clearly tests whether the model truly has memory. In this task, the model must make decisions in sequences that contain repeated states. If the model relies only on the current observation and has no memory, it can easily get confused by repeated states and lose track of where it is in the sequence.

For example, in the state sequence A → B → A → C, a memoryless model cannot tell whether it is seeing A for the first or second time. As a result, it learns P(next∣A)=0.5 for both B and C, which can cause it to take incorrect transitions or get stuck in loops. In contrast, with access to full history, our model can distinguish the same state under different contexts and learn that after A → B → A, the next state should be C, i.e., P(C∣A→B→A)=1. This allows the model to reliably complete the sequence without getting confused by repeated states.
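
This effect can be reproduced with a toy count over the sequence. The snippet below is our own illustration (not from the paper): it contrasts a memoryless transition estimate, conditioned only on the current state, with a history-conditioned one.

```python
from collections import Counter, defaultdict

# Toy data: the sequence A -> B -> A -> C, observed many times.
episodes = [["A", "B", "A", "C"]] * 100

markov = defaultdict(Counter)   # P(next | current state only)
history = defaultdict(Counter)  # P(next | full prefix of states)

for ep in episodes:
    for t in range(len(ep) - 1):
        markov[ep[t]][ep[t + 1]] += 1
        history[tuple(ep[: t + 1])][ep[t + 1]] += 1

total = sum(markov["A"].values())
print({s: c / total for s, c in markov["A"].items()})
# {'B': 0.5, 'C': 0.5}  -- the memoryless model is ambiguous at A

prefix = ("A", "B", "A")
total = sum(history[prefix].values())
print({s: c / total for s, c in history[prefix].items()})
# {'C': 1.0}            -- conditioning on history resolves the repetition
```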

[Diagram: state sequence A → B → A → C, with A as a repeated state.]
Recurrent State

We test this with a task requiring the robot to: open the right box, close it, then open the left box. The right box looks identical before opening and after closing, creating a recurrent state. Without memory, π0.5 cannot distinguish these states and gets stuck in loops. Our model remembers the full history and completes the task correctly.

[Videos: π0.5 vs. Ours on the recurrent-state task.]
Counting

In this task, we require the agent to repeatedly wipe the same plate back and forth for three rounds. Each back-and-forth motion brings the agent to visually similar or identical states, creating repeated states throughout the trajectory. Counting therefore requires tracking how many actions have already been performed. Without memory, π0.5 cannot count and instead exhibits random behavior. In contrast, our model accurately tracks the count using its history, allowing it to complete the required number of repetitions reliably.

[Videos: π0.5 (0 times / 4 times) vs. Ours on the counting task.]

Video models are highly data-efficient when adapting to new tasks. With only a few demonstrations, the model can quickly adjust its predictions to match the desired behavior. This few-shot capability greatly reduces the amount of data needed to teach robots new skills, making it much easier to deploy robots in new environments and tasks.

[Charts: few-shot adaptation success rates on RoboTwin and in a real-world experiment, Ours vs. π0.5.]