It's showtime!
LingBot-VA is a versatile generalist that excels across a wide range of settings, from long-horizon tasks and high-precision control to deformable and articulated object manipulation.
Simulation
We evaluate LingBot-VA on two simulation benchmarks, RoboTwin 2.0 and LIBERO, which encompass a variety of manipulation tasks across diverse robot embodiments. Our results demonstrate consistent improvements over state-of-the-art methods.
How it works
LingBot-VA is an autoregressive diffusion framework that architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence. It enables robots to simultaneously reason about future states and execute precise closed-loop control.
Large-scale Pretraining
We pretrain LingBot-VA on large-scale robotics video-action datasets to learn rich visual dynamics, establishing a strong foundation for understanding how the physical world evolves and operating within it.
Framework
Our framework operates in three stages: (1) Autoregressive video generation predicts future frames conditioned on current observations and language instructions; (2) An inverse dynamics model (IDM) decodes actions from the predicted video; (3) After execution, real observations replace the video KV-cache, grounding our video-action model in actual outcomes and enabling closed-loop control.
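A minimal sketch of this predict-decode-execute loop is shown below. All interfaces here (`model.predict_frames`, `idm.decode_actions`, `model.replace_video_kv_cache`, `env`) are hypothetical placeholders used for illustration, not the actual LingBot-VA API.

```python
# Illustrative closed-loop control sketch; interfaces are assumed, not LingBot-VA's real API.

def run_episode(model, idm, env, instruction, max_steps=200, horizon=8):
    obs = env.reset()
    for _ in range(0, max_steps, horizon):
        # (1) Autoregressively predict the next `horizon` frames,
        #     conditioned on the current observation and the instruction.
        predicted_frames = model.predict_frames(obs, instruction, horizon)

        # (2) Decode an action chunk from the predicted video with the IDM.
        actions = idm.decode_actions(obs, predicted_frames)

        # (3) Execute, then overwrite the video KV-cache with the frames
        #     actually observed, grounding future predictions in reality.
        real_frames = [env.step(a) for a in actions]
        model.replace_video_kv_cache(real_frames)
        obs = real_frames[-1]

        if env.task_done():
            break
```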
Our inverse dynamics model (IDM) accurately decodes actions from predicted videos, generalizing well across diverse environments and embodiments.
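For intuition only, an inverse dynamics model can be thought of as a network that maps a pair of consecutive frames to the action connecting them. The sketch below is a generic PyTorch stand-in under that assumption; it does not reproduce the actual LingBot-VA IDM architecture.

```python
import torch
import torch.nn as nn

class SimpleIDM(nn.Module):
    """Generic inverse dynamics model: (frame_t, frame_t+1) -> action.

    Illustrative stand-in only; the real LingBot-VA IDM is not shown here.
    """
    def __init__(self, action_dim=7, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(            # shared per-frame encoder
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(               # fuse both frames, predict action
            nn.Linear(64 * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        z = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=-1)
        return self.head(z)
```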
Prediction vs Reality
The predicted video (left) closely matches the real observation (right) after executing the decoded actions, demonstrating our model's accurate world modeling capability.
Why choose a video model?
Autoregressive video models show distinctive strengths, most notably long-term memory and sample efficiency. This is why we believe video models can serve as a fresh, independent foundation for robot learning.
We use a simple setup that clearly tests whether the model truly has memory. In this task, the model must make decisions in sequences that contain repeated states. If the model relies only on the current observation and has no memory, it can easily get confused by repeated states and lose track of where it is in the sequence.
For example, in the state sequence A → B → A → C, a memoryless model cannot tell whether it is seeing A for the first or second time. As a result, it learns P(next∣A)=0.5 for both B and C, which can cause it to take incorrect transitions or get stuck in loops. In contrast, with access to full history, our model can distinguish the same state under different contexts and learn that after A → B → A, the next state should be C, i.e., P(C∣A→B→A)=1. This allows the model to reliably complete the sequence without getting confused by repeated states.
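This effect is easy to reproduce with a toy count over the sequence A → B → A → C. The snippet below is purely illustrative and independent of any robot model: it contrasts single-step transition statistics with history-conditioned ones.

```python
from collections import Counter, defaultdict

sequence = ["A", "B", "A", "C"]

# Memoryless model: estimate P(next | current) from single-step counts.
memoryless = defaultdict(Counter)
for cur, nxt in zip(sequence, sequence[1:]):
    memoryless[cur][nxt] += 1
# memoryless["A"] -> {"B": 1, "C": 1}, i.e. P(B|A) = P(C|A) = 0.5

# History-conditioned model: condition on the full prefix instead.
with_history = defaultdict(Counter)
for i, nxt in enumerate(sequence[1:], start=1):
    with_history[tuple(sequence[:i])][nxt] += 1
# with_history[("A", "B", "A")] -> {"C": 1}, i.e. P(C | A→B→A) = 1
```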
We test this with a task requiring the robot to: open the right box, close it, then open the left box. The right box looks identical before opening and after closing, creating a recurrent state. Without memory, π0.5 cannot distinguish these states and gets stuck in loops. Our model remembers the full history and completes the task correctly.
In this task, we require the agent to repeatedly wipe the same plate back and forth for three rounds. Each back-and-forth motion brings the agent to visually similar or identical states, creating repeated states throughout the trajectory. Counting therefore requires tracking how many actions have already been performed. Without memory, π0.5 cannot count and instead exhibits random behavior. In contrast, our model accurately tracks the count using its history, allowing it to complete the required number of repetitions reliably.
Video models are highly data-efficient when adapting to new tasks. With only a few demonstrations, the model can quickly adjust its predictions to match the desired behavior. This few-shot capability greatly reduces the amount of data needed to teach robots new skills, making it much easier to deploy robots in new environments and tasks.
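As a rough picture of what few-shot adaptation looks like in practice, the sketch below fine-tunes a pretrained video-action model on a handful of demonstrations. The `model.compute_loss` interface and the structure of `demos` are assumptions for illustration, not the LingBot-VA training code.

```python
import torch

def few_shot_finetune(model, demos, epochs=10, lr=1e-5):
    """Fine-tune a pretrained video-action model on a few demonstrations.

    `model` and `demos` are hypothetical placeholders: each demo is assumed
    to provide (observations, instruction, actions) for one trajectory.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for obs, instruction, actions in demos:
            loss = model.compute_loss(obs, instruction, actions)  # assumed API
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```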