It's showtime!
LingBot-VA is a versatile generalist that excels across a wide range of settings, from long-horizon tasks and high-precision control to deformable and articulated object manipulation.
Simulation
We evaluate LingBot-VA on two simulation benchmarks, RoboTwin 2.0 and LIBERO, which encompass a variety of manipulation tasks across diverse robot embodiments. Our results demonstrate consistent improvements over state-of-the-art methods.
How it works
LingBot-VA is an autoregressive diffusion framework that architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence. It enables robots to simultaneously reason about future states and execute precise closed-loop control.
Large-scale Pretraining
We pretrain LingBot-VA on large-scale robotics video-action datasets to learn rich visual dynamics, establishing a strong foundation for understanding how the physical world evolves and operating within it.
Framework
Our framework operates in three stages: (1) Autoregressive video generation predicts future frames conditioned on current observations and language instructions; (2) An inverse dynamics model (IDM) decodes actions from the predicted video; (3) After execution, real observations replace the video KV-cache, grounding our video-action model in actual outcomes and enabling closed-loop control.
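A minimal sketch of this predict-decode-execute loop is shown below. All interfaces here (`model.predict_frames`, `idm.decode_actions`, `model.replace_video_kv_cache`, `env`) are hypothetical placeholders used for illustration, not the actual LingBot-VA API.

```python
# Illustrative closed-loop control sketch; interfaces are assumed, not LingBot-VA's real API.

def run_episode(model, idm, env, instruction, max_steps=200, horizon=8):
    obs = env.reset()
    for _ in range(0, max_steps, horizon):
        # (1) Autoregressively predict the next `horizon` frames,
        #     conditioned on the current observation and the instruction.
        predicted_frames = model.predict_frames(obs, instruction, horizon)

        # (2) Decode an action chunk from the predicted video with the IDM.
        actions = idm.decode_actions(obs, predicted_frames)

        # (3) Execute, then overwrite the video KV-cache with the frames
        #     actually observed, grounding future predictions in reality.
        real_frames = [env.step(a) for a in actions]
        model.replace_video_kv_cache(real_frames)
        obs = real_frames[-1]

        if env.task_done():
            break
```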
Our inverse dynamics model (IDM) accurately decodes actions from predicted videos, generalizing well across diverse environments and embodiments.
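For intuition only, an inverse dynamics model can be thought of as a network that maps a pair of consecutive frames to the action connecting them. The sketch below is a generic PyTorch stand-in under that assumption; it does not reproduce the actual LingBot-VA IDM architecture.

```python
import torch
import torch.nn as nn

class SimpleIDM(nn.Module):
    """Generic inverse dynamics model: (frame_t, frame_t+1) -> action.

    Illustrative stand-in only; the real LingBot-VA IDM is not shown here.
    """
    def __init__(self, action_dim=7, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(            # shared per-frame encoder
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(               # fuse both frames, predict action
            nn.Linear(64 * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        z = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=-1)
        return self.head(z)
```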
Prediction vs Reality
The predicted video (left) closely matches the real observation (right) after executing the decoded actions, demonstrating our model's accurate world modeling capability.
Why choose a video model?
Autoregressive video models show distinctive strengths, most notably long-term memory and sample efficiency. This is why we believe video models can serve as a fresh, independent foundation for robot learning.
We use a simple setup that clearly tests whether the model truly has memory. In this task, the model must make decisions in sequences that contain repeated states. If the model relies only on the current observation and has no memory, it can easily get confused by repeated states and lose track of where it is in the sequence.
For example, in the state sequence A → B → A → C, a memoryless model cannot tell whether it is seeing A for the first or second time. As a result, it learns P(next∣A)=0.5 for both B and C, which can cause it to take incorrect transitions or get stuck in loops. In contrast, with access to full history, our model can distinguish the same state under different contexts and learn that after A → B → A, the next state should be C, i.e., P(C∣A→B→A)=1. This allows the model to reliably complete the sequence without getting confused by repeated states.
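This effect is easy to reproduce with a toy count over the sequence A → B → A → C. The snippet below is purely illustrative and independent of any robot model: it contrasts single-step transition statistics with history-conditioned ones.

```python
from collections import Counter, defaultdict

sequence = ["A", "B", "A", "C"]

# Memoryless model: estimate P(next | current) from single-step counts.
memoryless = defaultdict(Counter)
for cur, nxt in zip(sequence, sequence[1:]):
    memoryless[cur][nxt] += 1
# memoryless["A"] -> {"B": 1, "C": 1}, i.e. P(B|A) = P(C|A) = 0.5

# History-conditioned model: condition on the full prefix instead.
with_history = defaultdict(Counter)
for i, nxt in enumerate(sequence[1:], start=1):
    with_history[tuple(sequence[:i])][nxt] += 1
# with_history[("A", "B", "A")] -> {"C": 1}, i.e. P(C | A→B→A) = 1
```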
We test this with a task requiring the robot to: open the right box, close it, then open the left box. The right box looks identical before opening and after closing, creating a recurrent state. Without memory, π0.5 cannot distinguish these states and gets stuck in loops. Our model remembers the full history and completes the task correctly.
In this task, we require the agent to repeatedly wipe the same plate back and forth for three rounds. Each back-and-forth motion brings the agent to visually similar or identical states, creating repeated states throughout the trajectory. Counting therefore requires tracking how many actions have already been performed. Without memory, π0.5 cannot count and instead exhibits random behavior. In contrast, our model accurately tracks the count using its history, allowing it to complete the required number of repetitions reliably.
Video models are highly data-efficient when adapting to new tasks. With only a few demonstrations, the model can quickly adjust its predictions to match the desired behavior. This few-shot capability greatly reduces the amount of data needed to teach robots new skills, making it much easier to deploy robots in new environments and tasks.
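As a rough picture of what few-shot adaptation looks like in practice, the sketch below fine-tunes a pretrained video-action model on a handful of demonstrations. The `model.compute_loss` interface and the structure of `demos` are assumptions for illustration, not the LingBot-VA training code.

```python
import torch

def few_shot_finetune(model, demos, epochs=10, lr=1e-5):
    """Fine-tune a pretrained video-action model on a few demonstrations.

    `model` and `demos` are hypothetical placeholders: each demo is assumed
    to provide (observations, instruction, actions) for one trajectory.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for obs, instruction, actions in demos:
            loss = model.compute_loss(obs, instruction, actions)  # assumed API
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```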