ICML 2026  ·  Robot Learning  ·  Vision-Language-Action

FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation

Learning new manipulation skills from a handful of demonstrations

Duc Minh Nguyen*1,2, Nghiem Tuong Diep*1,2, Binh Gia Nguyen*1,2, Trong-Bao Ho1, Doanh Le2, Tan Q. Nguyen1, Thien-Loc Ha1, Nhiem Tran1, Bao Thach1,3, Nhat X. Tran1, Tuan A. Tran4, Artur Habuda5, Philip Lund Møller5, Tran Nguyen Le5, Daniel Sonntag4,6, Mathias Niepert7,8, Khoa D. Doan2, Vu Duong2, Hung Quoc Ngo1, Minh N. Vu1,2, Duy M. H. Nguyen†4,7,8, An Thai Le†1,2, Ngo Anh Vien†1,2

* Equal contribution  ·  † Senior Authors

1 VinRobotics, Vietnam  ·  2 Center for AI Research, VinUniversity, Vietnam  ·  3 University of Utah, USA
4 German Research Center for Artificial Intelligence (DFKI)  ·  5 Technical University of Denmark
6 University of Oldenburg  ·  7 University of Stuttgart  ·  8 Max Planck Research School for Intelligent Systems (IMPRS-IS)

Figure 1 Overview of FOCA. Our framework injects future-oriented conditioning into VLA adaptation through explicit future interaction prediction and implicit alignment to future goals, enabling data-efficient learning and action-free co-training with video world models.
Abstract

Data-efficient adaptation through future-oriented reasoning.

Can robots learn new skills from only a handful of demonstrations?

Despite impressive progress, today's Vision-Language-Action (VLA) models struggle in this setting. We show that performance drops sharply as training data becomes scarce, exposing a critical weakness of current few-shot adaptation methods.

FOCA addresses this challenge by teaching robots to reason about future interactions rather than simply imitate actions. By combining explicit future prediction with implicit alignment to future goals, FOCA adapts efficiently, supports long-horizon decision making, and accommodates action-free co-training with video world models through synthetic video supervision. The result is a simple and scalable framework that achieves state-of-the-art performance across simulated and real-world robot manipulation tasks.

Method

Explicit prediction. Implicit alignment. Better adaptation.

Figure 2 Video overview of FOCA. We predict task-grounded future interactions and align them with future goals to enable data-efficient adaptation and action-free learning.
Key Idea 01

Explicit Future Prediction

FOCA predicts task-grounded future interaction embeddings in latent space using representative tokens and a lightweight decoder. By focusing on robot-object interactions rather than the entire scene, it captures anticipated outcomes while remaining robust to task-irrelevant content.
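To make the idea of latent-space future prediction concrete, here is a minimal pure-Python sketch. It is not FOCA's implementation: the mean pooling over representative tokens, the single linear layer standing in for the "lightweight decoder", the mean-squared-error loss, and every function name are assumptions chosen for illustration.

```python
# Hypothetical sketch: pooling, the linear "decoder", and the MSE loss are
# illustrative assumptions, not FOCA's actual architecture or objective.

def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def linear_decode(vec, weight, bias):
    """Single linear layer: out[j] = sum_i weight[j][i] * vec[i] + bias[j]."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def future_prediction_loss(interaction_tokens, weight, bias, future_embedding):
    """Predict a future interaction embedding and score it against a target."""
    pooled = mean_pool(interaction_tokens)
    predicted = linear_decode(pooled, weight, bias)
    return mse(predicted, future_embedding)


# Toy usage with 2-D embeddings and an identity decoder.
tokens = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = future_prediction_loss(tokens, identity, [0.0, 0.0], [0.5, 0.5])
```

In this toy setup the target would come from an encoder applied to a future observation; how FOCA actually builds that target is not specified on this page.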

Key Idea 02

Implicit Future Alignment

FOCA aligns interaction tokens with future goal observations through an implicit future-conditioning objective. This enables long-horizon reasoning and can be interpreted as learning value-like representations of future task completion.
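One simple way to picture an alignment objective of this kind is a cosine-similarity loss between interaction tokens and an embedding of a future goal observation. The sketch below is an assumption for illustration only; the page does not state which similarity measure or loss FOCA uses, and the function names are hypothetical.

```python
import math

# Hypothetical sketch: cosine alignment is an illustrative stand-in for the
# implicit future-conditioning objective, which this page does not specify.

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(interaction_tokens, goal_embedding):
    """1 - mean cosine similarity; 0 when every token points at the goal."""
    sims = [cosine_similarity(tok, goal_embedding) for tok in interaction_tokens]
    return 1.0 - sum(sims) / len(sims)


aligned = alignment_loss([[1.0, 0.0]], [2.0, 0.0])      # same direction
orthogonal = alignment_loss([[1.0, 0.0]], [0.0, 3.0])   # unrelated direction
```

Under this reading, a low loss means the current interaction tokens already "point toward" the goal, which is one intuition for the value-like interpretation mentioned above.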

Key Idea 03

Action-Free Supervision

FOCA naturally supports action-free co-training with synthetic videos generated by video world models. Unlike methods that require pseudo-actions or inverse dynamics, FOCA can learn directly from future visual trajectories.
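The practical consequence is that a training batch can mix robot demonstrations (which have action labels) with synthetic videos (which do not). A minimal sketch of such a mixed loss follows; the dictionary keys, the `future_weight` parameter, and the simple averaging are hypothetical choices for illustration, not the authors' training recipe.

```python
# Hypothetical sketch: samples without actions contribute only the
# future-oriented term, so no pseudo-actions are needed for them.

def cotraining_loss(samples, future_weight=1.0):
    """Combine action and future losses over a mixed batch.

    Each sample is a dict with a "future_loss" float and an "action_loss"
    that is a float for real demonstrations or None for action-free video.
    """
    future_terms = [s["future_loss"] for s in samples]
    action_terms = [s["action_loss"] for s in samples
                    if s["action_loss"] is not None]

    future_part = sum(future_terms) / len(future_terms)
    # If the batch holds only synthetic video, the action term is skipped.
    action_part = sum(action_terms) / len(action_terms) if action_terms else 0.0
    return action_part + future_weight * future_part


batch = [
    {"future_loss": 0.2, "action_loss": 0.4},   # real demonstration
    {"future_loss": 0.6, "action_loss": None},  # synthetic video, no actions
]
total = cotraining_loss(batch)
```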

Results · Real World

Nine tasks on ALOHA, VRH-3, and UR5 robots

Each clip pairs a failed Base VLA rollout with a successful FOCA rollout. Success rates are listed as Base VLA, then FOCA.

Align the bag and fully open the zipper. 150 demos. Base VLA 25%, FOCA 45%.
Dispense a napkin from a container. 100 demos. Base VLA 20%, FOCA 45%.
Center the plate on the mat, place the bowl inside it, and set the chopsticks on the right. 150 demos. Base VLA 10%, FOCA 45%.
Tie the shoelaces into a secure knot. 150 demos. Base VLA 70%, FOCA 95%.
Pick up a bowl containing bulk material and pour the material into a designated target container. 100 demos. Base VLA 40%, FOCA 50%.
Move the prompted-color test tube to the target. 40 demos. Base VLA 5%, FOCA 30%.
Place the object onto the jig by aligning with the pins. 98 demos. Base VLA 58%, FOCA 84%.

Other Industry Task (success): Place the object onto the jig by aligning with the pins.
Results · Simulation

Several LIBERO and RoboCasa tasks.

Each clip pairs a failed Base VLA rollout with a successful FOCA rollout.

Open the top drawer and place the bowl inside. 50 demos. 94%.
Pick up the alphabet soup and place it in the basket. 50 demos. 96%.
Pick up the BBQ sauce and place it in the basket. 50 demos. 92%.
Pick up the black bowl between the plate and the ramekin and place it on the plate. 50 demos. 95%.
Pick up the black bowl from the table center and place it on the plate. 50 demos. 97%.
Turn off the front center burner of the stove. 50 demos. 91%.
Pick the mango from the cabinet and place it on the counter. 50 demos. 93%.
Pick the mug from the counter and place it under the coffee machine dispenser. 50 demos. 90%.
Comparison

FOCA vs. a wide range of VLA models trained on the full (100%) data of the LIBERO benchmark

Method                                                Avg   10    Goal  Object  Spatial
Diff. Policy (Chi et al., RSS 2023)                   72.4  50.5  68.3  92.5    78.3
Octo (Ghosh et al., arXiv 2024)                       75.1  51.1  84.6  85.7    78.9
Open-VLA (Kim et al., CoRL 2024)                      76.5  53.7  79.2  88.4    84.7
Spatial-VLA (Qu et al., RSS 2025)                     78.1  55.5  78.6  89.9    88.2
CoT-VLA (Zhao et al., CVPR 2025)                      81.1  69.0  87.6  91.6    87.5
DreamVLA (Zhang et al., NeurIPS 2025)                 92.6  89.5  89.5  94.0    97.5
Groot-N1.0 (Bjorck et al., NVIDIA 2025)               93.9  90.6  93.0  97.6    94.4
Groot-N1.5 (Bjorck et al., NVIDIA 2025)               94.6  92.8  92.8  98.4    94.4
EO-1 (Qu et al., arXiv 2026)                          94.1  91.4  98.6  96.6    89.8
Think-Act (Huang et al., NVIDIA 2025)                 84.4  70.9  87.1  91.4    88.3
SmolVLA (Shukor et al., Hugging Face 2025)            92.5  82.0  96.0  99.0    93.0
π₀ Fast (Pertsch et al., Physical Intelligence 2025)  85.5  60.2  88.6  96.8    96.4
π₀ (Black et al., Physical Intelligence 2025)         94.6  90.0  95.4  98.2    94.6
▶ FOCA (Ours)                                         96.6  92.4  97.4  99.8    97.0
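As a quick sanity check on how to read the table, the snippet below recomputes FOCA's average from its four suite scores, assuming the Avg column is the unweighted mean of the 10, Goal, Object, and Spatial columns. The variable and function names are illustrative.

```python
# Per-suite success rates (%) for FOCA, copied from the table above.
foca_suites = {"10": 92.4, "Goal": 97.4, "Object": 99.8, "Spatial": 97.0}


def suite_average(scores):
    """Unweighted mean over the benchmark suites."""
    return sum(scores.values()) / len(scores)


foca_avg = suite_average(foca_suites)
print(f"FOCA mean over four suites: {foca_avg:.2f}")
```

The mean comes out at about 96.65, in line with the 96.6 reported in the Avg column.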
Few-shot Adaptation

Performance under limited demonstration budgets on LIBERO (top) and RoboCasa (bottom)

Method                                         LIBERO 40% Data                LIBERO 10% Data
                                               Avg   10    Object  Spatial    Avg   10    Object  Spatial
π₀ (Black et al., Physical Intelligence 2025)  89.9  82.0  95.2    89.6       77.6  59.0  80.6    83.4
Groot-N1.5 (Bjorck et al., NVIDIA 2025)        91.4  84.5  98.9    90.6       78.2  62.6  85.7    85.7
EO-1 (Qu et al., arXiv 2026)                   91.0  88.4  96.0    86.8       82.2  65.0  89.6    83.0
SmolVLA (Shukor et al., Hugging Face 2025)     90.3  80.0  96.0    90.0       77.3  51.3  86.0    81.0
▶ FOCA (Ours)                                  94.0  88.0  99.6    93.6       85.3  69.4  90.4    89.4
Figure 10 FOCA's generalization performance across π₀ and GR00T N1.5 on RoboCasa, with 30 demos (top) and 100 demos (bottom) on the five most challenging tasks.
Comparison

Comparison with general and task-specific PEFT methods for VLA adaptation on LIBERO

Method                               100%  40%   10%
Control-VLA (Li et al., CoRL 2025)   95.6  91.3  78.4
LoRA (r=64) (Hu et al., ICLR 2022)   94.2  90.2  78.2
DoRA (r=64) (Liu et al., ICML 2024)  94.7  92.0  78.6
▶ FOCA (Ours)                        96.6  94.0  85.3
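One way to read this table is to look at how much each method loses when the data budget shrinks from 100% to 10%. The short script below does that arithmetic on the numbers above; the dictionary layout and function name are illustrative.

```python
# Success rates (%) at 100% and 10% data, copied from the table above.
peft_results = {
    "Control-VLA": {"100%": 95.6, "10%": 78.4},
    "LoRA (r=64)": {"100%": 94.2, "10%": 78.2},
    "DoRA (r=64)": {"100%": 94.7, "10%": 78.6},
    "FOCA": {"100%": 96.6, "10%": 85.3},
}


def low_data_drop(scores):
    """Absolute drop in percentage points from 100% data to 10% data."""
    return round(scores["100%"] - scores["10%"], 1)


drops = {name: low_data_drop(scores) for name, scores in peft_results.items()}
smallest_drop = min(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name}: -{drop} points")
```

By this measure the three PEFT baselines lose roughly 16 to 17 points, while FOCA loses about 11.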
Comparison

Performance comparison between FOCA variants and pseudo-actions learned via IGM from DreamGen-generated synthetic videos

Data scale  π₀ baseline  IGM   FOCA Implicit  FOCA + DreamGen
40% data    89.9         90.2  93.0           95.7
10% data    77.6         76.8  83.6           86.4
Citation

BibTeX

@inproceedings{foca2026,
    title={FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation},
    author={Nguyen, Duc Minh and Diep, Nghiem Tuong and Nguyen, Binh Gia and Ho, Trong-Bao and Le, Doanh and Nguyen, Tan Q. and Ha, Thien-Loc and Tran, Nhiem and Thach, Bao and Tran, Nhat X. and Tran, Tuan A. and Habuda, Artur and Møller, Philip Lund and Le, Tran Nguyen and Sonntag, Daniel and Niepert, Mathias and Doan, Khoa D. and Duong, Vu and Ngo, Hung Quoc and Vu, Minh N. and Nguyen, Duy M. H. and Le, An Thai and Vien, Ngo Anh},
    booktitle={International Conference on Machine Learning (ICML)},
    year={2026}
}