ICML 2026 Robot Learning Vision-Language-Action

FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation

Learning new manipulation skills from a handful of demonstrations

Duc Minh Nguyen^*1,2, Nghiem Tuong Diep^*1,2, Binh Gia Nguyen^*1,2, Trong-Bao Ho¹, Doanh Le², Tan Q. Nguyen¹, Thien-Loc Ha¹, Nhiem Tran¹, Bao Thach^1,3, Nhat X. Tran¹, Tuan A. Tran⁴, Artur Habuda⁵, Philip Lund Møller⁵, Tran Nguyen Le⁵, Daniel Sonntag^4,6, Mathias Niepert^7,8, Khoa D. Doan², Vu Duong², Hung Quoc Ngo¹, Minh N. Vu^1,2, Duy M. H. Nguyen^†4,7,8, An Thai Le^†1,2, Ngo Anh Vien^†1,2

* Equal contribution · † Senior Authors

¹ VinRobotics, Vietnam · ² Center for AI Research, VinUniversity, Vietnam · ³ University of Utah, USA
⁴ German Research Center for Artificial Intelligence (DFKI) · ⁵ Technical University of Denmark
⁶ University of Oldenburg · ⁷ University of Stuttgart · ⁸ Max Planck Research School for Intelligent Systems (IMPRS-IS)


Figure 1 Overview of FOCA. Our framework injects future-oriented conditioning into VLA adaptation through explicit future interaction prediction and implicit alignment to future goals, enabling data-efficient learning and action-free co-training with video world models.

Abstract

Data-efficient adaptation through future-oriented reasoning.

Can robots learn new skills from only a handful of demonstrations?

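Despite impressive progress, today's Vision-Language-Action (VLA) models struggle in this setting. We show that performance drops sharply as training data becomes scarce: on LIBERO, for example, the average success rate of π₀ falls from 94.6% with the full dataset to 77.6% with 10% of the data. This exposes a critical weakness of current few-shot adaptation methods.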

FOCA addresses this challenge by teaching robots to reason about future interactions rather than simply imitate actions. By combining explicit prediction of future interactions with implicit alignment to future goals, FOCA adapts efficiently, supports long-horizon decision making, and allows action-free co-training on synthetic videos generated by video world models. The result is a simple and scalable framework that achieves state-of-the-art performance in simulation and on real-world robot manipulation tasks, reaching 96.6% average success on LIBERO with the full dataset and 85.3% with only 10% of the data.

Method

Explicit prediction. Implicit alignment. Better adaptation.

Figure 2 Video overview of FOCA. We predict task-grounded future interactions and align them with future goals to enable data-efficient adaptation and action-free learning.

Key Idea 01

Explicit Future Prediction

FOCA predicts task-grounded future interaction embeddings in latent space using representative tokens and a lightweight decoder. By focusing on robot-object interactions rather than the entire scene, it captures anticipated outcomes while remaining robust to task-irrelevant content.
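
The idea above can be illustrated with a minimal sketch, under stated assumptions. This page does not specify the architecture or the loss, so every name here is hypothetical: `masked_mean`, the `interaction_mask`, a single linear layer standing in for the "lightweight decoder", and a mean-squared-error objective. The sketch only shows how a prediction target can be built from the future patches that cover the robot-object interaction, so that background patches do not affect the loss.

```python
# Illustrative only: names, shapes, and the MSE loss are assumptions,
# not details taken from the FOCA paper.

def masked_mean(patches, mask):
    """Average the embeddings whose mask entry is truthy."""
    kept = [p for p, m in zip(patches, mask) if m]
    dim = len(kept[0])
    return [sum(p[i] for p in kept) / len(kept) for i in range(dim)]

def linear(x, weight, bias):
    """y = W x + b, a stand-in for a lightweight decoder."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def explicit_future_loss(rep_tokens, weight, bias, future_patches, interaction_mask):
    # Pool the representative tokens produced at the current step.
    pooled = masked_mean(rep_tokens, [1] * len(rep_tokens))
    pred = linear(pooled, weight, bias)
    # Target: future-frame embeddings restricted to the interaction region.
    target = masked_mean(future_patches, interaction_mask)
    return mse(pred, target)

rep_tokens = [[1.0, 0.0], [0.0, 1.0]]
weight, bias = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]
future_patches = [[1.0, 1.0], [9.0, 9.0], [1.0, 1.0]]
interaction_mask = [1, 0, 1]  # the middle patch is treated as background

loss = explicit_future_loss(rep_tokens, weight, bias, future_patches, interaction_mask)
loss_full_scene = explicit_future_loss(rep_tokens, weight, bias, future_patches, [1, 1, 1])
```

In this toy setup the background patch changes the target only when the mask includes it, which is the sense in which an interaction-focused target is robust to task-irrelevant content.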

Key Idea 02

Implicit Future Alignment

FOCA aligns interaction tokens with future goal observations through an implicit future-conditioning objective. This enables long-horizon reasoning and can be interpreted as learning value-like representations of future task completion.
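
One way to picture such an alignment objective is a cosine-similarity loss between interaction tokens and an embedding of the future goal observation. This is a hedged sketch, not the objective used in the paper: the choice of cosine similarity, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math

# Illustrative only: the cosine objective and all names are assumptions.

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def alignment_loss(interaction_tokens, goal_embedding):
    """1 - cosine similarity to the goal, averaged over interaction tokens."""
    sims = [cosine(tok, goal_embedding) for tok in interaction_tokens]
    return 1.0 - sum(sims) / len(sims)

goal = [0.0, 1.0]                       # embedding of a future goal observation
aligned = [[0.0, 2.0], [0.0, 0.5]]      # tokens pointing toward the goal
misaligned = [[1.0, 0.0], [3.0, 0.0]]   # tokens orthogonal to the goal

loss_aligned = alignment_loss(aligned, goal)
loss_misaligned = alignment_loss(misaligned, goal)
```

The loss is lower when the tokens point toward the goal embedding, which gives a rough intuition for the "value-like" reading: the similarity score behaves like a scalar estimate of progress toward task completion.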

Key Idea 03

Action-Free Supervision

FOCA naturally supports action-free co-training with synthetic videos generated by video world models. Unlike methods that require pseudo-actions or inverse dynamics, FOCA can learn directly from future visual trajectories.
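
A minimal sketch of how action-free co-training could be organized, assuming each sample already carries a scalar future-conditioning loss and, for robot demonstrations only, a scalar action loss. The `Sample` class, the `future_weight` parameter, and the averaging scheme are hypothetical and not taken from the paper. The sketch shows that synthetic videos can contribute to training without any pseudo-action labels.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Illustrative only: class, field, and parameter names are assumptions.

@dataclass
class Sample:
    future_loss: float                    # available for any video, real or synthetic
    action_loss: Optional[float] = None   # only available for robot demonstrations

def batch_loss(samples: Sequence[Sample], future_weight: float = 1.0) -> float:
    """Average loss over a mixed batch of demonstrations and action-free videos."""
    total = 0.0
    for s in samples:
        total += future_weight * s.future_loss
        if s.action_loss is not None:     # synthetic videos skip the action term
            total += s.action_loss
    return total / len(samples)

demo = Sample(future_loss=0.25, action_loss=0.5)   # real demonstration with actions
synthetic = Sample(future_loss=0.25)               # generated video, no actions

mixed = batch_loss([demo, synthetic])
video_only = batch_loss([synthetic, synthetic])
```

Because the future-conditioning term is defined on observations alone, a batch made entirely of synthetic videos still yields a training signal.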

Results · Real World

Nine tasks on ALOHA, VRH-3, and UR5 robots

Each task pairs a video of a failed Base VLA rollout with a video of a successful FOCA rollout. Success rates are listed for both models.

Task	Demos	Base VLA	FOCA
Align the bag and fully open the zipper.	150	25%	45%
Dispense a napkin from a container.	100	20%	45%
Center the plate on the mat, place the bowl inside it, and set the chopsticks on the right.	150	10%	45%
Tie the shoelaces into a secure knot.	150	70%	95%
Pick up a bowl containing bulk material and pour the material into a designated target container.	100	40%	50%
Move the prompted-color test tube to the target.	40	5%	30%
Place the object onto the jig by aligning with the pins.	98	58%	84%

Other industry task (success): Place the object onto the jig by aligning with the pins.

Results · Simulation

Several LIBERO and RoboCasa tasks.

Each task pairs a video of a failed Base VLA rollout with a video of a successful FOCA rollout.

Task	Demos	Success rate
Open the top drawer and place the bowl inside.	50	94%
Pick up the alphabet soup and place it in the basket.	50	96%
Pick up the BBQ sauce and place it in the basket.	50	92%
Pick up the black bowl between the plate and the ramekin and place it on the plate.	50	95%
Pick up the black bowl from the table center and place it on the plate.	50	97%
Turn off the front center burner of the stove.	50	91%
Pick the mango from the cabinet and place it on the counter.	50	93%
Pick the mug from the counter and place it under the coffee machine dispenser.	50	90%

Comparison

FOCA vs. a wide range of VLA models trained on the full dataset (100%) of the LIBERO benchmark

Method	Avg	LIBERO-10	Goal	Object	Spatial
Diffusion Policy (Chi et al., RSS 2023)	72.4	50.5	68.3	92.5	78.3
Octo (Ghosh et al., arXiv 2024)	75.1	51.1	84.6	85.7	78.9
OpenVLA (Kim et al., CoRL 2024)	76.5	53.7	79.2	88.4	84.7
SpatialVLA (Qu et al., RSS 2025)	78.1	55.5	78.6	89.9	88.2
CoT-VLA (Zhao et al., CVPR 2025)	81.1	69.0	87.6	91.6	87.5
DreamVLA (Zhang et al., NeurIPS 2025)	92.6	89.5	89.5	94.0	97.5
GR00T N1 (Bjorck et al., NVIDIA 2025)	93.9	90.6	93.0	97.6	94.4
GR00T N1.5 (Bjorck et al., NVIDIA 2025)	94.6	92.8	92.8	98.4	94.4
EO-1 (Qu et al., arXiv 2026)	94.1	91.4	98.6	96.6	89.8
ThinkAct (Huang et al., NVIDIA 2025)	84.4	70.9	87.1	91.4	88.3
SmolVLA (Shukor et al., Hugging Face 2025)	92.5	82.0	96.0	99.0	93.0
π₀-FAST (Pertsch et al., Physical Intelligence 2025)	85.5	60.2	88.6	96.8	96.4
π₀ (Black et al., Physical Intelligence 2025)	94.6	90.0	95.4	98.2	94.6
▶ FOCA (Ours)	96.6	92.4	97.4	99.8	97.0

Few-shot Adaptation

Performance under limited demonstration budgets on LIBERO (top) and RoboCasa (bottom)

Method	LIBERO 40% Data				LIBERO 10% Data
	Avg	LIBERO-10	Object	Spatial	Avg	LIBERO-10	Object	Spatial
π₀ (Black et al., Physical Intelligence 2025)	89.9	82.0	95.2	89.6	77.6	59.0	80.6	83.4
GR00T N1.5 (Bjorck et al., NVIDIA 2025)	91.4	84.5	98.9	90.6	78.2	62.6	85.7	85.7
EO-1 (Qu et al., arXiv 2026)	91.0	88.4	96.0	86.8	82.2	65.0	89.6	83.0
SmolVLA (Shukor et al., Hugging Face 2025)	90.3	80.0	96.0	90.0	77.3	51.3	86.0	81.0
▶ FOCA (Ours)	94.0	88.0	99.6	93.6	85.3	69.4	90.4	89.4

Figure 10 FOCA's generalization performance across π₀ and GR00T N1.5 on RoboCasa, with 30 demos (top) and 100 demos (bottom) on the five most challenging tasks.

Comparison

Comparison with general and task-specific parameter-efficient fine-tuning (PEFT) methods for VLA adaptation on LIBERO, at 100%, 40%, and 10% of the training data

Method	100%	40%	10%
Control-VLA (Li et al., CoRL 2025)	95.6	91.3	78.4
LoRA (r=64) (Hu et al., ICLR 2022)	94.2	90.2	78.2
DoRA (r=64) (Liu et al., ICML 2024)	94.7	92.0	78.6
▶ FOCA (ours)	96.6	94.0	85.3
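
To make the data-efficiency comparison concrete, the short script below computes how many points each method loses when going from 100% to 10% of the data. The numbers are copied from the table above; the dictionary layout and variable names are only illustrative.

```python
# Success rates (%) copied from the PEFT comparison table above.
results = {
    "Control-VLA": {"100%": 95.6, "40%": 91.3, "10%": 78.4},
    "LoRA (r=64)": {"100%": 94.2, "40%": 90.2, "10%": 78.2},
    "DoRA (r=64)": {"100%": 94.7, "40%": 92.0, "10%": 78.6},
    "FOCA": {"100%": 96.6, "40%": 94.0, "10%": 85.3},
}

# Points lost when the training data shrinks from 100% to 10%.
drops = {name: round(r["100%"] - r["10%"], 1) for name, r in results.items()}

# Method with the smallest degradation.
most_robust = min(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name}: -{drop} points")
```

The drops come out to 17.2 points for Control-VLA, 16.0 for LoRA, 16.1 for DoRA, and 11.3 for FOCA, so FOCA degrades the least under the 10% budget in this table.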

Comparison

Performance comparison between FOCA variants and pseudo-actions learned via IGM from DreamGen-generated synthetic videos

Data scale	π₀ baseline	IGM	FOCA Implicit	FOCA + DreamGen
40% data	89.9	90.2	93.0	95.7
10% data	77.6	76.8	83.6	86.4

Citation

BibTeX

@inproceedings{foca2026,
    title={FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation},
    author={Nguyen, Duc Minh and Diep, Nghiem Tuong and Nguyen, Binh Gia and Ho, Trong-Bao and Le, Doanh and Nguyen, Tan Q. and Ha, Thien-Loc and Tran, Nhiem and Thach, Bao and Tran, Nhat X. and Tran, Tuan A. and Habuda, Artur and Møller, Philip Lund and Le, Tran Nguyen and Sonntag, Daniel and Niepert, Mathias and Doan, Khoa D. and Duong, Vu and Ngo, Hung Quoc and Vu, Minh N. and Nguyen, Duy M. H. and Le, An Thai and Vien, Ngo Anh},
    booktitle={International Conference on Machine Learning (ICML)},
    year={2026}
}