
Abstract
As AI shifts from passive models to active agents, the key bottleneck has become data—its quality, coverage, and controllability. Constructing labeled datasets, whether for per-pixel segmentation or multi-step user interaction, is slow, expensive, and privacy-sensitive. At the same time, internet-scale corpora often lack the grounded supervision and action traces required for agentic tasks. Synthetic data offers a compelling alternative: it is abundant, controllable, and can be precisely aligned with target tasks, enabling safer and more scalable performance gains.
In this talk, I will present my research on synthetic data generation spanning perception and agentic tasks. I will begin with image-centric efforts, from interactive data creation to game-based pipelines, before introducing on-demand approaches such as Neural-Sim for task-aware image generation. I will then discuss recent work on synthetic pipelines for long-horizon agentic tasks. Collectively, these efforts will demonstrate how synthetic data can accelerate the development of robust, interpretable, and safe AI systems across both digital and physical domains.
For more info, please follow this link.