
Abstract
Vision Foundation Models (VFMs) have revolutionized computer vision, achieving remarkable generalization across diverse 2D image tasks. However, building general-purpose intelligent agents requires perception that goes beyond static 2D pixels: integrating language, 3D spatial reasoning, and temporal dynamics. In this talk, I will discuss how we can extend VFMs along these three critical dimensions. First, I will introduce Language-Guided VFMs, which leverage natural language as an interface to enhance visual reasoning. Next, I will present 3D-Enabled VFMs, which bridge the gap between 2D vision models and the real-world 3D environments where intelligent agents operate. Finally, I will explore Dynamic-Aware VFMs, which incorporate temporal understanding for video and 4D scene reasoning. Throughout the talk, I will highlight key challenges, present novel approaches, and discuss the impact of these advances on embodied AI, robotics, and multimodal intelligence. I will conclude by outlining future directions for enabling general-purpose vision agents that see, understand, and interact with the world in a human-like manner.