Workshop on Video Large Language Models (VidLLMs)

Invited Speakers & Keynotes

Dr. Afshin Dehghan

Apple

Senior AIML Manager at Apple, leading the Multimodal Intelligence Team in the Hardware Technology group.

Topic:
Advancing Video Understanding: From Training-Free to Streaming Video LLM

Abstract

We trace the progression from training-free methods to architectures designed specifically for streaming video, with the goal of enabling real-time, proactive assistants. We will discuss strategies for adapting existing VLMs to handle temporal reasoning and large-scale multimodal integration, both with and without fine-tuning. The talk will highlight recent work from our group, including SF-LLaVA, SF-LLaVA 1.5, and StreamBridge.

Dr. Chelsea Finn

Stanford University

Expert in reinforcement learning, with pioneering work on end-to-end deep learning for robotics and the alignment of LLMs.

Topic:
Developing Steerable, Generalizable Vision-Language-Action Models

Abstract

TBA

Prof. Cordelia Schmid

Google, INRIA

Renowned for her contributions to large-scale visual learning and retrieval, and multi-modal representation learning.

Topic:
Video reasoning and grounding: methods & benchmarks

Abstract

TBA

Prof. Fahad Khan

MBZUAI & Linköping University

Co-author of Video-ChatGPT and PG-Video-LLaVA, with expertise in multi-modal learning and large-scale computer vision tasks.

Topic:
Towards Detailed Video Understanding in Generative AI Era

Abstract

Machine perception, the ability to understand the visual world from sensor inputs such as cameras, is one of the central problems in Artificial Intelligence. Recent years have witnessed tremendous progress through the development of large multimodal models (LMMs) for video perception tasks, with real-world applications in, e.g., robotics, autonomous driving, and surveillance. In this talk, I will present our recent results on large multimodal models for videos. I will first discuss enriching video LMMs with detailed visual semantics to achieve spatio-temporal pixel grounding and pointing capabilities conditioned on textual queries. Next, I will present our findings on evaluating video LMMs in a culturally diverse multilingual setting. Finally, I will discuss developing efficient video LMMs for resource-constrained devices.

Prof. Mohamed Elhoseiny

KAUST

Associate Professor at KAUST, focusing on computer vision, especially zero/few-shot learning, vision-language models, and creative AI.

Topic:
Towards Scalable and Structured Understanding in Video LLMs

Abstract

TBA

Prof. Mohit Bansal

UNC, Amazon

Professor and Director of the MURGe-Lab at UNC Chapel Hill, specializing in natural language processing and multimodal AI, with recent work focusing on video understanding, vision-language integration, and LLM reasoning.

Topic:
Generative Video LLMs: Planning Agents and Multimodal Composition

Abstract

TBA

Panel Discussion: VidLLMs vs Expert Models

Moderator: Dr. Mubarak Shah (UCF & Amazon)

Panelists: Salman Khan (Mohamed bin Zayed University of AI), René Vidal (UPenn & Amazon), Haoqi Fan (ByteDance Seed)

This discussion will explore the strengths and limitations of VidLLMs compared to specialized expert models in computer vision. Are VidLLMs the future, or do domain-specific systems still reign supreme in accuracy?