Workshop on Video Large Language Models (VidLLMs)

Workshop Program

Morning Session

8:30 AM – 8:40 AM Chairs' Opening Remarks
8:40 AM – 9:10 AM Keynote Talk 1: Chelsea Finn, Stanford University
Developing Steerable, Generalizable Vision-Language-Action Models
9:10 AM – 9:40 AM Keynote Talk 2: Mohamed Elhoseiny, KAUST
Towards Scalable and Structured Understanding in Video LLMs
9:40 AM – 10:10 AM Keynote Talk 3: Afshin Dehghan, Apple
Advancing Video Understanding: From Training-Free to Streaming Video LLM
10:10 AM – 10:30 AM Break
10:30 AM – 11:00 AM Keynote Talk 4: Cordelia Schmid, Google & INRIA
Video Reasoning and Grounding: Methods & Benchmarks
11:00 AM – 11:30 AM Keynote Talk 5: Mohit Bansal, UNC & Amazon
Generative Video LLMs: Planning Agents and Multimodal Composition
11:30 AM – 12:00 PM Keynote Talk 6: Fahad Khan, MBZUAI & Linköping University
Towards Detailed Video Understanding in the Generative AI Era
12:00 PM – 1:00 PM Lunch Break

Afternoon Session

1:00 PM – 2:15 PM Panel Discussion: VidLLMs vs. Expert Models
Moderator: Mubarak Shah (UCF & Amazon)
Panel: Salman Khan (MBZUAI), René Vidal (UPenn & Amazon), Haoqi Fan (ByteDance Seed)
2:15 PM – 2:30 PM Break
2:30 PM – 2:45 PM Track 1 Winner Presentation
Composed Video Retrieval (CoVR) Challenge
2:45 PM – 3:00 PM Track 2 Winner Presentation
Complex Video Reasoning & Robustness Evaluation (CVRR) Challenge
3:00 PM – 3:15 PM Track 3 Winner Presentation
Multilingual Challenge
3:15 PM – 3:30 PM Oral Paper Presentation
How Important are Videos for Training Video LLMs?
George Lydakis¹, Alexander Hermans¹, Ali Athar², Daan de Geus¹,³, Bastian Leibe¹
¹RWTH Aachen University, ²ByteDance Seed, ³Eindhoven University of Technology
3:30 PM – 3:40 PM Poster Session Preparation
3:40 PM – 5:00 PM Poster Session
Exhibit Hall D, Poster Boards #334–#373