PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Shehan Munasinghe¹, Rusiru Thushara¹, Muhammad Maaz¹, Hanoona Abdul Rasheed¹, Salman Khan¹,², Mubarak Shah³, Fahad Shahbaz Khan¹,⁴
¹Mohamed bin Zayed University of AI, ²Australian National University, ³University of Central Florida, ⁴Linköping University
Number it: Temporal Grounding Videos like Flipping Manga
Yongliang Wu¹,²,⁴, Xinting Hu³, Yuyang Sun¹,², Yizhou Zhou⁴, Wenbo Zhu⁵, Fengyun Rao⁴, Bernt Schiele³, Xu Yang¹,²
¹Southeast University, China, ²Key Laboratory of New Generation Artificial Intelligence Technology & Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, ³Max Planck Institute for Informatics, Saarland Informatics Campus, Germany, ⁴WeChat Vision, Tencent Inc., China, ⁵University of California, Berkeley, USA
Lost in Time: A New Temporal Benchmark for Video LLMs
Daniel Cores¹, Michael Dorkenwald², Manuel Mucientes¹, Cees G. M. Snoek², Yuki M Asano³
¹CiTIUS, University of Santiago de Compostela, ²QUVA Lab, University of Amsterdam, ³Fundamental AI Lab, University of Technology Nuremberg
Moment Sampling in Video LLMs for Long-Form Video QA
Mustafa Chasmai¹, Gauri Jagatap², Gouthaman KV², Grant Van Horn¹, Subhransu Maji¹, Andrea Fanelli²
¹University of Massachusetts Amherst, ²Dolby Laboratories