| 1/8/24 | Lecture-1-Introduction [Video of Presentation] | Dr. Shah | |
| 1/10/24 | Lecture-2-Transformers Introduction [Video of Presentation] | Dr. Shah | |
| 1/15/24 | Martin Luther King Jr. Day (No Class) | | |
| 1/17/24 | Lecture-3-CLIP [Video of Presentation] | Dr. Shah | |
| 1/22/24 | Lecture-4-Visual-Language Models Introduction Part-I: CoCa, PaLI [Video of Presentation] | Dr. Shah | |
| 1/24/24 | Lecture-5-Visual-Language Models Introduction Part-II: Flamingo, FLAVA, Painter, BLIP-2 [Video of Presentation] | Dr. Shah | |
| 1/29/24 | Lecture-6-Visual-Language Models Introduction Part-III: ImageBind, LanguageBind, LLaVA [Video of Presentation] | Dr. Shah | |
| 1/31/24 | Lecture-7-Visual-Language Models Introduction Part-IV: Video-ChatGPT, PG-Video-LLaVA [Video of Presentation] | Dr. Shah | |
| 2/5/24 | Paper-1 FILIP: Fine-grained Interactive Language-Image Pre-Training [Presentation PDF] [Video of Presentation] | Group-1: David Shatwell Pittaluga, Anthony Bilic, Kevin Zhai, Zain Ulabedeen Farhat, Kunyang Li | |
| 2/7/24 | Paper-2 HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention [Presentation PDF] [Video of Presentation] | Group-8: Anantapadmanaabha Prasannakumar, Abdulrahman Al Sumaih, Andrew Ballen, Zhen Hao Sia, Nicholas Tidwell | |
| 2/12/24 | Paper-3 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [Presentation PDF] [Video of Presentation] | Group-4: Michael Cruz, Christopher Lee, Saurabh Aggarwal, Robert Martin, Taylor Tiedge | |
| 2/14/24 | Paper-4 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [Presentation PDF] [Video of Presentation] | Group-8: Anantapadmanaabha Prasannakumar, Abdulrahman Al Sumaih, Andrew Ballen, Zhen Hao Sia, Nicholas Tidwell | |
| 2/19/24 | Paper-5 MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [Presentation PDF] [Video of Presentation] | Group-5: Adrian Mauricio-Gonzalez, Cesar Hernandez, Ehtesamul Azim, Jatin Bharati | |
| 2/21/24 | Paper-6 MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound [Presentation PDF] [Video of Presentation] | Group-6: Reeshoon Sayera, Shoumik Ghosh, Ifty Rezwan, Xiao Hang Wang, Xitong Li | |
| 2/26/24 | Paper-7 Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic [Presentation PDF] [Video of Presentation] | Group-2: Charlee Mione, Isaac Tuckey, Mahad Ali, Anthony Jackson, Rafeeq Shodeinde | |
| 2/28/24 | Student Project Presentation Ideas - I | | |
| 3/4/24 | Paper-8 Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [Presentation PDF] [Video of Presentation] | Group-1: David Shatwell Pittaluga, Anthony Bilic, Kevin Zhai, Zain Ulabedeen Farhat, Kunyang Li | |
| 3/6/24 | Paper-9 PG-Video-LLaVA: Pixel Grounding Large Video-Language Models [Presentation PDF] [Video of Presentation] | Group-3: Tyler VanderMate, Ashton Frias, Nicholas Gray, Wen-Kai Chen, Abhinav Kotta | |
| 3/11/24 | Paper-10 Evaluating Object Hallucination in Large Vision-Language Models [Presentation PDF] [Video of Presentation] | Group-6: Reeshoon Sayera, Shoumik Ghosh, Ifty Rezwan, Xiao Hang Wang, Xitong Li | |
| 3/13/24 | Paper-11 Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [Presentation PDF] [Video of Presentation] | Group-3: Tyler VanderMate, Ashton Frias, Nicholas Gray, Wen-Kai Chen, Abhinav Kotta | |
| 3/18/24 | Spring Break | | |
| 3/20/24 | Spring Break | | |
| 3/25/24 | Student Project Presentation Update - 1 | Groups: 8, 7, 6, 5 | |
| 3/27/24 | Student Project Presentation Update - 1 | Groups: 4, 3, 2, 1 | |
| 4/1/24 | Paper-12 CM3Leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [Presentation PDF] [Video of Presentation] | Group-4: Michael Cruz, Christopher Lee, Saurabh Aggarwal, Robert Martin, Taylor Tiedge | |
| 4/3/24 | Paper-13 OWLv2: Scaling Open-Vocabulary Object Detection [Presentation PDF] [Video of Presentation] | Group-7: Suranadi Dodampagamage, Daniel Cisneros, Bradley Racey, Salem Long, Andrew El-Kommos | |
| 4/8/24 | Paper-14 Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [Presentation PDF] [Video of Presentation] | Group-7: Suranadi Dodampagamage, Daniel Cisneros, Bradley Racey, Salem Long, Andrew El-Kommos | |
| 4/10/24 | Student Project Presentation Update - 2 | | |
| 4/15/24 | Student Project Presentation Update - 2 | | |
| 4/17/24 | Paper-15 FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [Presentation PDF] [Video of Presentation] | Group-2: Charlee Mione, Isaac Tuckey, Mahad Ali, Anthony Jackson, Rafeeq Shodeinde | |
| 4/22/24 | Paper-16 MIMIC-IT: Multi-Modal In-Context Instruction Tuning [Presentation PDF] [Video of Presentation] | Group-5: Adrian Mauricio-Gonzalez, Cesar Hernandez, Ehtesamul Azim, Jatin Bharati | |
| 4/24/24 | Final Exam: 1pm - 4pm | | |
Potential Papers:

- FILIP: Fine-grained Interactive Language-Image Pre-Training
- HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
- RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
- Sigmoid Loss for Language Image Pre-Training
- Kosmos-2: Grounding Multimodal Large Language Models to the World
- Kosmos-1: Language Is Not All You Need: Aligning Perception with Language Models
- Flamingo: a Visual Language Model for Few-Shot Learning
- LLaVA: Visual Instruction Tuning
- PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
- CM3Leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- OWLv2: Scaling Open-Vocabulary Object Detection
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- Evaluating Object Hallucination in Large Vision-Language Models
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
- VideoGPT: Video Generation using VQ-VAE and Transformers
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
- UNITER: UNiversal Image-TExt Representation Learning
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
- Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models