CRCV | Center for Research in Computer Vision
University of Central Florida
4328 Scorpius St.
HEC 245D
Orlando, FL 32816-2365
Phone: (407) 823-5077
Fax: (407) 823-0594
Email: shah@crcv.ucf.edu
2024
Wu, Junyi; Wang, Haoxuan; Shang, Yuzhang; Shah, Mubarak; Yan, Yan
PTQ4DiT: Post-training Quantization for Diffusion Transformers Conference
Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
@conference{Wu2024,
title = {PTQ4DiT: Post-training Quantization for Diffusion Transformers},
author = {Junyi Wu and Haoxuan Wang and Yuzhang Shang and Mubarak Shah and Yan Yan},
url = {https://nips.cc/virtual/2024/poster/95445
https://arxiv.org/pdf/2405.16005
https://github.com/adreamwu/PTQ4DiT},
year = {2024},
date = {2024-12-13},
urldate = {2024-12-13},
publisher = {Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
abstract = {The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's ρ-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
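The PTQ4DiT entry above centers on low-bit post-training quantization of weights and activations. As a point of reference only, the following minimal NumPy sketch shows plain per-channel symmetric weight quantization, the kind of W8A8 baseline the paper builds on; the Channel-wise Salience Balancing and timestep calibration steps of PTQ4DiT itself are not reproduced, and all function names are illustrative.

import numpy as np

def quantize_per_channel(weights: np.ndarray, num_bits: int = 8):
    """Symmetric per-output-channel quantization of a weight matrix.

    weights: (out_channels, in_channels) float array.
    Returns integer codes and per-channel scales for dequantization.
    """
    qmax = 2 ** (num_bits - 1) - 1                      # e.g. 127 for 8-bit
    scales = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)         # avoid division by zero
    codes = np.clip(np.round(weights / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(16, 64)).astype(np.float32)
    codes, scales = quantize_per_channel(w, num_bits=8)
    w_hat = dequantize(codes, scales)
    print("mean abs quantization error:", np.abs(w - w_hat).mean())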
Kumar, Aakash; Chen, Chen; Mian, Ajmal; Lobo, Niels; Shah, Mubarak
Sparse Points to Dense Clouds: Enhancing 3D Detection with Limited LiDAR Data Conference
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.
@conference{Kumar2024,
title = {Sparse Points to Dense Clouds: Enhancing 3D Detection with Limited LiDAR Data},
author = {Aakash Kumar and Chen Chen and Ajmal Mian and Niels Lobo and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2404.06715v1.pdf
https://aakashjuseja-aj.github.io/Sparse_to_Dense/},
year = {2024},
date = {2024-10-14},
publisher = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
abstract = {3D detection is a critical task that enables machines to identify and locate objects in three-dimensional space. It has a broad range of applications in several fields, including autonomous driving, robotics and augmented reality. Monocular 3D detection is attractive as it requires only a single camera; however, it lacks the accuracy and robustness required for real-world applications. High-resolution LiDAR, on the other hand, can be expensive and lead to interference problems in heavy traffic given their active transmissions. We propose a balanced approach that combines the advantages of monocular and point cloud-based 3D detection. Our method requires only a small number of 3D points that can be obtained from a low-cost, low-resolution sensor. Specifically, we use only 512 points, which is just 1% of a full LiDAR frame in the KITTI dataset. Our method reconstructs a complete 3D point cloud from this limited 3D information combined with a single image. The reconstructed 3D point cloud and corresponding image can be used by any multi-modal off-the-shelf detector for 3D object detection. By using the proposed network architecture with an off-the-shelf multi-modal 3D detector, the accuracy of 3D detection improves by 20% compared to the state-of-the-art monocular detection methods and 6% to 9% compared to the baseline multi-modal methods on KITTI and JackRabbot datasets.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
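The abstract above hinges on using only 512 LiDAR points, about 1% of a KITTI frame, as input. A minimal sketch of that kind of sparse sampling is shown below; it assumes a point cloud stored as an (N, 4) array of x, y, z, intensity values, and the uniform random sampling is an illustrative stand-in for whatever low-resolution sensor model the paper assumes.

import numpy as np

def sparsify_point_cloud(points: np.ndarray, num_points: int = 512, seed: int = 0) -> np.ndarray:
    """Randomly keep `num_points` points from an (N, 4) LiDAR frame (x, y, z, intensity)."""
    rng = np.random.default_rng(seed)
    if len(points) <= num_points:
        return points.copy()
    idx = rng.choice(len(points), size=num_points, replace=False)
    return points[idx]

if __name__ == "__main__":
    full_frame = np.random.rand(50_000, 4).astype(np.float32)   # stand-in for a ~50k-point KITTI frame
    sparse = sparsify_point_cloud(full_frame, num_points=512)
    print(sparse.shape, f"{len(sparse) / len(full_frame):.1%} of the original points")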
Chhipa, Prakash Chandra; Chippa, Meenakshi Subhash; De, Kanjar; Saini, Rajkumar; Liwicki, Marcus; Shah, Mubarak
Möbius Transform for Mitigating Perspective Distortions in Representation Learning Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{nokey,
title = {Möbius Transform for Mitigating Perspective Distortions in Representation Learning},
author = {Prakash Chandra Chhipa and Meenakshi Subhash Chippa and Kanjar De and Rajkumar Saini and Marcus Liwicki and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/MPD_ECCV2024_CameraReady.pdf
https://prakashchhipa.github.io/projects/mpd
https://youtu.be/MKh9NE_XEMY
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/MPD_presentation_ECCV2024_final.pdf},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Perspective distortion (PD) causes unprecedented changes in shape, size, orientation, angles, and other spatial relationships of visual concepts in images. Precisely estimating camera intrinsic and extrinsic parameters is a challenging task that prevents synthesizing perspective distortion. Non-availability of dedicated training data poses a critical barrier to developing robust computer vision methods. Additionally, distortion correction methods make other computer vision tasks a multi-step approach and lack performance. In this work, we propose mitigating perspective distortion (MPD) by employing a fine-grained parameter control on a specific family of Möbius transform to model real-world distortion without estimating camera intrinsic and extrinsic parameters and without the need for actual distorted data. Also, we present a dedicated perspectively distorted benchmark dataset, ImageNet-PD, to benchmark the robustness of deep learning models against this new dataset. The proposed method outperforms existing benchmarks, ImageNet-E and ImageNet-X. Additionally, it significantly improves performance on ImageNet-PD while consistently performing on standard data distribution. Notably, our method shows improved performance on three PD-affected real-world applications—crowd counting, fisheye image recognition, and person re-identification—and one PD-affected challenging CV task: object detection. The source code, dataset, and models are available on the project webpage at https://prakashchhipa.github.io/projects/mpd.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
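The MPD abstract above is built around warping images with a Möbius transform f(z) = (az + b) / (cz + d) over the complex plane. The NumPy sketch below applies such a warp to an image grid via inverse mapping with nearest-neighbor sampling; the particular parameter family and training pipeline used in the paper are not reproduced, and the parameter values are arbitrary.

import numpy as np

def mobius_warp(image: np.ndarray, a, b, c, d) -> np.ndarray:
    """Warp an (H, W) image with the Mobius map f(z) = (a*z + b) / (c*z + d).

    Pixels are treated as complex numbers z = x + iy normalized to [-1, 1].
    Inverse mapping with nearest-neighbor sampling; parameters are illustrative.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # normalize output coordinates to the complex square [-1, 1] x [-1, 1]
    z = (xs / (w - 1) * 2 - 1) + 1j * (ys / (h - 1) * 2 - 1)
    # inverse of f: f^{-1}(z) = (d*z - b) / (-c*z + a)
    src = (d * z - b) / (-c * z + a)
    sx = np.clip(((src.real + 1) / 2 * (w - 1)).round().astype(int), 0, w - 1)
    sy = np.clip(((src.imag + 1) / 2 * (h - 1)).round().astype(int), 0, h - 1)
    return image[sy, sx]

if __name__ == "__main__":
    img = np.random.rand(64, 64)
    warped = mobius_warp(img, a=1.0, b=0.1 + 0.05j, c=0.1j, d=1.0)
    print(warped.shape)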
Kang, Weitai; Liu, Gaowen; Shah, Mubarak; Yan, Yan
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Kang2024,
title = {SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding},
author = {Weitai Kang and Gaowen Liu and Mubarak Shah and Yan Yan},
url = {https://arxiv.org/pdf/2407.03200
},
doi = {https://doi.org/10.48550/arXiv.2407.03200},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box per text-image pair provides only sparse supervision signals. Although previous works achieve impressive results, their passive utilization of annotation, i.e. the sole use of the box annotation as regression ground truth, results in a suboptimal performance. In this paper, we present SegVG, a novel method that transfers the box-level annotation as segmentation signals to provide an additional pixel-level supervision for Visual Grounding. Specifically, we propose the Multi-layer Multi-task Encoder-Decoder as the target grounding stage, where we learn a regression query and multiple segmentation queries to ground the target by regression and segmentation of the box in each decoding layer, respectively. This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation. Moreover, as the backbones are typically initialized by pretrained parameters learned from unimodal tasks and the queries for both regression and segmentation are static learnable embeddings, a domain discrepancy remains among these three types of features, which impairs subsequent target grounding. To mitigate this discrepancy, we introduce the Triple Alignment module, where the query, text, and vision tokens are triangularly updated to share the same space by a triple attention mechanism. Extensive experiments on five widely used datasets validate our state-of-the-art (SOTA) performance.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
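The SegVG abstract describes supervising the same box annotation twice, once as a regression target and once as a pixel-level segmentation mask rasterized from the box. The sketch below illustrates that idea with an L1 box loss plus a Dice loss on the rasterized box; the loss names, weighting, and mask rasterization are assumptions for illustration, not the paper's exact formulation.

import numpy as np

def box_to_mask(box, h, w):
    """Rasterize a (x1, y1, x2, y2) box into a binary (h, w) mask."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    mask = np.zeros((h, w), dtype=np.float32)
    mask[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)] = 1.0
    return mask

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    inter = (pred_mask * gt_mask).sum()
    return 1.0 - (2 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

def grounding_loss(pred_box, gt_box, pred_mask, h, w, seg_weight=1.0):
    """L1 box regression plus Dice segmentation against the rasterized ground-truth box."""
    l1 = np.abs(np.asarray(pred_box, dtype=np.float32) - np.asarray(gt_box, dtype=np.float32)).mean()
    gt_mask = box_to_mask(gt_box, h, w)
    return l1 + seg_weight * dice_loss(pred_mask, gt_mask)

if __name__ == "__main__":
    h = w = 64
    gt_box = (10, 12, 40, 50)
    pred_box = (12, 10, 42, 48)
    pred_mask = box_to_mask(pred_box, h, w)          # stand-in for a predicted soft mask
    print("loss:", grounding_loss(pred_box, gt_box, pred_mask, h, w))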
Dave, Ishan Rajendrakumar; Rizve, Mamshad Nayeem; Shah, Mubarak
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Dave2024,
title = {FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition},
author = {Ishan Rajendrakumar Dave and Mamshad Nayeem Rizve and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/finepsuedo_eccv24_dave.pdf
https://daveishan.github.io/finepsuedo-webpage/
https://youtu.be/bWOd8_JpjQs?si=WWRDhdg5ADWL0uwB},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annotate, existing semi-supervised action recognition has mainly focused on coarse-grained action recognition. Since fine-grained actions are more challenging due to the absence of scene bias, classifying these actions requires an understanding of action-phases. Hence, existing coarse-grained semi-supervised methods do not work effectively. In this work, we for the first time thoroughly investigate semi-supervised fine-grained action recognition (FGAR). We observe that alignment distances like dynamic time warping (DTW) provide a suitable action-phase-aware measure for comparing fine-grained actions, a concept previously unexploited in FGAR. However, since regular DTW distance is pairwise and assumes strict alignment between pairs, it is not directly suitable for classifying fine-grained actions. To utilize such alignment distances in a limited-label setting, we propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs. Our learnable alignability score provides a better phase-aware measure, which we use to refine the pseudo-labels of the primary video encoder. Our collaborative pseudolabeling-based framework ‘FinePseudo’ significantly outperforms prior methods on four fine-grained action recognition datasets: Diving48, FineGym99, FineGym288, and FineDiving, and shows improvement on existing coarse-grained datasets: Kinetics400 and Something-SomethingV2. We also demonstrate the robustness of our collaborative pseudo-labeling in handling novel unlabeled classes in open-world semi-supervised setups.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
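The FinePseudo abstract builds on dynamic time warping (DTW) as a phase-aware distance between fine-grained actions. For reference, a textbook DTW over per-frame feature sequences is sketched below with NumPy and a Euclidean frame cost; the learnable alignability score proposed in the paper is not reproduced here.

import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Dynamic time warping distance between (T1, D) and (T2, D) frame-feature sequences."""
    t1, t2 = len(seq_a), len(seq_b)
    cost = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=-1)  # (T1, T2) pairwise frame costs
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[t1, t2])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_a = rng.normal(size=(32, 128))   # 32 frames, 128-dim features
    video_b = rng.normal(size=(40, 128))
    print("DTW distance:", dtw_distance(video_a, video_b))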
Swetha, Sirnam; Yang, Jinyu; Neiman, Tal; Rizve, Mamshad Nayeem; Tran, Son; Yao, Benjamin; Chilimbi, Trishul; Shah, Mubarak
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Swetha2024,
title = {X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs},
author = {Sirnam Swetha and Jinyu Yang and Tal Neiman and Mamshad Nayeem Rizve and Son Tran and Benjamin Yao and Trishul Chilimbi and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2407.13851v1.pdf
https://arxiv.org/abs/2407.13851
https://swetha5.github.io/XFormer/},
doi = {https://doi.org/10.48550/arXiv.2407.13851},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field involves the utilization of a vision encoder derived from vision-language contrastive learning (CL), showing expertise in capturing overall representations while facing difficulties in capturing detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency and detailed visual representations, obtained through masked image modeling (MIM), with semantically-enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former which is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative interaction mechanism. Specifically, X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM. To demonstrate the effectiveness of our approach, we assess its performance on tasks demanding detailed visual understanding. Extensive evaluations indicate that X-Former excels in visual reasoning tasks involving both structural and semantic categories in the GQA dataset. Assessment on fine-grained visual perception benchmark further confirms its superior capabilities in visual understanding.
},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Gupta, Rohit; Rizve, Mamshad Nayeem; Tawari, Ashish; Unnikrishnan, Jayakrishnan; Tran, Son; Shah, Mubarak; Yao, Benjamin; Chilimbi, Trishul
Open Vocabulary Multi-Label Video Classification Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Gupta2024,
title = {Open Vocabulary Multi-Label Video Classification},
author = {Rohit Gupta and Mamshad Nayeem Rizve and Ashish Tawari and Jayakrishnan Unnikrishnan and Son Tran and Mubarak Shah and Benjamin Yao and Trishul Chilimbi},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/OVMLVidCLS_ECCV_2024_CameraReady-2.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/OVMLVidCLS_ECCV_2024_Supplementary.pdf
https://arxiv.org/html/2407.09073v1#S1},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
abstract = {Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP’s vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
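The abstract above frames the task as scoring a video against an open vocabulary of labels rather than picking a single class. The sketch below shows only that decision rule on generic, pre-computed embeddings: cosine similarity per label passed through an independent sigmoid with a temperature, then thresholded. The LLM-generated soft attributes and the temporal module from the paper are not included, and the vocabulary, dimensions, and threshold are placeholders.

import numpy as np

def multilabel_scores(video_emb: np.ndarray, label_embs: np.ndarray, temperature: float = 0.07):
    """Cosine similarity of one video embedding against an arbitrary label vocabulary, mapped to [0, 1]."""
    v = video_emb / np.linalg.norm(video_emb)
    lab = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = lab @ v
    return 1.0 / (1.0 + np.exp(-sims / temperature))   # independent sigmoid per label

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=512)
    vocabulary = ["person cooking", "frying pan", "kitchen", "dog", "skateboarding"]
    label_embs = rng.normal(size=(len(vocabulary), 512))     # stand-in for text-encoder outputs
    scores = multilabel_scores(video, label_embs)
    predicted = [name for name, s in zip(vocabulary, scores) if s > 0.5]
    print(dict(zip(vocabulary, scores.round(3))), predicted)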
Dave, Ishan Rajendrakumar; Heilbron, Fabian Caba; Shah, Mubarak; Jenni, Simon
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets Conference
The 18th European Conference on Computer Vision ECCV 2024, Oral (Top 3%), 2024.
@conference{Dave2024b,
title = {Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets},
author = {Ishan Rajendrakumar Dave and Fabian Caba Heilbron and Mubarak Shah and Simon Jenni},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/avr_eccv24_dave.pdf
https://daveishan.github.io/avr-webpage/
https://youtu.be/6euQwz7XdQk?si=v12dQH4e7UrTUrIU},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024, Oral (Top 3%)},
abstract = {Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
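The AVR abstract above evaluates candidates with cycle-consistency: map each query frame to its nearest candidate frame and back, and measure how far it lands from where it started. A minimal NumPy version of that check on frame-feature sequences is sketched below; the DRAQ indicator and the learned frame features from the paper are not reproduced.

import numpy as np

def cycle_consistency_error(query_feats: np.ndarray, cand_feats: np.ndarray) -> float:
    """Average |i - i''| after mapping query frame i -> nearest candidate frame -> nearest query frame."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    sim = q @ c.T                                   # (Tq, Tc) cosine similarities
    to_cand = sim.argmax(axis=1)                    # query frame -> best candidate frame
    back = sim.argmax(axis=0)                       # candidate frame -> best query frame
    returned = back[to_cand]                        # where each query frame lands after the round trip
    return float(np.abs(np.arange(len(q)) - returned).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=(30, 256))
    candidates = [rng.normal(size=(45, 256)) for _ in range(5)]
    errors = [cycle_consistency_error(query, c) for c in candidates]
    print("best alignable candidate:", int(np.argmin(errors)), errors)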
Yang, Paiyu; Akhtar, Naveed; Shah, Mubarak; Mian, Ajmal
Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Yang2024,
title = {Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density},
author = {Paiyu Yang and Naveed Akhtar and Mubarak Shah and Ajmal Mian},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/main_paper-1.pdf
https://arxiv.org/pdf/2407.04370
https://github.com/ypeiyu/input_density_reg},
year = {2024},
date = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Pillai, Manu S; Rizve, Mamshad Nayeem; Shah, Mubarak
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Pillai2024,
title = {GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers},
author = {Manu S Pillai and Mamshad Nayeem Rizve and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/07875-supp.pdf
https://arxiv.org/abs/2408.02840
https://github.com/manupillai308/GAReT},
doi = {https://doi.org/10.48550/arXiv.2408.02840},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods use camera and odometry data, typically absent in real-world scenarios. They utilize multiple adjacent frames and various encoders for feature extraction, resulting in high computational costs. Moreover, these approaches independently predict each street-view frame's location, resulting in temporally inconsistent GPS trajectories. To address these challenges, in this work, we propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data. We introduce GeoAdapter, a transformer-adapter module designed to efficiently aggregate image-level representations and adapt them for video inputs. Specifically, we train a transformer encoder on video frames and aerial images, then freeze the encoder to optimize the GeoAdapter module to obtain video-level representation. To address temporally inconsistent trajectories, we introduce TransRetriever, an encoder-decoder transformer model that predicts GPS locations of street-view frames by encoding top-k nearest neighbor predictions per frame and auto-regressively decoding the best neighbor based on the previous frame's predictions. Our method's effectiveness is validated through extensive experiments, demonstrating state-of-the-art performance on benchmark datasets.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
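The GAReT abstract introduces TransRetriever, an encoder-decoder transformer that decodes a temporally consistent GPS trajectory from the top-k retrieved neighbors of each frame. As a much simpler stand-in for that idea, the sketch below uses Viterbi-style dynamic programming to pick one of k candidates per frame while penalizing large jumps between consecutive frames; this is not the paper's transformer model, only an illustration of the temporal-consistency objective, and all values are synthetic.

import numpy as np

def consistent_trajectory(candidates: np.ndarray, match_scores: np.ndarray, smooth: float = 1.0):
    """Pick one of k GPS candidates per frame so the trajectory stays smooth.

    candidates: (T, k, 2) lat/lon candidates per frame; match_scores: (T, k) retrieval scores.
    Viterbi-style DP: maximize total score minus a penalty on consecutive jumps.
    """
    t, k, _ = candidates.shape
    best = match_scores[0].copy()
    back = np.zeros((t, k), dtype=int)
    for i in range(1, t):
        jump = np.linalg.norm(candidates[i][None, :, :] - candidates[i - 1][:, None, :], axis=-1)  # (k_prev, k_cur)
        total = best[:, None] - smooth * jump                 # transition cost from each previous candidate
        back[i] = total.argmax(axis=0)
        best = total.max(axis=0) + match_scores[i]
    # backtrack the best path
    path = [int(best.argmax())]
    for i in range(t - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    path.reverse()
    return candidates[np.arange(t), path]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cands = rng.uniform(-0.01, 0.01, size=(20, 5, 2)) + np.array([28.6, -81.2])   # 20 frames, 5 neighbors each
    scores = rng.uniform(size=(20, 5))
    print(consistent_trajectory(cands, scores).shape)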
Kulkarni, Parth Parag; Nayak, Gaurav Kumar; Shah, Mubarak
CityGuessr: City-Level Video Geo-Localization on a Global Scale Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Kulkarni2024,
title = {CityGuessr: City-Level Video Geo-Localization on a Global Scale},
author = {Parth Parag Kulkarni and Gaurav Kumar Nayak and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/CityGuessr.pdf
https://parthpk.github.io/cityguessr-webpage/
},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Video geolocalization is a crucial problem in current times. Given just a video, ascertaining where it was captured from can have a plethora of advantages. The problem of worldwide geolocalization has been tackled before, but only using the image modality. Its video counterpart remains relatively unexplored. Meanwhile, video geolocalization has also garnered some attention in the recent past, but the existing methods are all restricted to specific regions. This motivates us to explore the problem of video geolocalization at a global scale. Hence, we propose a novel problem of worldwide video geolocalization with the objective of hierarchically predicting the correct city, state/province, country, and continent, given a video. However, no large scale video datasets that have extensive worldwide coverage exist to train models for solving this problem. To this end, we introduce a new dataset, “CityGuessr68k”, comprising 68,269 videos from 166 cities all over the world. We also propose a novel baseline approach to this problem, by designing a transformer-based architecture comprising an elegant “Self-Cross Attention” module for incorporating scenes as well as a “TextLabel Alignment” strategy for distilling knowledge from text labels in feature space. To further enhance our location prediction, we also utilize soft-scene labels. Finally, we demonstrate the performance of our method on our new dataset as well as Mapillary (MSLS) [38].},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
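The CityGuessr abstract predicts location hierarchically at the city, state/province, country, and continent levels from a shared video representation. A minimal PyTorch-style sketch of that output structure is shown below with four linear heads over one pooled feature; the transformer backbone, Self-Cross Attention module, and text-label alignment are not reproduced, and the state/country class counts are placeholders (only the 166 cities come from the abstract).

import torch
import torch.nn as nn

class HierarchicalGeoHead(nn.Module):
    """Four classification heads (city, state, country, continent) over one shared video feature."""
    def __init__(self, feat_dim=768, n_city=166, n_state=500, n_country=100, n_continent=7):
        super().__init__()
        self.heads = nn.ModuleDict({
            "city": nn.Linear(feat_dim, n_city),
            "state": nn.Linear(feat_dim, n_state),
            "country": nn.Linear(feat_dim, n_country),
            "continent": nn.Linear(feat_dim, n_continent),
        })

    def forward(self, video_feat):
        # one logit vector per hierarchy level, all computed from the same pooled feature
        return {level: head(video_feat) for level, head in self.heads.items()}

if __name__ == "__main__":
    model = HierarchicalGeoHead()
    feat = torch.randn(4, 768)                      # batch of 4 pooled video features
    logits = model(feat)
    dummy_targets = torch.zeros(4, dtype=torch.long)
    loss = sum(nn.functional.cross_entropy(l, dummy_targets) for l in logits.values())
    print({k: tuple(v.shape) for k, v in logits.items()}, float(loss))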
Rizve, Mamshad Nayeem; Fei, Fan; Unnikrishnan, Jayakrishnan; Tran, Son; Yao, Benjamin Z.; Zeng, Belinda; Shah, Mubarak; Chilimbi, Trishul
VidLA: Video-Language Alignment at Scale Conference
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
@conference{Rizve2024,
title = {VidLA: Video-Language Alignment at Scale},
author = {Mamshad Nayeem Rizve and Fan Fei and Jayakrishnan Unnikrishnan and Son Tran and Benjamin Z. Yao and Belinda Zeng and Mubarak Shah and Trishul Chilimbi},
url = {https://arxiv.org/abs/2403.14870},
year = {2024},
date = {2024-06-17},
urldate = {2024-06-17},
publisher = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Ristea, Nicolae Catalin; Croitoru, Florinel Alin; Ionescu, Radu Tudor; Popescu, Marius; Khan, Fahad; Shah, Mubarak
Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors Conference
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
@conference{Ristea2024,
title = {Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors},
author = {Nicolae Catalin Ristea and Florinel Alin Croitoru and Radu Tudor Ionescu and Marius Popescu and Fahad Khan and Mubarak Shah},
url = {https://arxiv.org/abs/2306.12041
https://github.com/ristea/aed-mae},
year = {2024},
date = {2024-06-17},
urldate = {2024-06-17},
publisher = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First, we introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects. Second, we integrate a teacher decoder and a student decoder into our architecture, leveraging the discrepancy between the outputs given by the two decoders to improve anomaly detection. Third, we generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model, as demonstrated by the extensive experiments carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy, obtaining competitive AUC scores, while processing 1655 FPS. Hence, our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design. },
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
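The first contribution in the abstract above is weighting tokens by motion gradients so reconstruction focuses on moving foreground patches rather than the static background. The rough NumPy sketch below approximates that idea with absolute frame differencing pooled per patch and normalized into token weights; it is not the exact gradient formulation used in the paper.

import numpy as np

def motion_token_weights(prev_frame: np.ndarray, frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Per-patch weights from absolute temporal differences of two grayscale (H, W) frames."""
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    h, w = diff.shape
    gh, gw = h // patch, w // patch
    patches = diff[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    motion = patches.mean(axis=(1, 3))                 # (gh, gw) mean motion magnitude per token
    weights = motion / (motion.sum() + 1e-8)
    return weights.reshape(-1)                         # one weight per token, summing to 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0 = rng.integers(0, 255, size=(224, 224)).astype(np.float32)
    f1 = f0.copy()
    f1[64:128, 64:128] += 50                           # simulate a moving foreground region
    w = motion_token_weights(f0, f1)
    print(w.shape, w.argmax())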
Thawakar, Omkar Chakradhar; Naseer, Muzammal; Anwer, Rao Muhammad; Khan, Salman; Felsberg, Michael; Shah, Mubarak; Khan, Fahad
Composed Video Retrieval via Enriched Context and Discriminative Embeddings Conference
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
@conference{Thawakar2024,
title = {Composed Video Retrieval via Enriched Context and Discriminative Embeddings},
author = {Omkar Chakradhar Thawakar and Muzammal Naseer and Rao Muhammad Anwer and Salman Khan and Michael Felsberg and Mubarak Shah and Fahad Khan },
url = {https://arxiv.org/abs/2403.16997
https://github.com/OmkarThawakar/composed-video-retrieval},
year = {2024},
date = {2024-06-17},
urldate = {2024-06-17},
abstract = {Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and only represents the target video using visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative embeddings of vision only, text only and vision-text for better alignment to accurately retrieve matched target videos. Our proposed framework can be flexibly employed for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CovR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1 score.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Dutta, Aritra; Das, Srijan; Nielsen, Jacob; Chakraborty, Rajatsubhra; Shah, Mubarak
Multiview Aerial Visual RECognition (MAVREC) Dataset: Can Multi-view Improve Aerial Visual Perception? Conference
IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
@conference{Dutta2024,
title = {Multiview Aerial Visual RECognition (MAVREC) Dataset: Can Multi-view Improve Aerial Visual Perception?},
author = {Aritra Dutta and Srijan Das and Jacob Nielsen and Rajatsubhra Chakraborty and Mubarak Shah},
url = {https://arxiv.org/abs/2312.04548
https://mavrec.github.io/},
year = {2024},
date = {2024-06-17},
urldate = {2024-06-17},
publisher = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Despite the commercial abundance of UAVs, aerial data acquisition remains challenging, and the existing Asia and North America-centric open-source UAV datasets are small-scale or low-resolution and lack diversity in scene contextuality. Additionally, the color content of the scenes, solar-zenith angle, and population density of different geographies influence the data diversity. These two factors conjointly render suboptimal aerial-visual perception of the deep neural network (DNN) models trained primarily on the ground-view data, including the open-world foundational models. To pave the way for a transformative era of aerial detection, we present Multiview Aerial Visual RECognition or MAVREC, a video dataset where we record synchronized scenes from different perspectives -- ground camera and drone-mounted camera. MAVREC consists of around 2.5 hours of industry-standard 2.7K resolution video sequences, more than 0.5 million frames, and 1.1 million annotated bounding boxes. This makes MAVREC the largest ground and aerial-view dataset, and the fourth largest among all drone-based datasets across all modalities and tasks. Through our extensive benchmarking on MAVREC, we recognize that augmenting object detectors with ground-view images from the corresponding geographical location is a superior pre-training strategy for aerial detection. Building on this strategy, we benchmark MAVREC with a curriculum-based semi-supervised object detection approach that leverages labeled (ground and aerial) and unlabeled (only aerial) images to enhance the aerial detection.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Garcia, Gustavo; Aparcedo, Alejandro; Nayak, Gaurav Kumar; Ahmed, Tanvir; Shah, Mubarak; Li, Mengjie
Generalized Deep Learning Model for Photovoltaic Module Segmentation from Satellite and Aerial Imagery Journal Article
In: Solar Energy, vol. 274, 2024.
@article{Garcia2024,
title = {Generalized Deep Learning Model for Photovoltaic Module Segmentation from Satellite and Aerial Imagery},
author = {Gustavo Garcia and Alejandro Aparcedo and Gaurav Kumar Nayak and Tanvir Ahmed and Mubarak Shah and Mengjie Li },
url = {https://www.sciencedirect.com/science/article/pii/S0038092X24002330},
doi = {https://doi.org/10.1016/j.solener.2024.112539},
year = {2024},
date = {2024-05-15},
journal = {Solar Energy},
volume = {274},
abstract = {As solar photovoltaic (PV) has emerged as a dominant player in the energy market, there has been an exponential surge in solar deployment and investment within this sector. With the rapid growth of solar energy adoption, accurate and efficient detection of PV panels has become crucial for effective solar energy mapping and planning. This paper presents the application of the Mask2Former model for segmenting PV panels from a diverse, multi-resolution dataset of satellite and aerial imagery. Our primary objective is to harness Mask2Former’s deep learning capabilities to achieve precise segmentation of PV panels in real-world scenarios. We fine-tune the pre-existing Mask2Former model on a carefully curated multi-resolution dataset and a crowdsourced dataset of satellite and aerial images, showcasing its superiority over other deep learning models like U-Net and DeepLabv3+. Most notably, Mask2Former establishes a new state-of-the-art in semantic segmentation by achieving over 95% IoU scores. Our research contributes significantly to the advancement of solar energy mapping and sets a benchmark for future studies in this field.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
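The article above reports segmentation quality as intersection-over-union (IoU) scores above 95%. For reference, a minimal IoU computation for binary masks is sketched below; it is independent of Mask2Former itself, and the masks are synthetic.

import numpy as np

def binary_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Intersection-over-union between two boolean masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float((intersection + eps) / (union + eps))

if __name__ == "__main__":
    gt = np.zeros((256, 256), dtype=bool)
    gt[50:150, 60:180] = True                      # ground-truth PV panel region
    pred = np.zeros_like(gt)
    pred[55:150, 60:175] = True                    # slightly offset prediction
    print(f"IoU = {binary_iou(pred, gt):.3f}")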
Dave, Ishan Rajendrakumar; de Blegiers, Tristan; Chen, Chen; Shah, Mubarak
CodaMal: Contrastive Domain Adaptation for Malaria Detection in Low-Cost Microscopes Conference
31st IEEE International Conference on Image Processing (ICIP) (Oral), 2024.
@conference{Dave2024c,
title = {CodaMal: Contrastive Domain Adaptation for Malaria Detection in Low-Cost Microscopes },
author = {Ishan Rajendrakumar Dave and Tristan de Blegiers and Chen Chen and Mubarak Shah},
url = {https://arxiv.org/pdf/2402.10478
https://daveishan.github.io/codamal-webpage/},
year = {2024},
date = {2024-02-16},
publisher = {31st IEEE International Conference on Image Processing (ICIP) (Oral)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
2023
Dave, Ishan Rajendrakumar; Jenni, Simon; Shah, Mubarak
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision Conference
AAAI Conference on Artificial Intelligence (AAAI), Main Technical Track, 2023.
@conference{nokey,
title = {No More Shortcuts: Realizing the Potential of Temporal Self-Supervision},
author = {Ishan Rajendrakumar Dave and Simon Jenni and Mubarak Shah},
url = {https://arxiv.org/pdf/2312.13008
https://daveishan.github.io/nms-webpage/
https://youtu.be/5MBnxmMBQh0},
year = {2023},
date = {2023-12-20},
publisher = {AAAI Conference on Artificial Intelligence (AAAI), Main Technical Track},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Cepeda, Vicente Vivanco; Nayak, Gaurav Kumar; Shah, Mubarak
GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization Conference
Thirty-seventh Conference on Neural Information Processing Systems, 2023.
@conference{Cepeda2023,
title = {GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization},
author = {Vicente Vivanco Cepeda and Gaurav Kumar Nayak and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/GeoCLIP_camera_ready_paper.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/GeoCLIP_camera_ready_supplementary.pdf
https://vicentevivan.github.io/GeoCLIP/},
year = {2023},
date = {2023-12-11},
publisher = {Thirty-seventh Conference on Neural Information Processing Systems},
abstract = {Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth. This task has considerable challenges due to immense variation in geographic landscapes. The image-to-image retrieval-based approaches fail to solve this problem on a global scale as it is not feasible to construct a large gallery of images covering the entire world. Instead, existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task. However, their performance is limited by the predefined classes and often results in inaccurate localizations when an image’s location significantly deviates from its class center. To overcome these limitations, we propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations. GeoCLIP’s location encoder models the Earth as a continuous function by employing positional encoding through random Fourier features and constructing a hierarchical representation that captures information at varying resolutions to yield a semantically rich high-dimensional feature suitable to use even beyond geo-localization. To the best of our knowledge, this is the first work employing GPS encoding for geo-localization. We demonstrate the efficacy of our method via extensive experiments and ablations on benchmark datasets. We achieve competitive performance with just 20% of training data, highlighting its effectiveness even in limited-data settings. Furthermore, we qualitatively demonstrate geo-localization using a text query by leveraging the CLIP backbone of our image encoder.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
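The GeoCLIP abstract describes encoding GPS coordinates with positional encoding via random Fourier features at multiple resolutions. A small NumPy sketch of random Fourier features over (lat, lon) is given below; the feature dimensions and frequency scales are arbitrary illustrations, not the values used in the paper, and the learned MLP layers on top are omitted.

import numpy as np

def gps_fourier_features(latlon: np.ndarray, num_feats: int = 128, scales=(1.0, 10.0, 100.0), seed: int = 0):
    """Random Fourier features of (N, 2) lat/lon pairs, concatenated over several frequency scales."""
    rng = np.random.default_rng(seed)
    outs = []
    for s in scales:
        b = rng.normal(size=(2, num_feats)) * s        # random projection at this frequency scale
        proj = 2 * np.pi * latlon @ b
        outs.append(np.concatenate([np.sin(proj), np.cos(proj)], axis=1))
    return np.concatenate(outs, axis=1)                # (N, 2 * num_feats * len(scales))

if __name__ == "__main__":
    coords = np.array([[28.6024, -81.2001],            # Orlando
                       [48.8566, 2.3522]])             # Paris
    feats = gps_fourier_features(coords)
    print(feats.shape)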
Kini, Jyoti; Khan, Fahad Shahbaz; Khan, Salman; Shah, Mubarak
CT-VOS: Cutout Prediction and Tagging for Self-Supervised Video Object Segmentation Journal Article
In: Computer Vision and Image Understanding, 2023.
@article{Kini2023c,
title = {CT-VOS: Cutout Prediction and Tagging for Self-Supervised Video Object Segmentation},
author = {Jyoti Kini and Fahad Shahbaz Khan and Salman Khan and Mubarak Shah},
year = {2023},
date = {2023-10-09},
journal = {Computer Vision and Image Understanding},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Hanif, Asif; Naseer, Muzammal; Khan, Salman; Shah, Mubarak; Khan, Fahad Shahbaz
Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation Conference
The 26th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2023, 2023.
@conference{nokey,
title = {Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation},
author = {Asif Hanif and Muzammal Naseer and Salman Khan and Mubarak Shah and Fahad Shahbaz Khan},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Frequency-Domain-Adversarial-Training-for-Robust-Volumetric-Medical-Segmentation.pdf
https://github.com/asif-hanif/vafa},
doi = {https://doi.org/10.48550/arXiv.2307.07269},
year = {2023},
date = {2023-10-08},
publisher = {The 26th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2023},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Vahidian, Saeed; Kadaveru, Sreevatsank; Baek, Woonjoon; Wang, Weijia; Kungurtsev, Vyacheslav; Chen, Chen; Shah, Mubarak; Lin, Bill
When Do Curricula Work in Federated Learning? Conference
IEEE/CVF International Conference on Computer Vision, 2023.
@conference{Vahidian2023b,
title = {When Do Curricula Work in Federated Learning? },
author = {Saeed Vahidian and Sreevatsank Kadaveru and Woonjoon Baek and Weijia Wang and Vyacheslav Kungurtsev and Chen Chen and Mubarak Shah and Bill Lin},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2212.12712.pdf
https://arxiv.org/abs/2212.12712},
doi = {https://doi.org/10.48550/arXiv.2212.12712},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
publisher = {IEEE/CVF International Conference on Computer Vision},
abstract = {An oft-cited open problem of federated learning is the existence of data heterogeneity at the clients. One pathway to understanding the drastic accuracy drop in federated learning is by scrutinizing the behavior of the clients' deep models on data with different levels of "difficulty", which has been left unaddressed. In this paper, we investigate a different and rarely studied dimension of FL: ordered learning. Specifically, we aim to investigate how ordered learning principles can contribute to alleviating the heterogeneity effects in FL. We present theoretical analysis and conduct extensive empirical studies on the efficacy of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random curriculum. We find that curriculum learning largely alleviates non-IIDness. Interestingly, the more disparate the data distributions across clients the more they benefit from ordered learning. We provide analysis explaining this phenomenon, specifically indicating how curriculum training appears to make the objective landscape progressively less convex, suggesting fast converging iterations at the beginning of the training procedure. We derive quantitative results of convergence for both convex and nonconvex objectives by modeling the curriculum training on federated devices as local SGD with locally biased stochastic gradients. Also, inspired by ordered learning, we propose a novel client selection technique that benefits from the real-world disparity in the clients. Our proposed approach to client selection has a synergic effect when applied together with ordered learning in FL.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
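The abstract above compares curriculum, anti-curriculum, and random orderings of client training data in federated learning. The sketch below shows only the per-client scoring-and-ordering step, using per-sample loss from the current model as the difficulty score; the federated aggregation, client selection, and convergence analysis discussed in the paper are out of scope here, and the loss values are placeholders.

import numpy as np

def curriculum_order(losses: np.ndarray, mode: str = "curriculum", seed: int = 0) -> np.ndarray:
    """Order sample indices by per-sample loss: easy-to-hard, hard-to-easy, or random."""
    if mode == "curriculum":            # easiest (lowest loss) first
        return np.argsort(losses)
    if mode == "anti-curriculum":       # hardest first
        return np.argsort(-losses)
    if mode == "random":
        return np.random.default_rng(seed).permutation(len(losses))
    raise ValueError(mode)

if __name__ == "__main__":
    per_sample_loss = np.array([0.2, 1.5, 0.7, 3.1, 0.05])   # stand-in for losses scored by the current model
    for mode in ("curriculum", "anti-curriculum", "random"):
        print(mode, curriculum_order(per_sample_loss, mode))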
Samarasinghe, Sarinda; Rizve, Mamshad Nayeem; Kardan, Navid; Shah, Mubarak
CDFSL-V: Cross-Domain Few-Shot Learning for Videos Conference
IEEE/CVF International Conference on Computer Vision, 2023.
@conference{Samarasinghe2023,
title = {CDFSL-V: Cross-Domain Few-Shot Learning for Videos},
author = {Sarinda Samarasinghe and Mamshad Nayeem Rizve and Navid Kardan and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/CDFSL_Video_Combined_Final.pdf
https://sarinda251.github.io/CDFSL-V-site/
https://www.youtube.com/watch?v=RdlEzfW013o},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
publisher = {IEEE/CVF International Conference on Computer Vision},
abstract = {Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic as novel categories may come from different data domains that may have different spatial and temporal characteristics. This dissimilarity between the source and target domains can pose a significant challenge, rendering traditional few-shot action recognition techniques ineffective. To address this issue, in this work, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains. In particular, our method employs a masked autoencoder-based self-supervised training objective to learn from both source and target data in a self-supervised manner. Then a progressive curriculum balances learning the discriminative information from the source dataset with the generic information learned from the target domain. Initially, our curriculum utilizes supervised learning to learn class discriminative features from the source data. As the training progresses, we transition to learning target-domain-specific features. We propose a progressive curriculum to encourage the emergence of rich features in the target domain based on class discriminative supervised features in the source domain. We evaluate our method on several challenging benchmark datasets and demonstrate that our approach outperforms existing cross-domain few-shot learning techniques.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Sirnam, Swetha; Rizve, Mamshad Nayeem; Kuehne, Hilde; Shah, Mubarak
Preserving Modality Structure Improves Multi-Modal Learning Conference
IEEE/CVF International Conference on Computer Vision, 2023.
@conference{nokey,
title = {Preserving Modality Structure Improves Multi-Modal Learning },
author = {Swetha Sirnam and Mamshad Nayeem Rizve and Hilde Kuehne and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2308.13077.pdf
https://arxiv.org/abs/2308.13077
https://github.com/Swetha5/Multi_Sinkhorn_Knopp
https://swetha5.github.io/MultiSK/
https://youtu.be/1CrGkUATy50
},
doi = {https://doi.org/10.48550/arXiv.2308.13077},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
publisher = {IEEE/CVF International Conference on Computer Vision},
abstract = {Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and represent the multifaceted relationship between samples with respect to their relationship with these anchors. To assign multiple anchors to each sample, we propose a novel Multi-Assignment Sinkhorn-Knopp algorithm. Our experimentation demonstrates that our proposed approach learns semantically meaningful anchors in a self-supervised manner. Furthermore, our evaluation on MSR-VTT and YouCook2 datasets demonstrates that our proposed multi-anchor assignment based solution achieves state-of-the-art performance and generalizes to both in- and out-of-domain datasets. Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
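The method above assigns multiple anchors per sample with a Multi-Assignment Sinkhorn-Knopp algorithm. Below is a sketch of the standard balanced Sinkhorn-Knopp normalization that such methods build on, turning a sample-to-anchor score matrix into a balanced soft assignment; the multi-assignment variant proposed in the paper is not reproduced, and epsilon and the iteration count are arbitrary.

import numpy as np

def sinkhorn_knopp(scores: np.ndarray, epsilon: float = 0.05, n_iters: int = 50) -> np.ndarray:
    """Balanced soft assignment of N samples to K anchors from an (N, K) score matrix."""
    q = np.exp(scores / epsilon)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)      # equalize anchor (column) marginals
        q /= k
        q /= q.sum(axis=1, keepdims=True)      # equalize sample (row) marginals
        q /= n
    return q * n                               # rows sum to ~1: a soft assignment per sample

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sims = rng.normal(size=(8, 4))             # similarities of 8 samples to 4 anchors
    assign = sinkhorn_knopp(sims)
    print(assign.round(3), assign.sum(axis=1))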
Liu, Daochang; Li, Qiyue; Dinh, Anh-Dung; Jiang, Tingting; Shah, Mubarak; Xu, Chang
Diffusion Action Segmentation Conference
IEEE/CVF International Conference on Computer Vision, 2023.
@conference{Liu2023b,
title = {Diffusion Action Segmentation},
author = {Daochang Liu and Qiyue Li and Anh-Dung Dinh and Tingting Jiang and Mubarak Shah and Chang Xu},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2303.17959.pdf
https://arxiv.org/abs/2303.17959
https://finspire13.github.io/DiffAct-Project-Page/
https://github.com/Finspire13/DiffAct
https://youtu.be/o_Jp8shth7U
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Slides.pptx},
doi = { https://doi.org/10.48550/arXiv.2303.17959},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
publisher = {IEEE/CVF International Conference on Computer Vision},
abstract = {Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm by using multi-stage models. We propose a novel framework via denoising diffusion models, which nonetheless shares the same inherent spirit of such iterative refinement. In this framework, action predictions are iteratively generated from random noise with input video features as conditions. To enhance the modeling of three striking characteristics of human actions, including the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs in our framework. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action segmentation.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
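To illustrate the iterative-refinement idea in the abstract above, here is a toy denoising loop that produces frame-wise action scores from noise conditioned on video features; the tiny model, schedule, and shapes are assumptions made for the sketch, not the DiffAct implementation.

import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, feat_dim=64, n_classes=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim + n_classes, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, n_classes, 3, padding=1),
        )
    def forward(self, noisy_labels, video_feats, t):
        # noisy_labels: (B, C, T) noisy per-frame class scores
        # video_feats:  (B, F, T) conditioning features; t is unused in this toy model
        return self.net(torch.cat([noisy_labels, video_feats], dim=1))

@torch.no_grad()
def sample(model, video_feats, n_classes=11, steps=8):
    B, _, T = video_feats.shape
    x = torch.randn(B, n_classes, T)                 # start from pure noise
    for s in reversed(range(steps)):
        pred = model(x, video_feats, s)              # predict clean per-frame scores
        alpha = s / steps
        x = alpha * x + (1 - alpha) * pred           # simple interpolation toward the prediction
    return x.argmax(dim=1)                           # frame-wise action labels

model = ToyDenoiser()
labels = sample(model, torch.randn(2, 64, 100))
print(labels.shape)                                  # torch.Size([2, 100])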
Fioresi, Joseph; Dave, Ishan; Shah, Mubarak
TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection Conference
IEEE/CVF International Conference on Computer Vision, 2023.
@conference{Fioresi2023,
title = {TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection},
author = {Joseph Fioresi and Ishan Dave and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2308.11072.pdf
https://arxiv.org/abs/2308.11072
https://github.com/UCF-CRCV/TeD-SPAD
https://joefioresi718.github.io/TeD-SPAD_webpage/
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/TeDSPAD_ICCV_poster.pdf
https://youtu.be/3a9qeJUD1GU},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
publisher = {IEEE/CVF International Conference on Computer Vision},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Wasim, Syed Talal; Khattak, Muhammad Uzair; Naseer, Muzammal; Khan, Salman; Shah, Mubarak; Khan, Fahad Shahbaz
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition Conference
IEEE/CVF International Conference on Computer Vision, 2023.
@conference{nokey,
title = {Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition },
author = {Syed Talal Wasim and Muhammad Uzair Khattak and Muzammal Naseer and Salman Khan and Mubarak Shah and Fahad Shahbaz Khan },
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2307.06947.pdf
https://arxiv.org/abs/2307.06947
https://talalwasim.github.io/Video-FocalNets/
https://github.com/TalalWasim/Video-FocalNets
https://talalwasim.github.io/Video-FocalNets/#BibTeX},
doi = { https://doi.org/10.48550/arXiv.2307.06947},
year = {2023},
date = {2023-10-02},
urldate = {2023-10-02},
publisher = {IEEE/CVF International Conference on Computer Vision},
abstract = {Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably well against the state-of-the-art transformer-based models for video recognition on three large-scale datasets (Kinetics-400, Kinetics-600, and SS-v2) at a lower computational cost. Our code/models are publicly released.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
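A rough sketch of focal modulation on a token sequence, in the spirit described above: context is aggregated with depthwise convolutions at growing receptive fields and injected into each query token by element-wise multiplication rather than pairwise attention. The 1D simplification and layer sizes are assumptions.

import torch
import torch.nn as nn

class FocalModulation1D(nn.Module):
    def __init__(self, dim=96, levels=2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.ctx = nn.Linear(dim, dim + levels + 1)     # context features + per-level gates
        self.levels = levels
        self.aggs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3 + 2 * l, padding=1 + l, groups=dim)
            for l in range(levels)
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, T, C) token sequence
        q = self.q(x)
        ctx, gates = torch.split(self.ctx(x), [x.shape[-1], self.levels + 1], dim=-1)
        ctx = ctx.transpose(1, 2)                        # (B, C, T) for convolution
        modulator = 0
        for l, agg in enumerate(self.aggs):
            ctx = torch.nn.functional.gelu(agg(ctx))     # hierarchical context aggregation
            modulator = modulator + ctx * gates[..., l].unsqueeze(1)
        modulator = modulator + ctx.mean(-1, keepdim=True) * gates[..., -1].unsqueeze(1)
        out = q * modulator.transpose(1, 2)              # interaction via element-wise product
        return self.proj(out)

tokens = torch.randn(2, 196, 96)
print(FocalModulation1D()(tokens).shape)                 # torch.Size([2, 196, 96])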
Thawakar, Omkar; Anwer, Rao Muhammad; Laaksonen, Jorma; Reiner, Orly; Shah, Mubarak; Khan, Fahad Shahbaz
3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers Conference
Lecture Notes in Computer Science, vol. 14227, Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, 2023, ISBN: 978-3-031-43993-3.
@conference{nokey,
title = {3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers},
author = {Omkar Thawakar and Rao Muhammad Anwer and Jorma Laaksonen and Orly Reiner and Mubarak Shah and Fahad Shahbaz Khan},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2303.12073.pdf
https://github.com/OmkarThawakar/STT-UNET
https://arxiv.org/pdf/2303.12073.pdf
https://link.springer.com/chapter/10.1007/978-3-031-43993-3_59
},
doi = {https://doi.org/10.1007/978-3-031-43993-3_59},
isbn = {978-3-031-43993-3},
year = {2023},
date = {2023-10-01},
booktitle = {Lecture Notes in Computer Science},
journal = {arXiv:2303.12073},
volume = {14227},
publisher = {Medical Image Computing and Computer Assisted Intervention – MICCAI 2023},
abstract = {Accurate 3D mitochondria instance segmentation in electron microscopy (EM) is a challenging problem and serves as a prerequisite to empirically analyze their distributions and morphology. Most existing approaches employ 3D convolutions to obtain representative features. However, these convolution-based approaches struggle to effectively capture long-range dependencies in the volume mitochondria data, due to their limited local receptive field. To address this, we propose a hybrid encoder-decoder framework based on a split spatio-temporal attention module that efficiently computes spatial and temporal self-attentions in parallel, which are later fused through a deformable convolution. Further, we introduce a semantic foreground-background adversarial loss during training that aids in delineating the region of mitochondria instances from the background clutter. Our extensive experiments on three benchmarks, Lucchi, MitoEM-R and MitoEM-H, reveal the benefits of the proposed contributions achieving state-of-the-art results on all three datasets. Our code and models are available at https://github.com/OmkarThawakar/STT-UNET.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
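The split spatio-temporal attention idea above can be sketched as two parallel self-attention branches, one within each slice and one across slices per spatial location. The fusion by plain summation below stands in for the paper's deformable-convolution fusion, and the shapes are illustrative.

import torch
import torch.nn as nn

class SplitSTAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, N, C) slices x tokens x channels
        B, T, N, C = x.shape
        xs = x.reshape(B * T, N, C)            # attend over spatial tokens within each slice
        s, _ = self.spatial(xs, xs, xs)
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)   # attend across slices per location
        t, _ = self.temporal(xt, xt, xt)
        s = s.reshape(B, T, N, C)
        t = t.reshape(B, N, T, C).permute(0, 2, 1, 3)
        return x + s + t                       # parallel branches fused by summation

vol_tokens = torch.randn(1, 8, 256, 64)        # 8 slices, 16x16 tokens per slice
print(SplitSTAttention()(vol_tokens).shape)    # torch.Size([1, 8, 256, 64])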
Kini, Jyoti; Fleischer, Sarah; Dave, Ishan; Shah, Mubarak
Ensemble Modeling for Multimodal Visual Action Recognition Workshop
22nd International Conference on Image Analysis and Processing Workshops - Multimodal Action Recognition on the MECCANO Dataset, 2023.
@workshop{Kini2023b,
title = {Ensemble Modeling for Multimodal Visual Action Recognition},
author = {Jyoti Kini and Sarah Fleischer and Ishan Dave and Mubarak Shah},
url = {https://arxiv.org/pdf/2308.05430.pdf
https://www.crcv.ucf.edu/research/projects/ensemble-modeling-for-multimodal-visual-action-recognition/},
year = {2023},
date = {2023-09-11},
urldate = {2023-09-11},
booktitle = {22nd International Conference on Image Analysis and Processing Workshops - Multimodal Action Recognition on the MECCANO Dataset},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
de Blegiers, Tristan; Dave, Ishan Rajendrakumar; Yousaf, Adeel; Shah, Mubarak
EventTransAct: A Video Transformer-based Framework for Event-camera Based Action Recognition Conference
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023.
@conference{deBlegiers2023,
title = {EventTransAct: A Video Transformer-based Framework for Event-camera Based Action Recognition},
author = {Tristan de Blegiers and Ishan Rajendrakumar Dave and Adeel Yousaf and Mubarak Shah},
url = {https://arxiv.org/pdf/2308.13711
https://github.com/tristandb8/EventTransAct
https://tristandb8.github.io/EventTransAct_webpage/
https://www.youtube.com/watch?v=YCff-rTrgco},
year = {2023},
date = {2023-08-25},
publisher = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Zhu, Sijie; Yang, Linjie; Chen, Chen; Shah, Mubarak; Shen, Xiaohui; Wang, Heng
R2Former: Unified retrieval and ranking Transformer for Place Recognition Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Zhu2023,
title = {R2Former: Unified retrieval and ranking Transformer for Place Recognition},
author = {Sijie Zhu and Linjie Yang and Chen Chen and Mubarak Shah and Xiaohui Shen and Heng Wang},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/CVPR_2023_PlaceRecognitionFinal.pdf
https://arxiv.org/pdf/2304.03410.pdf
https://github.com/Jeff-Zilence/R2Former},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
abstract = {Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only considers geometric information but ignores other possible information that could be useful for reranking, e.g. local feature correlation, and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named R2Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether the image pair is from the same location. The whole pipeline is end-to-end trainable and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, R2Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves the state-of-the-art on the holdout MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show vision transformer tokens are comparable and sometimes better than CNN local features on local matching. The code will be publicly available. },
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
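A hedged sketch of the reranking idea: a small transformer scores a query-reference pair from per-match descriptors built from local feature correlation, attention values, and (x, y) coordinates. The feature dimensions and the mean-pooled head are assumptions, not the released R2Former code.

import torch
import torch.nn as nn

class PairReranker(nn.Module):
    def __init__(self, d_model=32):
        super().__init__()
        # each match: [correlation, query attn, ref attn, qx, qy, rx, ry] -> 7 dims
        self.embed = nn.Linear(7, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, match_feats):                # (B, n_matches, 7)
        tokens = self.encoder(self.embed(match_feats))
        return self.head(tokens.mean(dim=1))       # logit: same place or not

pairs = torch.randn(4, 100, 7)
print(PairReranker()(pairs).shape)                 # torch.Size([4, 1])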
Gupta, Rohit; Roy, Anirban; Kim, Sujeong; Christensen, Claire; Grindal, Todd; Gerard, Sarah Nixon; Cincebeaux, Madeline; Divakaran, Ajay; Shah, Mubarak
Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Gupta2023b,
title = {Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos},
author = {Rohit Gupta and Anirban Roy and Sujeong Kim and Claire Christensen and Todd Grindal and Sarah Nixon Gerard and Madeline Cincebeaux and Ajay Divakaran and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Rohit_SRI_CVPR2023_Multi_Modal_Multi_Label_Contrastive_Learning_Camera_Ready-4.pdf
https://www.rohitg.xyz/MMContrast/
https://nusci.csl.sri.com/project/APPROVE},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
abstract = {The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include ‘letter names’, ‘letter sounds’, and math codes include ‘counting’, ‘sorting’. We pose this as a fine-grained multilabel classification problem as videos can contain multiple types of educational content and the content classes can get visually similar (e.g., ‘letter names’ vs ‘letter sounds’). We propose a novel class prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues is crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as Youtube-8M, and COIN. The dataset is available at https://nusci.csl.sri.com/project/APPROVE.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
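The class-prototype contrastive objective can be sketched as follows for multi-label samples: each embedding is pulled toward the prototypes of its labels and pushed away from the rest. The temperature and shapes are illustrative assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def prototype_contrastive_loss(embeddings, prototypes, labels, tau=0.1):
    """embeddings: (B, D), prototypes: (K, D), labels: (B, K) multi-hot."""
    z = F.normalize(embeddings, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.t() / tau                     # (B, K) sample-prototype similarities
    log_prob = F.log_softmax(logits, dim=-1)
    # average log-likelihood over the positive prototypes of each sample
    pos = (labels * log_prob).sum(dim=-1) / labels.sum(dim=-1).clamp(min=1)
    return -pos.mean()

emb = torch.randn(8, 128, requires_grad=True)
proto = torch.randn(19, 128, requires_grad=True)     # e.g. one prototype per education code
lab = (torch.rand(8, 19) > 0.8).float()
print(prototype_contrastive_loss(emb, proto, lab))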
Dave, Ishan Rajendrakumar; Rizve, Mamshad Nayeem; Chen, Chen; Shah, Mubarak
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Dave2023,
title = {TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition},
author = {Ishan Rajendrakumar Dave and Mamshad Nayeem Rizve and Chen Chen and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/TimeBalance_CVPR23_arxiv.pdf
https://daveishan.github.io/timebalance_webpage/
https://github.com/DAVEISHAN/TimeBalance},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
abstract = {Semi-Supervised Learning can be more beneficial for the video domain compared to images because of its higher annotation cost and dimensionality. Besides, any video understanding task requires reasoning over both spatial and temporal dimensions. In order to learn both the static and motion related features for the semi-supervised action recognition task, existing methods rely on hard input inductive biases like using two-modalities (RGB and Optical-flow) or two-stream of different playback rates. Instead of utilizing unlabeled videos through diverse input streams, we rely on self-supervised video representations, particularly, we utilize temporally-invariant and temporally-distinctive representations. We observe that these representations complement each other depending on the nature of the action. Based on this observation, we propose a student-teacher semi-supervised learning framework, TimeBalance, where we distill the knowledge from a temporally-invariant and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers based on a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
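A sketch (assumptions throughout) of distilling an unlabeled clip from the two teachers, with the weighting derived from how similar the clip's temporal segments are to one another: static-looking clips lean on the temporally-invariant teacher, dynamic ones on the temporally-distinctive teacher. This is not the released TimeBalance reweighting scheme.

import torch
import torch.nn.functional as F

def teacher_weights(segment_feats):
    """segment_feats: (B, S, D) features of S temporal segments of the same video."""
    z = F.normalize(segment_feats, dim=-1)
    sim = torch.einsum('bsd,btd->bst', z, z)           # pairwise segment similarity
    S = sim.shape[1]
    off_diag = (sim.sum(dim=(1, 2)) - S) / (S * (S - 1))
    w_inv = off_diag.clamp(0, 1)                       # static-looking clip -> trust invariant teacher
    return w_inv, 1 - w_inv

def distillation_loss(student_logits, inv_logits, dis_logits, segment_feats, T=2.0):
    w_inv, w_dis = teacher_weights(segment_feats)
    target = w_inv[:, None] * F.softmax(inv_logits / T, -1) + \
             w_dis[:, None] * F.softmax(dis_logits / T, -1)
    return F.kl_div(F.log_softmax(student_logits / T, -1), target, reduction='batchmean')

B, S, D, C = 4, 8, 256, 101
loss = distillation_loss(torch.randn(B, C), torch.randn(B, C), torch.randn(B, C),
                         torch.randn(B, S, D))
print(loss)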
Rizve, Mamshad Nayeem; Mittal, Gaurav; Yu, Ye; Hall, Matthew; Sajeev, Sandra; Shah, Mubarak; Chen, Mei
PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Rizve2023,
title = {PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization},
author = {Mamshad Nayeem Rizve and Gaurav Mittal and Ye Yu and Matthew Hall and Sandra Sajeev and Mubarak Shah and Mei Chen},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/PivoTAL_CVPR_2023.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/PivoTAL_CVPR_2023_Supplemental_Material.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/PivoTAL_CVPR2023_Poster.pdf
https://www.youtube.com/watch?v=6kAoQjXfzio},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Urooj, Aisha; Kuehne, Hilde; Wu, Bo; Chheu, Kim; Bousselham, Walid; Gan, Chuang; Lobo, Niels; Shah, Mubarak
Learning Situation Hyper-Graphs for Video Question Answering Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Urooj2023,
title = {Learning Situation Hyper-Graphs for Video Question Answering},
author = {Aisha Urooj and Hilde Kuehne and Bo Wu and Kim Chheu and Walid Bousselham and Chuang Gan and Niels Lobo and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2023072364-4.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/SHG_VQA_CVPR2023_cam_ready_supp.pdf
},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
abstract = {Answering questions about complex situations in videos requires not only capturing the presence of actors, objects, and their relations but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs and has been proposed to capture all such information in a compact structured form. In this work, we propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs, coined Situation Hyper-Graph based Video Question Answering (SHG-VQA). To this end, we train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip, and to use cross-attention between the predicted situation hyper-graphs and the question embedding to predict the correct answer. The proposed method is trained in an end-to-end manner and optimized by a VQA loss with the cross-entropy function and a Hungarian matching loss for the situation graph prediction. The effectiveness of the proposed architecture is extensively evaluated on two challenging benchmarks: AGQA and STAR. Our results show that learning the underlying situation hypergraphs helps the system to significantly improve its performance for novel challenges of video question-answering tasks. },
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
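The Hungarian matching loss used for the situation-graph prediction can be sketched as a standard set-prediction step: predicted graph-token logits are matched to ground-truth relation labels before a cross-entropy loss. The query count, background class, and cost below are assumptions for illustration.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_set_loss(pred_logits, gt_labels):
    """pred_logits: (Q, C) per-query class logits; gt_labels: (M,) ints, M <= Q."""
    prob = pred_logits.softmax(-1)
    cost = -prob[:, gt_labels]                          # (Q, M): negative prob of each gt class
    row, col = linear_sum_assignment(cost.detach().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    no_rel = pred_logits.shape[1] - 1                   # treat the last class as "no relation"
    target = torch.full((pred_logits.shape[0],), no_rel)
    target[row] = gt_labels[col]                        # matched queries receive the gt labels
    return F.cross_entropy(pred_logits, target)

logits = torch.randn(16, 51, requires_grad=True)        # 16 queries, 50 relations + background
gt = torch.randint(0, 50, (5,))
print(hungarian_set_loss(logits, gt))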
Bhunia, Ankan Kumar; Khan, Salman; Cholakkal, Hisham; Anwer, Rao Muhammad; Laaksonen, Jorma Tapio; Shah, Mubarak; Khan, Fahad
Person Image Synthesis via Denoising Diffusion Model Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Bhunia2023,
title = {Person Image Synthesis via Denoising Diffusion Model},
author = {Ankan Kumar Bhunia and Salman Khan and Hisham Cholakkal and Rao Muhammad Anwer and Jorma Tapio Laaksonen and Mubarak Shah and Fahad Khan},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/person_image_synthesis_via_den-Camera-ready-PDF.pdf
https://lnkd.in/d-8v3r8B
https://lnkd.in/dGPTjvge
https://lnkd.in/dxcGQsUX
https://github.com/ankanbhunia/PIDM},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
abstract = {The pose-guided person image generation task requires synthesizing photorealistic images of humans in arbitrary poses. The existing approaches use generative adversarial networks that do not necessarily maintain realistic textures or need dense correspondences that struggle to handle complex deformations and severe occlusions. In this work, we show how denoising diffusion models can be applied for high-fidelity person image synthesis with strong sample diversity and enhanced mode coverage of the learnt data distribution. Our proposed Person Image Diffusion Model (PIDM) disintegrates the complex transfer problem into a series of simpler forward-backward denoising steps. This helps in learning plausible source-to-target transformation trajectories that result in faithful textures and undistorted appearance details. We introduce a ‘texture diffusion module’ based on cross-attention to accurately model the correspondences between appearance and pose information available in source and target images. Further, we propose ‘disentangled classifier-free guidance’ to ensure close resemblance between the conditional inputs and the synthesized output in terms of both pose and appearance information. Our extensive results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios. We also show how our generated images can help in downstream tasks. Code is available at https://github.com/ankanbhunia/PIDM.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
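A small sketch of disentangled classifier-free guidance at sampling time: the noise prediction is nudged separately toward the pose condition and the appearance condition. The guidance scales and the eps_model signature are assumptions; the toy model only exists to make the snippet executable.

import torch

def guided_eps(eps_model, x_t, t, pose, style, w_pose=2.0, w_style=2.0):
    eps_uncond = eps_model(x_t, t, pose=None, style=None)
    eps_pose   = eps_model(x_t, t, pose=pose, style=None)
    eps_full   = eps_model(x_t, t, pose=pose, style=style)
    # start from the unconditional prediction, then add the two condition-specific offsets
    return eps_uncond + w_pose * (eps_pose - eps_uncond) + w_style * (eps_full - eps_pose)

# toy noise predictor, only to make the sketch runnable
def eps_model(x_t, t, pose=None, style=None):
    out = x_t * 0.1
    if pose is not None:
        out = out + 0.01 * pose
    if style is not None:
        out = out + 0.01 * style
    return out

x = torch.randn(1, 3, 64, 64)
print(guided_eps(eps_model, x, 10, torch.randn_like(x), torch.randn_like(x)).shape)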
Wasim, Syed Talal; Naseer, Muzammal; Khan, Salman; Khan, Fahad; Shah, Mubarak
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Wasim2023,
title = {Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting},
author = {Syed Talal Wasim and Muzammal Naseer and Salman Khan and Fahad Khan and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/vita_clip_video_and_text_adapt-Camera-ready-PDF.pdf
},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
abstract = {Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes significant drop in supervised accuracy. Because of this, recent works in literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes and models will be publicly released. },
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
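The vision-side prompting scheme can be sketched as learnable global, per-frame, and summary tokens concatenated to frozen frame tokens before the (frozen) encoder, so only the prompt parameters are trained. The token counts and dimensions below are assumptions, not the Vita-CLIP configuration.

import torch
import torch.nn as nn

class PromptedVideoTokens(nn.Module):
    def __init__(self, dim=512, n_frames=8, n_global=4):
        super().__init__()
        self.global_prompts = nn.Parameter(torch.randn(n_global, dim) * 0.02)
        self.local_prompts = nn.Parameter(torch.randn(n_frames, dim) * 0.02)
        self.summary = nn.Parameter(torch.randn(1, dim) * 0.02)

    def forward(self, frame_tokens):           # (B, n_frames, n_patches, dim) from a frozen backbone
        B, T, N, D = frame_tokens.shape
        x = frame_tokens.flatten(1, 2)         # (B, T*N, D)
        g = self.global_prompts.expand(B, -1, -1)
        l = self.local_prompts.expand(B, -1, -1)       # one learnable token per frame
        s = self.summary.expand(B, -1, -1)
        return torch.cat([s, g, l, x], dim=1)  # only the prompt parameters are trainable

tokens = torch.randn(2, 8, 49, 512)
print(PromptedVideoTokens()(tokens).shape)     # torch.Size([2, 405, 512])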
Clark, Brandon Eric; Kerrigan, Alec; Kulkarni, Parth Parag; Cepeda, Vicente Vivanco; Shah, Mubarak
Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes Conference
IEEE Computer Vision and Pattern Recognition, 2023.
@conference{Clark2023,
title = {Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes},
author = {Brandon Eric Clark and Alec Kerrigan and Parth Parag Kulkarni and Vicente Vivanco Cepeda and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/Camera-Ready-Full-Paper.pdf
https://github.com/AHKerrigan/GeoGuessNet
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/CVPR23-Poster_THU-PM-246-1.pdf
https://www.youtube.com/watch?v=fp3hZGbwPqk},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
publisher = {IEEE Computer Vision and Pattern Recognition},
abstract = {Determining the exact latitude and longitude that a photo was taken is a useful and widely applicable task, yet it remains exceptionally difficult despite the accelerated progress of other computer vision tasks. Most previous approaches have opted to learn single representations of query images, which are then classified at different levels of geographic granularity. These approaches fail to exploit the different visual cues that give context to different hierarchies, such as the country, state, and city level. To this end, we introduce an end-to-end transformer-based architecture that exploits the relationship between different geographic levels (which we refer to as hierarchies) and the corresponding visual scene information in an image through hierarchical cross-attention. We achieve this by learning a query for each geographic hierarchy and scene type. Furthermore, we learn a separate representation for different environmental scenes, as different scenes in the same location are often defined by completely different visual features. We achieve state of the art accuracy on 4 standard geo-localization datasets: Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k, as well as qualitatively demonstrate how our method learns different representations for different visual hierarchies and scenes, which has not been demonstrated in the previous methods. The previous testing datasets mostly consist of iconic landmarks or images taken from social media, which makes the dataset a simple memory task, or makes it biased towards certain places. To address this issue we introduce a much harder testing dataset, Google-World-Streets-15k, comprised of images taken from Google Streetview covering the whole planet and present state of the art results. Our code can be found at https://github.com/AHKerrigan/GeoGuessNet.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
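The hierarchy-query idea can be sketched as one learnable query per geographic level cross-attending to image tokens, with each query feeding its own geo-cell classifier; the number of levels and classes below are assumptions, not the GeoGuessNet configuration.

import torch
import torch.nn as nn

class HierarchyQueries(nn.Module):
    def __init__(self, dim=256, hierarchies=(200, 1200, 7000)):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(len(hierarchies), dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.heads = nn.ModuleList(nn.Linear(dim, c) for c in hierarchies)

    def forward(self, image_tokens):                   # (B, N, dim) backbone tokens
        B = image_tokens.shape[0]
        q = self.queries.expand(B, -1, -1)
        q, _ = self.cross_attn(q, image_tokens, image_tokens)
        return [head(q[:, i]) for i, head in enumerate(self.heads)]

logits = HierarchyQueries()(torch.randn(2, 196, 256))
print([l.shape for l in logits])                       # coarse-to-fine geo-cell logits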
Zheng, Ce; Wu, Wenhan; Chen, Chen; Yang, Taojiannan; Zhu, Sijie; Shen, Ju; Kehtarnavaz, Nasser; Shah, Mubarak
Deep Learning-Based Human Pose Estimation: A Survey Journal Article
In: ACM Computing Surveys, 2023.
@article{Zheng2023c,
title = {Deep Learning-Based Human Pose Estimation: A Survey},
author = {Ce Zheng and Wenhan Wu and Chen Chen and Taojiannan Yang and Sijie Zhu and Ju Shen and Nasser Kehtarnavaz and Mubarak Shah},
editor = {Albert Y H Zomaya},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/3603618.pdf
https://github.com/zczcwh/DL-HPE},
doi = {10.1145/3603618},
year = {2023},
date = {2023-06-09},
journal = {ACM Computing Surveys},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Kini, Jyoti; Mian, Ajmal; Shah, Mubarak
3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds Conference
IEEE International Conference on Robotics and Automation, 2023.
@conference{Kini2023,
title = {3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds},
author = {Jyoti Kini and Ajmal Mian and Mubarak Shah},
url = {https://arxiv.org/pdf/2211.00746.pdf},
year = {2023},
date = {2023-05-29},
urldate = {2023-05-29},
booktitle = {IEEE International Conference on Robotics and Automation},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Sangam, Tushar; Dave, Ishan Rajendrakumar; Sultani, Waqas; Shah, Mubarak
TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos Conference
IEEE International Conference on Robotics and Automation, 2023.
@conference{Sangam2023,
title = {TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos},
author = {Tushar Sangam and Ishan Rajendrakumar Dave and Waqas Sultani and Mubarak Shah},
url = {https://arxiv.org/pdf/2210.08423.pdf
https://tusharsangam.github.io/TransVisDrone-project-page/},
year = {2023},
date = {2023-05-29},
urldate = {2023-05-29},
booktitle = {IEEE International Conference on Robotics and Automation},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Yang, Peiyu; Akhtar, Naveed; Wen, Zeyi; Shah, Mubarak; Mian, Ajmal
Re-calibrating Feature Attributions for Model Interpretation Conference
Eleventh International Conference on Learning Representations (ICLR), notable top 25%, 2023.
@conference{nokey,
title = {Re-calibrating Feature Attributions for Model Interpretation},
author = {Peiyu Yang and Naveed Akhtar and Zeyi Wen and Mubarak Shah and Ajmal Mian},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {Eleventh International Conference on Learning Representations (ICLR)},
publisher = {Eleventh International Conference on Learning Representations (ICLR), notable top 25%},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Beetham, James; Kardan, Navid; Mian, Ajmal; Shah, Mubarak
Dual Student Networks for Data-Free Model Stealing Conference
Eleventh International Conference on Learning Representations, 2023.
@conference{Beetham2023b,
title = {Dual Student Networks for Data-Free Model Stealing},
author = {James Beetham and Navid Kardan and Ajmal Mian and Mubarak Shah},
url = {https://arxiv.org/abs/2309.10058},
year = {2023},
date = {2023-05-01},
booktitle = {Eleventh International Conference on Learning Representations},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Barbalau, Antonio; Ionescu, Radu Tudor; Georgescu, Mariana-Iuliana; Dueholm, Jacob; Ramachandra, Bharathkumar; Nasrollahi, Kamal; Khan, Fahad Shahbaz; Moeslund, Thomas B.; Shah, Mubarak
SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection Journal Article
In: Computer Vision and Image Understanding, 2023.
@article{Barbalau2023,
title = {SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection},
author = {Antonio Barbalau and Radu Tudor Ionescu and Mariana-Iuliana Georgescu and Jacob Dueholm and Bharathkumar Ramachandra and Kamal Nasrollahi and Fahad Shahbaz Khan and Thomas B. Moeslund and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/SSMTL.pdf},
year = {2023},
date = {2023-02-11},
urldate = {2023-02-11},
journal = {Computer Vision and Image Understanding},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Vahidian, Saeed; Morafah, Mahdi; Wang, Weijia; Kungurtsev, Vyacheslav; Chen, Chen; Shah, Mubarak; Lin, Bill
Efficient Distribution Similarity Identification in Clustered Federated Learning via Principal Angles Between Client Data Subspaces Conference
37th AAAI Conference on Artificial Intelligence, 2023.
@conference{Vahidian2023,
title = {Efficient Distribution Similarity Identification in Clustered Federated Learning via Principal Angles Between Client Data Subspaces},
author = {Saeed Vahidian and Mahdi Morafah and Weijia Wang and Vyacheslav Kungurtsev and Chen Chen and Mubarak Shah and Bill Lin},
url = {https://arxiv.org/abs/2209.10526},
year = {2023},
date = {2023-02-07},
urldate = {2023-02-07},
publisher = {37th AAAI Conference on Artificial Intelligence},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Gupta, Rohit; Akhtar, Naveed; Mian, Ajmal; Shah, Mubarak
Contrastive Self-Supervised Learning Leads to Higher Adversarial Susceptibility Conference
37th AAAI Conference on Artificial Intelligence, 2023.
@conference{Gupta2023,
title = {Contrastive Self-Supervised Learning Leads to Higher Adversarial Susceptibility},
author = {Rohit Gupta and Naveed Akhtar and Ajmal Mian and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2207.10862.pdf},
year = {2023},
date = {2023-02-07},
publisher = {37th AAAI Conference on Artificial Intelligence},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
2022
Xu, Ziwei; Rawat, Yogesh; Wong, Yongkang; Kankanhalli, Mohan; Shah, Mubarak
Don’t Pour Cereal into Coffee: Differentiable Temporal Logic for Temporal Action Segmentation Conference
36th Conference on Neural Information Processing Systems (NeurIPS 2022), 2022.
@conference{Xu2022,
title = {Don’t Pour Cereal into Coffee: Differentiable Temporal Logic for Temporal Action Segmentation},
author = {Ziwei Xu and Yogesh Rawat and Yongkang Wong and Mohan Kankanhalli and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/ziwei_neurips2022.pdf
https://diff-tl.github.io/
https://github.com/ZiweiXU/DTL-action-segmentation},
year = {2022},
date = {2022-11-09},
urldate = {2022-11-09},
publisher = {36th Conference on Neural Information Processing Systems (NeurIPS 2022)},
abstract = {We propose Differentiable Temporal Logic (DTL), a model-agnostic framework that introduces temporal constraints to deep networks. DTL treats the outputs of a network as a truth assignment of a temporal logic formula, and computes a temporal logic loss reflecting the consistency between the output and the constraints. We propose a comprehensive set of constraints, which are implicit in data annotations, and incorporate them with deep networks via DTL. We evaluate the effectiveness of DTL on the temporal action segmentation task and observe improved performance and reduced logical errors in the output of different task models. Furthermore, we provide an extensive analysis to visualize the desirable effects of DTL.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
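A toy differentiable temporal-logic penalty in the spirit of DTL: per-frame class probabilities are treated as soft truth values and an ordering constraint ("a must happen before b") is turned into a differentiable violation score. The soft semantics used here (a running max as a soft "has already occurred") is an assumed simplification, not the paper's formulation.

import torch

def soft_before_violation(probs, a, b):
    """probs: (T, C) frame-wise class probabilities; constraint: action a happens before b."""
    p_a, p_b = probs[:, a], probs[:, b]
    # soft "b has already occurred by frame t": running max of p_b up to t-1
    b_so_far = torch.cummax(torch.cat([torch.zeros(1), p_b[:-1]]), dim=0).values
    # violation: frames where a is predicted although b already occurred
    return (p_a * b_so_far).max()

T, C = 50, 5
probs = torch.softmax(torch.randn(T, C, requires_grad=True), dim=-1)
loss = soft_before_violation(probs, a=1, b=3)
print(loss)   # add to the task loss to discourage logically inconsistent outputs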
Vyas, Shruti; Chen, Chen; Shah, Mubarak
GAMa: Cross-view Video Geo-localization Conference
European Conference on Computer Vision, 2022.
@conference{Vyas2022,
title = {GAMa: Cross-view Video Geo-localization},
author = {Shruti Vyas and Chen Chen and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/1512.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/1512-supp.pdf
https://youtu.be/KSHuer_VXJo},
year = {2022},
date = {2022-10-23},
urldate = {2022-10-23},
booktitle = {European Conference on Computer Vision},
abstract = {The existing work in cross-view geo-localization is based on images where a ground panorama is matched to an aerial image. In this work, we focus on ground videos instead of images which provides additional contextual cues which are important for this task. There are no existing datasets for this problem, therefore we propose GAMa dataset, a large-scale dataset with ground videos and corresponding aerial images. We also propose a novel approach to solve this problem. At clip-level, a short video clip is matched with corresponding aerial image and is later used to get video-level geo-localization of a long video. Moreover, we propose a hierarchical approach to further improve the clip-level geo-localization. On this challenging dataset, with unaligned images and limited field of view, our proposed method achieves a Top-1 recall rate of 19.4% and 45.1% @1.0mile. Code & dataset are available at this link.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Khan, Aisha Urooj; Kuehne, Hilde; Gan, Chuang; Lobo, Niels Da Vitoria; Shah, Mubarak
Weakly Supervised Grounding for VQA in Vision-Language Transformers Conference
European Conference on Computer Vision, 2022.
@conference{Khan2022,
title = {Weakly Supervised Grounding for VQA in Vision-Language Transformers},
author = {Aisha Urooj Khan and Hilde Kuehne and Chuang Gan and Niels Da Vitoria Lobo and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/1011.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/1011-supp.pdf
https://github.com/aurooj/WSG-VQA-VLTransformers
https://youtu.be/dekmVb6lq3I},
year = {2022},
date = {2022-10-23},
urldate = {2022-10-23},
booktitle = {European Conference on Computer Vision},
abstract = {Transformers for visual-language representation learning have been getting a lot of interest and shown tremendous performance on visual question answering (VQA) and grounding. However, most systems that show good performance of those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. Our approach leverages capsules by transforming each visual token into a capsule representation in the visual encoder; it then uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA as well as VQA-HAT dataset for VQA grounding. Our experiments show that: while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
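A rough sketch of the text-guided capsule masking described above: visual tokens are grouped into capsules, capsule activations are taken from their norms, and language attention gates low-relevance capsules before the next layer. The group sizes and the gating rule are assumptions, not the released model.

import torch

def text_guided_capsule_mask(visual_tokens, text_attn, n_caps=8):
    """visual_tokens: (B, N, D); text_attn: (B, N) attention of the text over visual tokens."""
    B, N, D = visual_tokens.shape
    caps = visual_tokens.view(B, N, n_caps, D // n_caps)       # capsule grouping
    act = caps.norm(dim=-1)                                    # (B, N, n_caps) capsule activations
    gate = torch.sigmoid(act * text_attn.unsqueeze(-1))        # language-guided selection
    return (caps * gate.unsqueeze(-1)).view(B, N, D)

v = torch.randn(2, 49, 64)
a = torch.softmax(torch.randn(2, 49), dim=-1)
print(text_guided_capsule_mask(v, a).shape)                    # torch.Size([2, 49, 64])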
Rizve, Mamshad Nayeem; Kardan, Navid; Khan, Salman; Khan, Fahad Shahbaz; Shah, Mubarak
OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning Conference
European Conference on Computer Vision, 2022.
@conference{Rizve2022,
title = {OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning},
author = {Mamshad Nayeem Rizve and Navid Kardan and Salman Khan and Fahad Shahbaz Khan and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/6665.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/6665-supp.pdf
https://github.com/nayeemrizve/OpenLDN
https://youtu.be/p2lYqvklcjA},
year = {2022},
date = {2022-10-23},
urldate = {2022-10-23},
booktitle = {European Conference on Computer Vision},
abstract = {Semi-supervised learning (SSL) is one of the dominant approaches to address the annotation bottleneck of supervised learning. Recent SSL methods can effectively leverage a large repository of unlabeled data to improve performance while relying on a small set of labeled data. One common assumption in most SSL methods is that the labeled and unlabeled data are from the same data distribution. However, this is hardly the case in many real-world scenarios, which limits their applicability. In this work, instead, we attempt to solve the challenging open-world SSL problem that does not make such an assumption. In the open-world SSL problem, the objective is to recognize samples of known classes, and simultaneously detect and cluster samples belonging to novel classes present in unlabeled data. This work introduces OpenLDN that utilizes a pairwise similarity loss to discover novel classes. Using a bi-level optimization rule this pairwise similarity loss exploits the information available in the labeled set to implicitly cluster novel class samples, while simultaneously recognizing samples from known classes. After discovering novel classes, OpenLDN transforms the open-world SSL problem into a standard SSL problem to achieve additional performance gains using existing SSL methods. Our extensive experiments demonstrate that OpenLDN outperforms the current state-of-the-art methods on multiple popular classification benchmarks while providing a better accuracy/training time trade-off. Code: https://github.com/nayeemrizve/OpenLDN},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
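The pairwise-similarity objective for discovering novel classes can be sketched as below: a pseudo target derived from feature similarity supervises the probability that two unlabeled samples share a class. The threshold and the single-level (rather than bi-level) optimization are simplifying assumptions, not the OpenLDN procedure.

import torch
import torch.nn.functional as F

def pairwise_similarity_loss(feats, logits, pos_thresh=0.9):
    """feats: (B, D) embeddings; logits: (B, K) class logits over known + novel classes."""
    z = F.normalize(feats, dim=-1)
    feat_sim = z @ z.t()                                   # pseudo pairings from feature space
    target = (feat_sim > pos_thresh).float()
    prob = logits.softmax(-1)
    pred_sim = prob @ prob.t()                             # probability both samples share a class
    return F.binary_cross_entropy(pred_sim.clamp(1e-6, 1 - 1e-6), target)

feats = torch.randn(16, 128)
logits = torch.randn(16, 20, requires_grad=True)
print(pairwise_similarity_loss(feats, logits))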
- Fellow, ACM (Association for Computing Machinery), 2021.
- Fellow, NAI (National Academy of Inventors), 2020.
- Fellow, AAAS (American Association for the Advancement of Science), 2010.
- Fellow, SPIE (Society of Photographic Instrumentation Engineers), 2008.
- Fellow, IAPR (International Association for Pattern Recognition), 2006.
- Fellow, IEEE (Institute of Electrical and Electronics Engineers), 2003.
- ACM SIGMM award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications, 2019.
- Inducted to UCF Chapter of National Academy of Inventors, December 2017.
- UCF Luminary Award, October 2017.
- University Excellence in Research Award, 2017.
- Faculty Excellence in Mentoring Doctoral Students Award, 2017.
- Top 10% Paper Award at International Conference on Image Processing (ICIP-2014), 2014.
- 2nd place in NHK Where is beauty? Grand Challenge at the ACM Multimedia 2013 conference.
- NGA(National Geo-spatial Intelligence Agency) Best Research Poster award at NARP Symposium Award, 2013.
- University Distinguished Researcher award, 2012.
- College of Engineering & Computer Science Advisory Board Award for Faculty Excellence, 2011.
- Scholarship of Teaching and Learning (SoTL) award, 2011.
- Finalist for the Best Paper award, ACM Conference on Multimedia, 2010.
- ACM Distinguished Speaker (DSP), 2008-2014.
- University Distinguished Researcher award, 2007.
- Sindhi Association of North American award, 2007.
- Pegasus Professor, 2006.
- UCF Millionaires' Club, 2005, 2006, 2009-2013, 2015.
- Honorable mention, ICCV 2005 Where Am I? Challenge Problem.
- Finalist for the Best Paper award, ACM Conference on Multimedia, 2005.
- Research Incentive Award (RIA), 2003, 2009, 2014.
- Teaching Incentive Program (TIP) Award, 1996, 2003.
- IEEE Distinguished Visitors Program Speaker, 1997-2000.
- Engineering Achievement Award of Information Systems Division of Harris Corporation, 1999.
- IEEE Outstanding Engineering Educator Award, 1997.
- TOKTEN awards by UNDP, 1992, 1995, 2000.
- Philips International Institute Scholarship 1980.
- ACM Member
- ACM SIGMM Member
- IEEE Life Fellow
- IEEE Computer Society Member
- Where We Are and What We're Looking At, CVPR 2023, 18-22 June 2023
- Human Activity Recognition: Learning with Less Labels and Privacy Preservation, Keynote Talk at SPIE Automatic Target Recognition XXXII, 4-5 April 2022
- Overview of our Research 2022
- MMP Tracking Workshop Keynote - October 28, 2021
- 34 Years of Research Experience for Undergraduates in Computer Vision
- CVPR 2021 Tutorial: "Cross-View Geo-Localization: Ground-to-Aerial Image Matching"
- Current Funded Research at CRCV - June 2021
- Learning With Less Labels
- Adversarial Computer Vision
- Keynote (Person Re-Identification and Tracking in Multiple Non-Overlapping Cameras)
- SIGMM Technical Achievement Award 2019, Keynote Talk
- Capsule Networks for Computer Vision – CVPR 2019 Tutorial
- CAP6412 Advanced Computer Vision - Spring 2019
- Deep Learning
- CAP6412 Advanced Computer Vision - Spring 2018
- UCF Computer Vision Video Lectures 2014
- Multi-Object Tracking: Crowd Tracking and Group Action Recognition
- UCF Computer Vision Video Lectures 2012
- Have taught ten different courses at the graduate and undergraduate level, introduced a new honors course (co-taught with a Mathematics Professor), and directed numerous independent studies of undergraduate and graduate students;
- Have conducted seven short courses and tutorials in five different countries (Italy, US, Pakistan, Mexico, Taiwan) (http://www.cs.ucf.edu/vision/accv2000h-6.pdf.);
- I have authored an unpublished book, Fundamentals of Computer Vision, which I use for my class and is also available on the web: http://www.cs.ucf.edu/courses/cap6411/book.pdf.
- My pedagogical contributions are covered in four textbooks by popular authors: Computer Vision: Algorithms and Applications, Richard Szeliski; Computer and Robot Vision, Haralick and Shapiro; Introductory Techniques for 3D Computer Vision, Verri and Trucco; and Computer Vision, Shapiro and Stockman.
- I have provided videos of my Computer Vision Lectures on YouTube, which has received close to one million views: https://www.youtube.com/playlist?list=PLd3hlSJsX_Imk_BPmB_H3AQjFKZS9XgZm
- CAP 6412 Advanced Computer Vision https://www.crcv.ucf.edu/courses/cap6412-spring-2019/
- CAP 5415 Computer Vision http://www.cs.ucf.edu/courses/cap6411/cap5415
- CAP 6411 Computer Vision Systems http://www.cs.ucf.edu/courses/cap6411/cap6411/fall02/cap6411_fall02.html
- COT 6505 Numerical Optimization http://www.cs.ucf.edu/courses/cap6411/cot6505/spring03/cot6505_sp03.html
- CAP 3930H Computer Vision Guided Tour of Mathematics
- CAP 6938 Special Topics: Mathematical Tools for Computer Vision
- CAP 4932 Intro Robot Vision
- COT 4110 Numerical Calculus
- COP 3400 Assembly Language
- COP 3402 Systems Concepts and Programming