The 18th European Conference on Computer Vision (ECCV 2024) is a premier biennial research conference in computer vision and machine learning, managed by the European Computer Vision Association (ECVA). It is held in even years and brings together the scientific and industrial communities in these areas. The first ECCV was held in 1990 in Antibes, France, and the conference has since been organized across Europe. Paper proceedings are published by Springer Science+Business Media.
UCF researchers had a record 20 papers accepted to ECCV 2024 (https://eccv.ecva.net/), which will take place in Milan, Italy, from September 29 to October 4, 2024.
The h5-index is the h-index for articles published in the last 5 complete years. According to Google Scholar Metrics, ECCV is ranked 3rd in the Computer Vision and Pattern Recognition subcategory h5-index rankings.
You can access the CRCV Publications Page for enhanced search capabilities.
Chhipa, Prakash Chandra; Chippa, Meenakshi Subhash; De, Kanjar; Saini, Rajkumar; Liwicki, Marcus; Shah, Mubarak
Möbius Transform for Mitigating Perspective Distortions in Representation Learning Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Chhipa2024,
title = {Möbius Transform for Mitigating Perspective Distortions in Representation Learning},
author = {Prakash Chandra Chhipa and Meenakshi Subhash Chippa and Kanjar De and Rajkumar
Saini and Marcus Liwicki and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/MPD_ECCV2024_CameraReady.pdf
https://prakashchhipa.github.io/projects/mpd
https://youtu.be/MKh9NE_XEMY
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/MPD_presentation_ECCV2024_final.pdf},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Perspective distortion (PD) causes unprecedented changes in shape, size, orientation, angles, and other spatial relationships of visual concepts in images. Precisely estimating camera intrinsic and extrinsic parameters is a challenging task that prevents synthesizing perspective distortion. Non-availability of dedicated training data poses a critical barrier to developing robust computer vision methods. Additionally, distortion correction methods make other computer vision tasks a multi-step approach and lack performance. In this work, we propose
mitigating perspective distortion (MPD) by employing a fine-grained parameter control on a specific family of Möbius transform to model real-world distortion without estimating camera intrinsic and extrinsic parameters and without the need for actual distorted data. Also, we present a dedicated perspectively distorted benchmark dataset, ImageNet-PD, to benchmark the robustness of deep learning models against this new dataset. The proposed method outperforms existing benchmarks, ImageNet-E and ImageNet-X. Additionally, it significantly
improves performance on ImageNet-PD while consistently performing on standard data distribution. Notably, our method shows improved performance on three PD-affected real-world applications (crowd counting, fisheye image recognition, and person re-identification) and one PD-affected challenging CV task: object detection. The source code, dataset, and models are available on the project webpage at https://prakashchhipa.github.io/projects/mpd.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
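As a rough, hypothetical illustration of the paper's central ingredient (not the authors' released code), the Python sketch below warps an image with a Möbius transform f(z) = (az + b)/(cz + d) applied to normalized pixel coordinates; the parameter values are arbitrary examples, and MPD's fine-grained parameter control and training pipeline are not reproduced here.

import numpy as np
from scipy.ndimage import map_coordinates

def mobius_warp(img, a=1 + 0j, b=0j, c=0.3 + 0.2j, d=1 + 0j):
    """img: H x W x C array. Returns the image warped by f(z) = (a*z + b) / (c*z + d)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Map pixel coordinates to the complex plane, centered and normalized to roughly [-1, 1].
    z = (xs - w / 2) / (w / 2) + 1j * (ys - h / 2) / (h / 2)
    # Inverse warp: each output pixel pulls from f^{-1}(z) = (d*z - b) / (-c*z + a).
    zi = (d * z - b) / (-c * z + a)
    src_x = zi.real * (w / 2) + w / 2
    src_y = zi.imag * (h / 2) + h / 2
    channels = [map_coordinates(img[..., ch].astype(float), [src_y, src_x], order=1, mode='nearest')
                for ch in range(img.shape[2])]
    return np.stack(channels, axis=-1)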
Kang, Weitai; Liu, Gaowen; Shah, Mubarak; Yan, Yan
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Kang2024,
title = {SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding},
author = {Weitai Kang and Gaowen Liu and Mubarak Shah and Yan Yan},
url = {https://arxiv.org/pdf/2407.03200
},
doi = {https://doi.org/10.48550/arXiv.2407.03200},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box per text-image pair provides sparse supervision signals. Although previous works achieve impressive results, their passive utilization of annotation, i.e., the sole use of the box annotation as regression ground truth, results in suboptimal performance. In this paper, we present SegVG, a novel method that transfers the box-level annotation as Segmentation signals to provide additional pixel-level supervision for Visual Grounding. Specifically, we propose the Multi-layer Multi-task Encoder-Decoder as the target grounding stage, where we learn a regression query and multiple segmentation queries to ground the target by regression and segmentation of the box in each decoding layer, respectively. This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation. Moreover, as the backbones are typically initialized by pretrained parameters learned from unimodal tasks and the queries for both regression and segmentation are static learnable embeddings, a domain discrepancy remains among these three types of features, which impairs subsequent target grounding. To mitigate this discrepancy, we introduce the Triple Alignment module, where the query, text, and vision tokens are triangularly updated to share the same space by a triple attention mechanism. Extensive experiments on five widely used datasets validate our state-of-the-art (SOTA) performance.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
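A minimal sketch of the core idea (the box annotation reused as a pixel-level target), assuming normalized (cx, cy, w, h) boxes; the actual multi-layer multi-task decoder and Triple Alignment module are not shown.

import torch

def box_to_mask(box_cxcywh, height, width):
    """Convert one normalized (cx, cy, w, h) box into a binary segmentation target."""
    cx, cy, w, h = box_cxcywh
    x0, x1 = int((cx - w / 2) * width), int((cx + w / 2) * width)
    y0, y1 = int((cy - h / 2) * height), int((cy + h / 2) * height)
    mask = torch.zeros(height, width)
    mask[max(y0, 0):min(y1, height), max(x0, 0):min(x1, width)] = 1.0
    return mask  # usable with a per-pixel loss alongside the usual box regression loss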
Swetha, Sirnam; Yang, Jinyu; Neiman, Tal; Rizve, Mamshad Nayeem; Tran, Son; Yao, Benjamin; Chilimbi, Trishul; Shah, Mubarak
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Swetha2024,
title = {X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs},
author = {Sirnam Swetha and Jinyu Yang and Tal Neiman and Mamshad Nayeem Rizve and Son Tran and Benjamin Yao and Trishul Chilimbi and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2407.13851v1.pdf
https://arxiv.org/abs/2407.13851
https://swetha5.github.io/XFormer/},
doi = {https://doi.org/10.48550/arXiv.2407.13851},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field involves the utilization of a vision encoder derived from vision-language contrastive learning (CL), showing expertise in capturing overall representations while facing difficulties in capturing detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency and detailed visual representations, obtained through masked image modeling (MIM), with semantically-enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former which is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative interaction mechanism. Specifically, X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM. To demonstrate the effectiveness of our approach, we assess its performance on tasks demanding detailed visual understanding. Extensive evaluations indicate that X-Former excels in visual reasoning tasks involving both structural and semantic categories in the GQA dataset. Assessment on fine-grained visual perception benchmark further confirms its superior capabilities in visual understanding.
},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
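The toy PyTorch module below is only meant to convey the interaction pattern described in the abstract: learnable queries cross-attending to tokens from a frozen CL encoder (e.g., CLIP-ViT) and a frozen MIM encoder (e.g., MAE-ViT). Dimensions and layer counts are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class ToyXFormer(nn.Module):
    def __init__(self, dim=768, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.attn_cl = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_mim = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, clip_tokens, mae_tokens):
        # clip_tokens: (B, N1, dim) from the frozen CL encoder; mae_tokens: (B, N2, dim) from the frozen MIM encoder.
        q = self.queries.expand(clip_tokens.size(0), -1, -1)
        q = q + self.attn_cl(q, clip_tokens, clip_tokens)[0]   # semantic, low-frequency cues
        q = q + self.attn_mim(q, mae_tokens, mae_tokens)[0]    # detailed, high-frequency cues
        return q + self.ffn(q)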
Dave, Ishan Rajendrakumar; Rizve, Mamshad Nayeem; Shah, Mubarak
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Dave2024,
title = {FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition},
author = {Ishan Rajendrakumar Dave and Mamshad Nayeem Rizve and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/finepsuedo_eccv24_dave.pdf
https://daveishan.github.io/finepsuedo-webpage/
https://youtu.be/bWOd8_JpjQs?si=WWRDhdg5ADWL0uwB},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Real-life applications of action recognition often require a
fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annotate, existing semi-supervised action recognition has mainly focused on coarse-grained action recognition. Since fine-grained actions are more challenging due to the absence of scene bias, classifying these actions requires an understanding of action-phases. Hence, existing coarse-grained semi-supervised methods
do not work effectively. In this work, we for the first time thoroughly investigate semi-supervised fine-grained action recognition (FGAR). We observe that alignment distances like dynamic time warping (DTW) provide a suitable action-phase-aware measure for comparing fine-grained actions, a concept previously unexploited in FGAR. However, since regular DTW distance is pairwise and assumes strict alignment between pairs, it is not directly suitable for classifying fine-grained actions. To
utilize such alignment distances in a limited-label setting, we propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs. Our learnable alignability score provides a better phase-aware measure, which we use to refine the pseudo-labels of the primary video encoder. Our collaborative pseudo-labeling-based framework, FinePseudo, significantly outperforms prior methods on four fine-grained action recognition datasets: Diving48, FineGym99, FineGym288, and FineDiving, and shows improvement on existing coarse-grained datasets: Kinetics400 and Something-SomethingV2. We also demonstrate the robustness of our collaborative pseudo-labeling in handling novel unlabeled classes in open-world semi-supervised setups.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
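For readers unfamiliar with alignment distances, the sketch below computes a plain dynamic time warping (DTW) cost between two per-frame feature sequences, the kind of phase-aware measure FinePseudo builds on; the learnable alignability score and pseudo-label refinement are not reproduced here.

import numpy as np

def dtw_distance(x, y):
    """x: (Tx, D), y: (Ty, D) per-frame features; returns the accumulated alignment cost."""
    Tx, Ty = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # pairwise frame distances
    acc = np.full((Tx + 1, Ty + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[Tx, Ty]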
Gupta, Rohit; Rizve, Mamshad Nayeem; Tawari, Ashish; Unnikrishnan, Jayakrishnan; Tran, Son; Shah, Mubarak; Yao, Benjamin; Chilimbi, Trishul
Open Vocabulary Multi-Label Video Classification Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Gupta2024,
title = {Open Vocabulary Multi-Label Video Classification},
author = {Rohit Gupta and Mamshad Nayeem Rizve and Ashish Tawari and Jayakrishnan Unnikrishnan and Son Tran and Mubarak Shah and Benjamin Yao and Trishul Chilimbi},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/OVMLVidCLS_ECCV_2024_CameraReady-2.pdf
https://www.crcv.ucf.edu/wp-content/uploads/2018/11/OVMLVidCLS_ECCV_2024_Supplementary.pdf
https://arxiv.org/html/2407.09073v1#S1},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
abstract = {Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding, which requires the ability to simultaneously recognize multiple actions and entities, e.g., objects, in the video in an open vocabulary setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
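A minimal sketch of the multi-label, open-vocabulary scoring step: a per-class sigmoid over cosine similarities between a video embedding and class-name text embeddings. The tensors, temperature, and encoders are placeholders; the LLM-generated soft attributes and the temporal modeling module from the paper are not shown.

import torch
import torch.nn.functional as F

def multilabel_scores(video_emb, text_embs, temperature=0.07):
    """video_emb: (B, D) video features, text_embs: (C, D) class-name text features.
    Returns independent per-class probabilities of shape (B, C)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    logits = v @ t.t() / temperature
    # Sigmoid instead of softmax: several labels can be active for the same video.
    return torch.sigmoid(logits)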
Dave, Ishan Rajendrakumar; Heilbron, Fabian Caba; Shah, Mubarak; Jenni, Simon
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets Conference
The 18th European Conference on Computer Vision ECCV 2024, Oral (Top 3%), 2024.
@conference{Dave2024b,
title = {Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets},
author = {Ishan Rajendrakumar Dave and Fabian Caba Heilbron and Mubarak Shah and Simon Jenni},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/avr_eccv24_dave.pdf
https://daveishan.github.io/avr-webpage/
https://youtu.be/6euQwz7XdQk?si=v12dQH4e7UrTUrIU},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024, Oral (Top 3%)},
abstract = {Temporal video alignment aims to synchronize the key events
like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. },
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
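As a small illustration of the cycle-consistency idea used to assess alignability, the sketch below measures how often nearest-neighbor matching from sequence A to B and back returns to the starting frame; DRAQ itself and the frame-level feature design are not reproduced here.

import numpy as np

def cycle_consistency(a, b):
    """a: (Ta, D), b: (Tb, D) L2-normalized frame features -> fraction of cycle-consistent frames."""
    nn_ab = (a @ b.T).argmax(axis=1)   # each frame in a -> nearest frame in b
    nn_ba = (b @ a.T).argmax(axis=1)   # each frame in b -> nearest frame in a
    return float(np.mean(nn_ba[nn_ab] == np.arange(len(a))))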
Yang, Peiyu; Akhtar, Naveed; Shah, Mubarak; Mian, Ajmal
Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Yang2024,
title = {Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density},
author = {Peiyu Yang and Naveed Akhtar and Mubarak Shah and Ajmal Mian},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/main_paper-1.pdf
https://arxiv.org/pdf/2407.04370
https://github.com/ypeiyu/input_density_reg},
year = {2024},
date = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Tai, Kai Sheng; Chen, Sirius; Shukla, Satya Narayan; Yu, Hanchao; Torr, Philip; Tian, Taipeng; Lim, Ser-Nam
uCAP: An Unsupervised Prompting Method for Vision-Language Models Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Tai2024,
title = {uCAP: An Unsupervised Prompting Method for Vision-Language Models},
author = {Kai Sheng Tai and Sirius Chen and Satya Narayan Shukla and Hanchao Yu and Philip Torr and Taipeng Tian and Ser-Nam Lim},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {This paper addresses a significant limitation that prevents Contrastive Language-Image Pretrained Models (CLIP) from achieving optimal performance on downstream image classification tasks. The key problem with CLIP-style zero-shot classification is that it requires domain-specific context in the form of prompts to better align the class descriptions to the downstream data distribution. In particular, prompts for vision-language models are domain-level texts (e.g., "a centered satellite image of ...") which, together with the class names, are fed into the text encoder to provide more context for the downstream dataset. These prompts are typically manually tuned, which is time-consuming and often sub-optimal. To overcome this bottleneck, this paper proposes uCAP, a method to automatically learn domain-specific prompts/contexts using only unlabeled in-domain images. We achieve this by modeling the generation of images given the class names and a domain-specific prompt with an unsupervised likelihood distribution, and then performing inference of the prompts. We validate the proposed method across various models and datasets, showing that uCAP consistently outperforms manually tuned prompts and related baselines on the evaluated datasets: ImageNet, CIFAR-10, CIFAR-100, OxfordPets (up to 2%), SUN397 (up to 5%), and Caltech101 (up to 3%).},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
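The following is a deliberately crude, hypothetical stand-in (not uCAP's actual inference procedure) for the idea of scoring candidate domain prompts using only unlabeled images: each prompt template is ranked by an average log-sum-exp of class similarities, a rough marginal-likelihood proxy. The function names and arguments are placeholders.

import torch

def rank_prompts(image_embs, text_encoder_fn, class_names, prompt_templates, temperature=0.01):
    """image_embs: (N, D) normalized image features from unlabeled in-domain images.
    text_encoder_fn: maps a list of strings to (C, D) normalized text features.
    Returns the candidate prompt templates sorted best-first."""
    scores = []
    for template in prompt_templates:
        text_embs = text_encoder_fn([template.format(c) for c in class_names])
        logits = image_embs @ text_embs.t() / temperature          # (N, C)
        # Marginalize over classes for each unlabeled image, then average.
        scores.append(torch.logsumexp(logits, dim=1).mean().item())
    order = sorted(range(len(prompt_templates)), key=lambda i: -scores[i])
    return [prompt_templates[i] for i in order]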
Chen, Hao; Xie, Saining; Lim, Ser-Nam; Shrivastava, Abhinav
Fast Encoding and Decoding for Implicit Video Representation Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Chen2024b,
title = {Fast Encoding and Decoding for Implicit Video Representation},
author = {Hao Chen and Saining Xie and Ser-Nam Lim and Abhinav Shrivastava},
year = {2024},
date = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Jang, Young Kyun; Huynh, Dat; Shah, Ashish; Chen, Wen-Kai; Lim, Ser-Nam
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Retrieval Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Jang2024,
title = {Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Retrieval},
author = {Young Kyun Jang and Dat Huynh and Ashish Shah and Wen-Kai Chen and Ser-Nam Lim},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Sun, Guangyu; Mendieta, Matias; Dutta, Aritra; Li, Xin; Chen, Chen
Towards Multi-modal Transformers in Federated Learning Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Sun2024,
title = {Towards Multi-modal Transformers in Federated Learning},
author = {Guangyu Sun and Matias Mendieta and Aritra Dutta and Xin Li and Chen Chen},
url = {https://arxiv.org/pdf/2404.12467.pdf
https://github.com/imguangyu/FedCola},
year = {2024},
date = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Li, Ming; Yang, Taojiannan; Kuang, Huafeng; Wu, Jie; Wang, Zhaoning; Xiao, Xuefeng; Chen, Chen
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Li2024,
title = {ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback},
author = {Ming Li and Taojiannan Yang and Huafeng Kuang and Jie Wu and Zhaoning Wang and Xuefeng Xiao and Chen Chen},
url = {https://arxiv.org/pdf/2404.07987.pdf
https://liming-ai.github.io/ControlNet_Plus_Plus/
https://github.com/liming-ai/ControlNet_Plus_Plus
https://huggingface.co/spaces/limingcv/ControlNet-Plus-Plus},
year = {2024},
date = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and extracted condition. A straightforward implementation would be generating images from random noises and then calculating the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet by 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
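A hedged sketch of the efficient reward strategy described above: disturb a real image with noise, take a single denoising step, re-extract the condition with a frozen reward model, and penalize disagreement with the input condition. All callables here are placeholders for the actual diffusion and reward networks, not a real diffusers API.

import torch
import torch.nn.functional as F

def consistency_reward_loss(image, condition, add_noise, denoise_one_step, extract_condition):
    """image, condition: tensors; the three callables are user-supplied model wrappers."""
    noisy, t = add_noise(image)                       # disturb the input instead of sampling from scratch
    denoised = denoise_one_step(noisy, t, condition)  # single-step estimate keeps time and memory low
    pred_condition = extract_condition(denoised)      # e.g., a frozen segmentation or depth network
    return F.mse_loss(pred_condition, condition)      # pixel-level cycle-consistency penalty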
Pinyoanuntapong, Ekkasit; Saleem, Muhammad Usama; Wang, Pu; Lee, Minwoo; Das, Srijan; Chen, Chen
BAMM: Bidirectional Autoregressive Motion Model Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{nokey,
title = {BAMM: Bidirectional Autoregressive Motion Model},
author = {Ekkasit Pinyoanuntapong and Muhammad Usama Saleem and Pu Wang and Minwoo Lee and Srijan Das and Chen Chen},
url = {https://arxiv.org/pdf/2403.19435.pdf
https://exitudio.github.io/BAMM-page/},
year = {2024},
date = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Generating human motion from text has been dominated by denoising motion models either through diffusion or a generative masking process. However, these models face great limitations in usability by requiring prior knowledge of the motion length. Conversely, autoregressive motion models address this limitation by adaptively predicting motion endpoints, at the cost of degraded generation quality and editing capabilities. To address these challenges, we propose Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and (2) a masked self-attention transformer that autoregressively predicts randomly masked tokens via a hybrid attention masking strategy. By unifying generative masked modeling and autoregressive modeling, BAMM captures rich and bidirectional dependencies among motion tokens, while learning the probabilistic mapping from textual inputs to motion outputs with dynamically-adjusted motion sequence length. This feature enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that BAMM surpasses current state-of-the-art methods in both qualitative and quantitative measures.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
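The toy module below shows only the generative-masking half of what BAMM unifies with autoregression: randomly mask discrete motion tokens and train a transformer to predict them. Vocabulary size, layer dimensions, and the [MASK] id are illustrative assumptions; the motion tokenizer and hybrid attention masking are not reproduced.

import torch
import torch.nn as nn

class MaskedMotionModel(nn.Module):
    def __init__(self, vocab=512, dim=256, mask_id=512):
        super().__init__()
        self.mask_id = mask_id
        self.embed = nn.Embedding(vocab + 1, dim)  # +1 slot for the [MASK] token
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, mask_ratio=0.5):
        """tokens: (B, T) long tensor of motion token ids; assumes at least one token gets masked."""
        masked = tokens.clone()
        drop = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
        masked[drop] = self.mask_id
        logits = self.head(self.encoder(self.embed(masked)))
        # Predict the original ids only at the masked positions.
        return nn.functional.cross_entropy(logits[drop], tokens[drop])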
Khalid, Umar; Iqbal, Hasan; Farooq, Azib; Hua, Jing; Chen, Chen
3DEgo: 3D Editing on the Go! Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Khalid2024,
title = {3DEgo: 3D Editing on the Go!},
author = {Umar Khalid and Hasan Iqbal and Azib Farooq and Jing Hua and Chen Chen},
url = {https://arxiv.org/pdf/2407.10102
https://3dego.github.io/},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {We introduce 3DEgo to address a novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Conventional methods construct a text-conditioned 3D scene through a three-stage process, involving pose estimation using Structure-from-Motion (SfM) libraries like COLMAP, initializing the 3D model with unedited images, and iteratively updating the dataset with edited images to achieve a 3D scene with text fidelity. Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by overcoming the reliance on COLMAP and eliminating the cost of model initialization. We apply a diffusion model to edit video frames prior to 3D scene creation by incorporating our designed noise blender module for enhancing multi-view editing consistency, a step that does not require additional training or fine-tuning of T2I diffusion models. 3DEgo utilizes 3D Gaussian Splatting to create 3D scenes from the multi-view consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data. 3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources, as validated by extensive evaluations on six datasets, including our own prepared GS25 dataset.
},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Khalid, Umar; Iqbal, Hasan; Tayyab, Muhammad; Karim, Md Nazmul; Hua, Jing; Chen, Chen
LatentEditor: Text Driven Local Editing of 3D Scenes Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Khalid2024b,
title = {LatentEditor: Text Driven Local Editing of 3D Scenes},
author = {Umar Khalid and Hasan Iqbal and Muhammad Tayyab and Md Nazmul Karim and Jing Hua and Chen Chen},
url = {https://arxiv.org/pdf/2312.09313.pdf
https://latenteditor.github.io/},
year = {2024},
date = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {While neural fields have made significant strides in view synthesis and scene reconstruction, editing them poses a formidable challenge due to their implicit encoding of geometry and texture information from multi-view inputs. In this paper, we introduce LatentEditor, an innovative framework designed to empower users with the ability to perform precise and locally controlled editing of neural fields using text prompts. Leveraging denoising diffusion models, we successfully embed real-world scenes into the latent space, resulting in a faster and more adaptable NeRF backbone for editing compared to traditional methods. To enhance editing precision, we introduce a delta score to calculate the 2D mask in the latent space that serves as a guide for local modifications while preserving irrelevant regions. Our novel pixel-level scoring approach harnesses the power of InstructPix2Pix (IP2P) to discern the disparity between IP2P conditional and unconditional noise predictions in the latent space. The edited latents conditioned on the 2D masks are then iteratively updated in the training set to achieve 3D local editing. Our approach achieves faster editing speeds and superior output quality compared to existing 3D editing models, bridging the gap between textual instructions and high-quality 3D scene editing in latent space.
},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
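A minimal, hypothetical sketch of the delta-score idea: turn the gap between conditional and unconditional noise predictions into a soft 2D editing mask in latent space. The normalization and threshold are illustrative choices, not the paper's settings.

import torch

def delta_score_mask(eps_cond, eps_uncond, threshold=0.5):
    """eps_cond, eps_uncond: (B, C, H, W) latent noise predictions -> (B, 1, H, W) binary mask."""
    delta = (eps_cond - eps_uncond).abs().mean(dim=1, keepdim=True)
    lo = delta.amin(dim=(2, 3), keepdim=True)
    hi = delta.amax(dim=(2, 3), keepdim=True)
    delta = (delta - lo) / (hi - lo + 1e-8)      # normalize per sample to [0, 1]
    return (delta > threshold).float()           # regions the text edit actually affects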
Karim, Md Nazmul; Iqbal, Hasan; Khalid, Umar; Chen, Chen; Hua, Jing
Free-Editor: Zero-shot Text-driven 3D Scene Editing Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Karim2024,
title = {Free-Editor: Zero-shot Text-driven 3D Scene Editing},
author = {Md Nazmul Karim and Hasan Iqbal and Umar Khalid and Chen Chen and Jing Hua
},
url = {https://arxiv.org/pdf/2312.13663.pdf
https://free-editor.github.io/},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Text-to-Image (T2I) diffusion models have gained popularity recently due to their multipurpose and easy-to-use nature, e.g. image and video generation as well as editing. However, training a diffusion model specifically for 3D scene editing is not straightforward due to the lack of large-scale datasets. To date, editing 3D scenes requires either re-training the model to adapt to various 3D edited scenes or design-specific methods for each special editing type. Furthermore, state-of-the-art (SOTA) methods require multiple synchronized edited images from the same scene to facilitate the scene editing. Due to the current limitations of T2I models, it is very challenging to apply consistent editing effects to multiple images, i.e. multi-view inconsistency in editing. This in turn compromises the desired 3D scene editing performance if these images are used. In our work, we propose a novel training-free 3D scene editing technique, FREE-EDITOR, which allows users to edit 3D scenes without further re-training the model during test time. Our proposed method successfully avoids the multi-view style inconsistency issue in SOTA methods with the help of a "single-view editing" scheme. Specifically, we show that editing a particular 3D scene can be performed by only modifying a single view. To this end, we introduce an Edit Transformer that enforces intra-view consistency and inter-view style transfer by utilizing self- and cross-attention, respectively. Since it is no longer required to re-train the model and edit every view in a scene, the editing time, as well as memory resources, are reduced significantly, e.g., the runtime being ~20× faster than SOTA. We have conducted extensive experiments on a wide range of benchmark datasets and achieve diverse editing capabilities with our proposed technique.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Fang, Xiang; Xiong, Zeyu; Fang, Wanlong; Qu, Xiaoye; Chen, Chen; Dong, Jianfeng; Tang, Keke; Zhou, Pan; Cheng, Yu; Liu, Daizong
Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Fang2024,
title = {Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective},
author = {Xiang Fang and Zeyu Xiong and Wanlong Fang and Xiaoye Qu and Chen Chen and Jianfeng Dong and Keke Tang and Pan Zhou and Yu Cheng and Daizong Liu},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/ECCV2024_Grounding_camera.pdf
https://eccv2024.ecva.net/virtual/2024/poster/1833},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment candidate selection pipeline that utilizes contrastive learning and reconstruction paradigm for scoring the pre-defined moments. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: (1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. (2) Complex moment candidates: the performance of these methods severely relies on the quality of moment candidates, which are also time-consuming and complicated for selection. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each frame-word pair with diverse granularity and flexible combination for fine-grained cross-modal interaction. Specifically, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondence between frames and words. At last, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for fine-grained moment boundary grounding. Experiments show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Pillai, Manu S; Rizve, Mamshad Nayeem; Shah, Mubarak
GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Pillai2024,
title = {GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers},
author = {Manu S Pillai and Mamshad Nayeem Rizve and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/07875-supp.pdf
https://arxiv.org/abs/2408.02840
https://github.com/manupillai308/GAReT},
doi = {https://doi.org/10.48550/arXiv.2408.02840},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods use camera and odometry data, typically absent in real-world scenarios. They utilize multiple adjacent frames and various encoders for feature extraction, resulting in high computational costs. Moreover, these approaches independently predict each street-view frame's location, resulting in temporally inconsistent GPS trajectories. To address these challenges, in this work, we propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data. We introduce GeoAdapter, a transformer-adapter module designed to efficiently aggregate image-level representations and adapt them for video inputs. Specifically, we train a transformer encoder on video frames and aerial images, then freeze the encoder to optimize the GeoAdapter module to obtain video-level representation. To address temporally inconsistent trajectories, we introduce TransRetriever, an encoder-decoder transformer model that predicts GPS locations of street-view frames by encoding top-k nearest neighbor predictions per frame and auto-regressively decoding the best neighbor based on the previous frame's predictions. Our method's effectiveness is validated through extensive experiments, demonstrating state-of-the-art performance on benchmark datasets.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
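As a rough stand-in for TransRetriever (which is a learned encoder-decoder, not the hand-made cost below), this sketch picks one of the top-k GPS candidates per frame with dynamic programming so that the decoded trajectory stays temporally consistent; the smoothness weight and cost definitions are illustrative assumptions.

import numpy as np

def decode_trajectory(candidates, match_cost, smooth_weight=1.0):
    """candidates: (T, K, 2) top-k candidate GPS coordinates per frame;
    match_cost: (T, K) retrieval costs. Returns a (T, 2) temporally consistent trajectory."""
    T, K, _ = candidates.shape
    dp = match_cost[0].astype(float)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # step[i, j]: distance between candidate i at frame t and candidate j at frame t-1
        step = np.linalg.norm(candidates[t][:, None, :] - candidates[t - 1][None, :, :], axis=-1)
        total = dp[None, :] + smooth_weight * step          # (K, K) accumulated costs
        back[t] = total.argmin(axis=1)
        dp = match_cost[t] + total.min(axis=1)
    path = [int(dp.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path = path[::-1]
    return candidates[np.arange(T), path]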
Kulkarni, Parth Parag; Nayak, Gaurav Kumar; Shah, Mubarak
CityGuessr: City-Level Video Geo-Localization on a Global Scale Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Kulkarni2024,
title = {CityGuessr: City-Level Video Geo-Localization on a Global Scale},
author = {Parth Parag Kulkarni and Gaurav Kumar Nayak and Mubarak Shah},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/CityGuessr.pdf
https://parthpk.github.io/cityguessr-webpage/
},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Video geolocalization is a crucial problem in current times. Given just a video, ascertaining where it was captured from can have a plethora of advantages. The problem of worldwide geolocalization has been tackled before, but only using the image modality. Its video counterpart remains relatively unexplored. Meanwhile, video geolocalization has also garnered some attention in the recent past, but the existing methods are all restricted to specific regions. This motivates us to explore the problem of video geolocalization at a global scale. Hence, we propose a novel problem of worldwide video geolocalization with the objective of hierarchically predicting the correct city, state/province, country, and continent, given a video. However, no large-scale video datasets with extensive worldwide coverage exist to train models for solving this problem. To this end, we introduce a new dataset, "CityGuessr68k", comprising 68,269 videos from 166 cities all over the world. We also propose a novel baseline approach to this problem, by designing a transformer-based architecture comprising an elegant "Self-Cross Attention" module for incorporating scenes as well as a "TextLabel Alignment" strategy for distilling knowledge from text labels in feature space. To further enhance our location prediction, we also utilize soft-scene labels. Finally, we demonstrate the performance of our method on our new dataset as well as Mapillary (MSLS) [38].},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
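A minimal sketch of the hierarchical prediction set-up described above: one classification head per level (city, state/province, country, continent) on a shared video feature, trained with a summed cross-entropy. Head sizes other than the 166 cities are illustrative assumptions, and the Self-Cross Attention and TextLabel Alignment components are not shown.

import torch
import torch.nn as nn

class HierarchicalGeoHead(nn.Module):
    def __init__(self, dim=768, n_city=166, n_state=100, n_country=50, n_continent=6):
        super().__init__()
        self.heads = nn.ModuleDict({
            'city': nn.Linear(dim, n_city), 'state': nn.Linear(dim, n_state),
            'country': nn.Linear(dim, n_country), 'continent': nn.Linear(dim, n_continent)})

    def forward(self, feat, labels=None):
        """feat: (B, dim) shared video feature; labels: optional dict of (B,) long tensors per level."""
        logits = {k: head(feat) for k, head in self.heads.items()}
        if labels is None:
            return logits
        loss = sum(nn.functional.cross_entropy(logits[k], labels[k]) for k in logits)
        return logits, loss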
Karim, Nazmul; Arafat, Abdullah Al; Khalid, Umar; Guo, Zhishan; Rahnavard, Nazanin
Augmented Neural Fine-tuning for Efficient Backdoor Purification Conference
The 18th European Conference on Computer Vision ECCV 2024, 2024.
@conference{Karim2024b,
title = {Augmented Neural Fine-tuning for Efficient Backdoor Purification},
author = {Nazmul Karim and Abdullah Al Arafat and Umar Khalid and Zhishan Guo and Nazanin Rahnavard},
url = {https://www.crcv.ucf.edu/wp-content/uploads/2018/11/2024_ECCV_NFT.pdf
https://arxiv.org/pdf/2407.10052
https://github.com/nazmul-karim170/NFT-Augmented-Backdoor-Purification},
year = {2024},
date = {2024-09-29},
urldate = {2024-09-29},
publisher = {The 18th European Conference on Computer Vision ECCV 2024},
abstract = {Recent studies have revealed the vulnerability of deep neural networks (DNNs) to various backdoor attacks, where the behavior of DNNs can be compromised by utilizing certain types of triggers or poisoning mechanisms. State-of-the-art (SOTA) defenses employ overly sophisticated mechanisms that require either a computationally expensive adversarial search module for reverse-engineering the trigger distribution or an over-sensitive hyper-parameter selection module. Moreover, they offer sub-par performance in challenging scenarios, e.g., limited validation data and strong attacks. In this paper, we propose Neural mask Fine-Tuning (NFT), which aims to optimally re-organize the neuron activities in a way that the effect of the backdoor is removed. Utilizing a simple data augmentation like MixUp, NFT relaxes the trigger synthesis process and eliminates the requirement of the adversarial search module. Our study further reveals that direct weight fine-tuning under limited validation data results in poor post-purification clean test accuracy, primarily due to overfitting. To overcome this, we propose to fine-tune neural masks instead of model weights. In addition, a mask regularizer
has been devised to further mitigate the model drift during the purification process. The distinct characteristics of NFT render it highly efficient in both runtime and sample usage, as it can remove the backdoor even when a single sample is available from each class. We validate the effectiveness of NFT through extensive experiments covering the tasks of image classification, object detection, video action recognition, 3D point cloud, and natural language processing. We evaluate our method against 14 different attacks (LIRA, WaNet, etc.) on 11 benchmark data sets (ImageNet, UCF101, Pascal VOC, ModelNet, OpenSubtitles2012, etc.). },
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
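A hedged sketch of the core mechanic: keep the (possibly backdoored) weights frozen and fine-tune only a multiplicative neural mask, with a regularizer that keeps the mask close to identity. The MixUp augmentation and the full purification schedule are omitted, and the wrapper below is an illustrative construction rather than the authors' implementation.

import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)                      # frozen, possibly backdoored weights
        self.mask = nn.Parameter(torch.ones_like(linear.weight))

    def forward(self, x):
        # Only the mask receives gradients during purification fine-tuning.
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)

def mask_regularizer(model, weight=1e-2):
    # Encourages masks to stay near 1 so clean accuracy does not drift during purification.
    return weight * sum((m.mask - 1).abs().mean()
                        for m in model.modules() if isinstance(m, MaskedLinear))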