UCF researchers have 20 papers accepted to the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), which will be held at the San Diego Convention Center from Tuesday, December 2, through Sunday, December 7.
The conference was founded in 1987 and is now a multi-track interdisciplinary annual meeting that includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. Alongside the conference are a professional exposition focusing on machine learning in practice, a series of tutorials, and topical workshops that provide a less formal setting for the exchange of ideas.
The h5-index is the h-index for articles published in the last five complete years. According to Google Scholar Metrics, NeurIPS ranks 7th overall and 1st in the Artificial Intelligence subcategory by h5-index.
You can access the CRCV Publications Page and Aii Publications Page for enhanced search capabilities.
Hu, Zixuan; Shen, Li; Wang, Zhenyi; Wei, Yongxian; Tao, Dacheng
Adaptive Defense against Harmful Fine-Tuning via Bayesian Data Scheduler Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Hu2025b,
title = {Adaptive Defense against Harmful Fine-Tuning via Bayesian Data Scheduler},
author = {Zixuan Hu and Li Shen and Zhenyi Wang and Yongxian Wei and Dacheng Tao},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of performing attack simulation due to lacking prior knowledge about potential attack data, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
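The scheduling idea in the abstract above — weighting each fine-tuning example by a safety attribute drawn from a learned per-example posterior — can be sketched in a few lines of Python. The module names, shapes, and Beta parameterization below are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class AmortizedScheduler(nn.Module):
    """Maps an example embedding to the parameters of a per-example Beta posterior (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, emb):
        alpha, beta = nn.functional.softplus(self.net(emb)).unbind(-1)
        return alpha + 1e-3, beta + 1e-3

def weighted_finetune_loss(per_example_loss, emb, scheduler):
    # Sample a "safety" score in (0, 1) for each example and down-weight
    # likely-harmful points so they barely influence the update.
    alpha, beta = scheduler(emb)
    safety = torch.distributions.Beta(alpha, beta).rsample()
    return (safety * per_example_loss).mean()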
Tola, Astrit; Taiwo, Funmilola Mary; Akcora, Cuneyt Gurcan; Coskunuzer, Baris
TopER: Topological Embeddings in Graph Representation Learning Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Tola2025,
title = {TopER: Topological Embeddings in Graph Representation Learning},
author = {Astrit Tola and Funmilola Mary Taiwo and Cuneyt Gurcan Akcora and Baris Coskunuzer},
url = {https://arxiv.org/abs/2410.01778},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Graph embeddings play a critical role in graph representation learning, allowing machine learning models to explore and interpret graph-structured data. However, existing methods often rely on opaque, high-dimensional embeddings, limiting interpretability and practical visualization.
In this work, we introduce Topological Evolution Rate (TopER), a novel, low-dimensional embedding approach grounded in topological data analysis. TopER simplifies a key topological approach, Persistent Homology, by calculating the evolution rate of graph substructures, resulting in intuitive and interpretable visualizations of graph data. This approach not only enhances the exploration of graph datasets but also delivers competitive performance in graph clustering and classification tasks. Our TopER-based models achieve or surpass state-of-the-art results across molecular, biological, and social network datasets in tasks such as classification, clustering, and visualization.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
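As a rough illustration of the embedding idea summarized above, the sketch below sweeps a simple degree-based filtration over a graph, records how substructure counts grow, and summarizes the growth with a fitted line. The filtration choice and the (intercept, slope) summary are assumptions made for illustration, not the paper's exact construction:

import numpy as np
import networkx as nx

def toper_style_embedding(G, thresholds=None):
    # Sweep a sublevel filtration by vertex degree, recording how many
    # vertices and edges the induced subgraph contains at each threshold.
    degrees = dict(G.degree())
    if thresholds is None:
        thresholds = sorted(set(degrees.values()))
    sizes = []
    for t in thresholds:
        nodes = [v for v, d in degrees.items() if d <= t]
        H = G.subgraph(nodes)
        sizes.append((H.number_of_nodes(), H.number_of_edges()))
    x, y = np.array(sizes, dtype=float).T
    slope, intercept = np.polyfit(x, y, 1)  # "evolution rate" of edges vs. vertices
    return intercept, slope                 # a 2D, plot-friendly graph embedding

print(toper_style_embedding(nx.karate_club_graph()))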
Shamsi, Kiarash; Ngo, Tran Gia Bao; Shirzadkhani, Razieh; Huang, Shenyang; Poursafaei, Farimah; Azad, Poupak; Rabbany, Reihaneh; Coskunuzer, Baris; Rabusseau, Guillaume; Akcora, Cuneyt Gurcan
MiNT: Multi-Network Transfer Benchmark for Temporal Graph Learning Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Shamsi2025,
title = {MiNT: Multi-Network Transfer Benchmark for Temporal Graph Learning},
author = {Kiarash Shamsi and Tran Gia Bao Ngo and Razieh Shirzadkhani and Shenyang Huang and Farimah Poursafaei and Poupak Azad and Reihaneh Rabbany and Baris Coskunuzer and Guillaume Rabusseau and Cuneyt Gurcan Akcora},
url = {https://neurips.cc/virtual/2025/poster/121574
https://github.com/benjaminnNgo/ScalingTGNs},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Temporal Graph Learning (TGL) aims to discover patterns in evolving networks or temporal graphs and leverage these patterns to predict future interactions. However, most existing research focuses on learning from a single network in isolation, leaving the challenges of within-domain and cross-domain generalization largely unaddressed. In this study, we introduce a new benchmark of 84 real-world temporal transaction networks and propose Temporal Multi-network Transfer (MiNT), a pre-training framework designed to capture transferable temporal dynamics across diverse networks. We train MiNT models on up to 64 transaction networks and evaluate their generalization ability on 20 held-out, unseen networks. Our results show that MiNT consistently outperforms individually trained models, revealing a strong relation between the number of pre-training networks and transfer performance. These findings highlight scaling trends in temporal graph learning and underscore the importance of network diversity in improving generalization. This work establishes the first large-scale benchmark for studying transferability in TGL and lays the groundwork for developing Temporal Graph Foundation Models. Our code is available at https://github.com/benjaminnNgo/ScalingTGNs.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Lyu, Zonglin; Li, Ming; Liu, Xinxin; Chen, Chen
CPO: Condition Preference Optimization for Controllable Image Generation Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Lyu2025b,
title = {CPO: Condition Preference Optimization for Controllable Image Generation},
author = {Zonglin Lyu and Ming Li and Xinxin Liu and Chen Chen},
url = {https://neurips.cc/virtual/2025/poster/117815},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images over less controllable ones. However, due to uncertainty in generative models, it is difficult to ensure that win-lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals and train the model to prefer the winning condition. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types, with substantial error rate reductions in segmentation, human pose, and edge and depth maps. Here, the error rate is defined as the difference between evaluated controllability and oracle results.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
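For context on the preference-learning step the abstract describes, the snippet below sketches a generic DPO-style objective applied to winning versus losing control conditions. The log-likelihood inputs and reference-model terms are placeholders (diffusion models would substitute a denoising-loss surrogate); this is a hedged sketch, not the paper's exact CPO formulation:

import torch.nn.functional as F

def condition_preference_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO-style loss where the "win"/"lose" pair is a control condition, not a generated image.

    logp_*: (batch,) log-likelihoods of the same image under the winning/losing condition,
    for the trained model and a frozen reference model (all placeholders here).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()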
Zhang, Yancheng; Sun, Guangyu; Chen, Chen
EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Zhang2025d,
title = {EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis},
author = {Yancheng Zhang and Guangyu Sun and Chen Chen},
url = {https://neurips.cc/virtual/2025/poster/120173},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Qu, Huaizhi; Choi, Inyoung; Tan, Zhen; Wang, Song; Yun, Sukwon; Long, Qi; Siddiqui, Faizan; Lee, Kwonjoon; Chen, Tianlong
BetaConform: Efficient MAP Estimation of LLM Ensemble Judgment Performance with Prior Transfer Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Qu2025,
title = {BetaConform: Efficient MAP Estimation of LLM Ensemble Judgment Performance with Prior Transfer},
author = {Huaizhi Qu and Inyoung Choi and Zhen Tan and Song Wang and Sukwon Yun and Qi Long and Faizan Siddiqui and Kwonjoon Lee and Tianlong Chen},
url = {https://arxiv.org/abs/2504.12589},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {LLM ensembles are widely used as LLM judges. However, how to estimate their accuracy, especially in an efficient way, is unknown. In this paper, we present a principled maximum a posteriori (MAP) framework for an economical and precise estimation of the performance of LLM ensemble judgment. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, revising from the vanilla Binomial distribution. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling to balance accuracy with efficiency. Furthermore, we design a prior transfer mechanism that utilizes learned distributions on open-source datasets to improve estimation on a target dataset when only scarce annotations are available. Finally, we present BetaConform, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed distribution estimation of LLM ensemble judgment with minimum labeled samples. BetaConform is also validated empirically. For instance, with only 10 samples from the TruthfulQA dataset, for a Llama-ensembled judge, BetaConform gauges its performance with an error margin as small as 3.37%.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
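A minimal sketch of the estimation loop described above: start from a Beta prior (imagined here as transferred from an open-source dataset), update it with a handful of labeled judge outcomes, and stop once the posterior on ensemble accuracy is tight enough. The stopping rule below uses a plain credible-interval width as a stand-in for the paper's conformal criterion, and all constants are illustrative:

from scipy import stats

def map_accuracy(correct, total, alpha0=8.0, beta0=2.0):
    # Conjugate Beta-Binomial update; (alpha0, beta0) plays the role of a transferred prior.
    alpha, beta = alpha0 + correct, beta0 + (total - correct)
    map_est = (alpha - 1) / (alpha + beta - 2)          # MAP of the Beta posterior
    lo, hi = stats.beta.ppf([0.05, 0.95], alpha, beta)  # 90% posterior interval
    return map_est, hi - lo

correct, total = 0, 0
for label in [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]:  # toy stream of labeled judgments
    correct, total = correct + label, total + 1
    est, width = map_accuracy(correct, total)
    if width < 0.25:  # adaptive stopping once the estimate is precise enough
        break
print(total, round(est, 3))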
Hu, Tianyu; Tan, Zhen; Wang, Song; Qu, Huaizhi; Chen, Tianlong
Multi-Agent Debate for LLM Judges with Adaptive Stability Detection Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Hu2025c,
title = {Multi-Agent Debate for LLM Judges with Adaptive Stability Detection},
author = {Tianyu Hu and Zhen Tan and Song Wang and Huaizhi Qu and Tianlong Chen},
url = {https://neurips.cc/virtual/2025/poster/117644},
year = {2025},
date = {2025-11-30},
abstract = {With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs-as-Judges offer promise in automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when individual agents provide correct answers. To address this, we propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism that models the judges' collective correct-rate dynamics using a time-varying mixture of Beta-Binomial distributions and employs an adaptive stopping criterion based on distributional similarity (Kolmogorov-Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
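The adaptive stopping idea above can be sketched as follows: after each debate round, compare the distribution of per-question agreement scores with the previous round and stop once the two are statistically indistinguishable under a Kolmogorov-Smirnov test. The run_debate_round callable is a hypothetical stand-in for querying the judge agents; thresholds are illustrative:

import numpy as np
from scipy.stats import ks_2samp

def debate_until_stable(questions, run_debate_round, max_rounds=8, eps=0.1):
    prev = None
    for r in range(max_rounds):
        # One debate round: each question yields an agreement/consensus score in [0, 1].
        scores = np.array([run_debate_round(q, round_idx=r) for q in questions])
        if prev is not None and ks_2samp(prev, scores).statistic < eps:
            return scores, r  # consensus dynamics have stabilized; stop early
        prev = scores
    return prev, max_rounds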
He, Yinhan; Zheng, Wendy; Wang, Song; Zheng, Zaiyi; Dong, Yushun; Zhu, Yaochen; Li, Jundong
Hierarchical Demonstration Order Optimization for Many-shot In-Context Learning Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{He2025,
title = {Hierarchical Demonstration Order Optimization for Many-shot In-Context Learning},
author = {Yinhan He and Wendy Zheng and Song Wang and Zaiyi Zheng and Yushun Dong and Yaochen Zhu and Jundong Li},
url = {https://neurips.cc/virtual/2025/poster/119561
https://anonymous.4open.science/r/HIDO-B2DE/},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {In-Context Learning (ICL) is a technique where large language models (LLMs) leverage multiple demonstrations (i.e., examples) to perform tasks. With the recent expansion of LLM context windows, many-shot ICL (generally with more than 50 demonstrations) can lead to significant performance improvements on a variety of language tasks such as text classification and question answering. Nevertheless, ICL faces the issue of demonstration order instability (ICL-DOI), which means that performance varies significantly depending on the order of demonstrations. Moreover, ICL-DOI persists in many-shot ICL, as validated by our thorough experimental investigation. Current strategies for handling ICL-DOI are not applicable to many-shot ICL due to two critical challenges: (1) Most existing methods assess demonstration order quality by first prompting the LLM, then using heuristic metrics based on the LLM's predictions. In many-shot scenarios, these metrics, which lack theoretical grounding, become unreliable, as LLMs struggle to effectively utilize information from long input contexts, making order distinctions less clear. (2) The requirement to examine all orders for the large number of demonstrations is computationally infeasible due to the super-exponential complexity of the order space in many-shot ICL. To tackle the first challenge, we design a demonstration order evaluation metric based on information theory for measuring order quality, which effectively quantifies the usable information gain of a given demonstration order. To address the second challenge, we propose a hierarchical demonstration order optimization method named HIDO that enables a more refined exploration of the order space, achieving high ICL performance without the need to evaluate all possible orders. Extensive experiments on multiple LLMs and real-world datasets demonstrate that our HIDO method consistently and efficiently outperforms other baselines. Our code project can be found at https://anonymous.4open.science/r/HIDO-B2DE/.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Gupta, Animesh; Parmar, Jay; Dave, Ishan Rajendrakumar; Shah, Mubarak
From Play to Replay: Composed Video Retrieval for Sports Highlights Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Gupta2025,
title = {From Play to Replay: Composed Video Retrieval for Sports Highlights},
author = {Animesh Gupta and Jay Parmar and Ishan Rajendrakumar Dave and Mubarak Shah},
url = {https://neurips.cc/virtual/2025/poster/121717
https://animesh-007.github.io/TF-CoVR-WEBSITE/},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 1.8M triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, limiting their practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state of the art from 19.83 to 25.82.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
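Stage (ii) of the training framework described above, aligning a composed query with candidate target videos, is essentially a contrastive objective. The sketch below shows a symmetric InfoNCE loss over pre-computed embeddings; the encoders, fusion step, and temperature are assumptions, not the authors' architecture:

import torch
import torch.nn.functional as F

def covr_contrastive_loss(query_emb, target_emb, temperature=0.07):
    # query_emb: (B, D) composed queries (fused video + modification-text embeddings)
    # target_emb: (B, D) embeddings of the matching target clips
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    # Symmetric InfoNCE: each query should retrieve its target and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))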
Carnemolla, Simone; Pennisi, Matteo; Samarasinghe, Sarinda; Bellitto, Giovanni; Palazzo, Simone; Giordano, Daniela; Shah, Mubarak; Spampinato, Concetto
DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{nokey,
title = {DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models},
author = {Simone Carnemolla and Matteo Pennisi and Sarinda Samarasinghe and Giovanni Bellitto and Simone Palazzo and Daniela Giordano and Mubarak Shah and Concetto Spampinato},
url = {https://neurips.cc/virtual/2025/poster/117167},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that combines diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language reasoning about a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks—activation maximization, slice discovery and debiasing, and bias explanation—each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Ghosh, Ipsita; Nguyen, Ethan; KĂĽmmerle, Christian
Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Ghosh2025,
title = {Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training},
author = {Ipsita Ghosh and Ethan Nguyen and Christian KĂĽmmerle},
url = {https://neurips.cc/virtual/2025/poster/117315},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Parameter-efficient training, based on low-rank optimization, has become a highly successful tool for fine-tuning large deep-learning models. However, these methods fail at low-rank pretraining tasks, where maintaining the low-rank structure while optimizing the objective remains challenging. We propose the Quadratic Reweighted Rank Regularizer, dubbed Q3R, which leads to a novel low-rank inducing training strategy inspired by the iteratively reweighted least squares (IRLS) framework. Q3R is based on a quadratic regularizer term which majorizes a smoothed log-determinant serving as a rank surrogate objective. Unlike other low-rank training techniques, Q3R is able to train weight matrices with prescribed, low target ranks in models that achieve comparable predictive performance to dense models, with small computational overhead, while remaining fully compatible with existing architectures. In experiments, we are able to truncate 60% of the parameters of a ViT-Tiny model with marginal loss in CIFAR-10 performance, and up to 80% with only a 4% accuracy drop. The efficacy of Q3R is confirmed on Transformers across both image and language tasks. To demonstrate that Q3R is task-agnostic, we fine-tune RoBERTa with Q3R-regularized dense layers on GLUE tasks, achieving performance comparable to state-of-the-art low-rank adapters.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
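To make the IRLS-style regularizer above concrete, the sketch below periodically recomputes a reweighting matrix from the current weights and penalizes a quadratic form that upper-bounds a smoothed log-determinant rank surrogate (by concavity of log-det). The constants and the update schedule are assumptions, not the paper's exact recipe:

import torch

def reweighting(W, eps=1e-3):
    # D = (W W^T + eps I)^{-1}, detached so it is held fixed between reweighting steps.
    d = W.shape[0]
    return torch.linalg.inv(W @ W.T + eps * torch.eye(d, device=W.device)).detach()

def q3r_style_penalty(W, D):
    # tr(W^T D W) = tr(D W W^T) is quadratic in W and, with D computed at the previous
    # iterate, majorizes log det(W W^T + eps I) up to additive constants.
    return torch.trace(W.T @ D @ W)

# Illustrative usage: add lam * q3r_style_penalty(W, D) to the task loss and
# refresh D with reweighting(W) every few hundred optimizer steps.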
Zhao, Zhenghao; Wang, Haoxuan; Wu, Junyi; Shang, Yuzhang; Liu, Gaowen; Yan, Yan
Efficient Multimodal Dataset Distillation via Generative Models Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Zhao2025,
title = {Efficient Multimodal Dataset Distillation via Generative Models},
author = {Zhenghao Zhao and Haoxuan Wang and Junyi Wu and Yuzhang Shang and Gaowen Liu and Yan Yan},
url = {https://neurips.cc/virtual/2025/poster/119089},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement and takes days to perform the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) the lack of correlation between generated images and captions, and 2) the lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on the Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18× faster than the state-of-the-art method.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Xue, Jiaqi; Kumar, Mayank; Shang, Yuzhang; Gao, Shangqian; Ning, Rui; Zheng, Mengxin; Jiang, Xiaoqian; Lou, Qian
DictPFL: Efficient and Private Federated Learning on Encrypted Gradients Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Xue2025,
title = {DictPFL: Efficient and Private Federated Learning on Encrypted Gradients},
author = {Jiaqi Xue and Mayank Kumar and Yuzhang Shang and Shangqian Gao and Rui Ning and Mengxin Zheng and Xiaoqian Jiang and Qian Lou},
url = {https://neurips.cc/virtual/2025/poster/119806},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Federated learning (FL) enables institutions to collaboratively train machine learning models by aggregating local gradients without sharing sensitive data. However, sharing gradients still poses privacy risks, e.g., gradient inversion attacks. Homomorphic encryption (HE) can be used in FL to encrypt gradients at the data owner's side, enabling secure aggregation without decryption on the server. Existing HE approaches to FL lie at two extremes. One encrypts every gradient update, providing strong privacy but incurring prohibitive computation and bandwidth costs. The other encrypts only a subset of gradients, reducing overhead yet leaving the remaining plaintext updates vulnerable to privacy attacks. Our proposed DictPFL bridges this gap. It encrypts every gradient that must be transmitted to the server—protecting all shared information—while keeping the rest of the (unencrypted) gradients on the client, where they never leave the device. By safeguarding every transmitted update, DictPFL achieves the same privacy guarantees as fully encrypted FL, but its selective-encryption strategy slashes computational and communication overhead. DictPFL comprises two modules: Decompose-for-Partial-Encrypt (DePE) and Prune-for-Minimum-Encrypt (PrME). In DePE, we decompose the model weights to be trained into a dictionary and a lookup table. Only the gradients of the lookup table are encrypted and aggregated securely, while the dictionary remains fixed and is not transmitted for aggregation. In PrME, we aim to further minimize the encrypted parameters with an encryption-aware pruning technique that ensures a consistent pruning mask across clients by leveraging the history of global gradients. Experimental results demonstrate that DictPFL significantly reduces communication overhead by 402 to 748 times and speeds training by 28 to 65 times compared to the fully encrypted method. It also outperforms the state-of-the-art selectively encrypted gradient method by lowering overhead by 51 to 155 times and accelerating training by 4 to 19 times. DictPFL increases training time by less than a 2× factor compared with its plaintext counterpart without gradient protection, demonstrating—for the first time—that HE-based private federated learning is practical for real-world deployment.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
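A toy illustration of the DePE decomposition described above: factor a weight matrix into a fixed dictionary and a much smaller trainable lookup table, so only the lookup-table gradients would need to be encrypted and transmitted. The truncated-SVD factorization and the sizes below are illustrative assumptions, and the encryption step itself is deliberately omitted:

import numpy as np

# Toy DePE-style factorization (illustrative, not the authors' code):
# W ~ dictionary @ lookup, with the dictionary kept fixed on the client and
# only the lookup-table gradients encrypted and sent for aggregation.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 16
dictionary = U[:, :k]              # fixed, never transmitted
lookup = np.diag(s[:k]) @ Vt[:k]   # trainable; its gradients are what would be encrypted
print(W.size, lookup.size)         # 32768 vs. 2048 values to protect per round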
Adak, Deepan; Rawat, Yogesh; Vyas, Shruti
MolVision: Molecular Property Prediction with Vision Language Models Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Adak2025,
title = {MolVision: Molecular Property Prediction with Vision Language Models},
author = {Deepan Adak and Yogesh Rawat and Shruti Vyas},
url = {https://neurips.cc/virtual/2025/poster/121822
https://chemvision.github.io/chemvision/},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally uninformative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure images and textual descriptions to enhance property prediction. We construct a benchmark spanning nine diverse datasets, covering both classification and regression tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adapting the vision encoder for molecular images in conjunction with LoRA further improves performance. The code and data are available at https://chemvision.github.io/chemvision/.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Chen, Haodong; Huang, Haojian; Chen, Qifeng; Yang, Harry; Lim, Ser Nam
Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Chen2025e,
title = {Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation},
author = {Haodong Chen and Haojian Huang and Qifeng Chen and Harry Yang and Ser Nam Lim},
url = {https://neurips.cc/virtual/2025/poster/115193},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Huang, Jiani; Keoliya, Mayank; Kuo, Matthew; Velingker, Neelay; Sethi, Amish; Jung, JungHo; Li, Ziyang; Lim, Ser Nam; Naik, Mayur
ESCA: Contextualizing Embodied Agents via Scene-Graph Generation Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Huang2025b,
title = {ESCA: Contextualizing Embodied Agents via Scene-Graph Generation},
author = {Jiani Huang and Mayank Keoliya and Matthew Kuo and Neelay Velingker and Amish Sethi and JungHo Jung and Ziyang Li and Ser Nam Lim and Mayur Naik},
url = {https://neurips.cc/virtual/2025/poster/117064},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Shu, Yan; Lin, Hangui; Liu, Yexin; Zhang, Yan; Zeng, Gangyan; Li, Yan; Zhou, Yu; Lim, Ser Nam; Yang, Harry; Sebe, Nicu
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Shu2025,
title = {When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding},
author = {Yan Shu and Hangui Lin and Yexin Liu and Yan Zhang and Gangyan Zeng and Yan Li and Yu Zhou and Ser Nam Lim and Harry Yang and Nicu Sebe},
url = {https://neurips.cc/virtual/2025/poster/119366},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question–answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Ghosal, Soumya Suvra; Chakraborty, Souradip; Reddy, Avinash; Lu, Yifu; Wang, Mengdi; Manocha, Dinesh; Huang, Furong; Ghavamzadeh, Mohammad; Bedi, Amrit Singh
Does Thinking More Always Help? Understanding Test-Time Scaling in Reasoning Models Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Ghosal2025b,
title = {Does Thinking More Always Help? Understanding Test-Time Scaling in Reasoning Models},
author = {Soumya Suvra Ghosal and Souradip Chakraborty and Avinash Reddy and Yifu Lu and Mengdi Wang and Dinesh Manocha and Furong Huang and Mohammad Ghavamzadeh and Amrit Singh Bedi},
url = {https://neurips.cc/virtual/2025/poster/115605},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like “Wait” or “Let me rethink” can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance—creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
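The parallel-thinking recipe described above reduces to Best-of-N sampling with a majority vote over final answers. A minimal sketch, where generate_answer is a hypothetical stand-in for one sampled completion reduced to its final answer:

from collections import Counter

def parallel_thinking(question, generate_answer, n_paths=8):
    # Draw several independent reasoning paths within the same inference budget
    # and return the most common final answer (majority vote).
    answers = [generate_answer(question, seed=i) for i in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]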
Barakat, Anas; Chakraborty, Souradip; Yu, Peihong; Tokekar, Pratap; Bedi, Amrit Singh
On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Barakat2025,
title = {On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning},
author = {Anas Barakat and Souradip Chakraborty and Peihong Yu and Pratap Tokekar and Amrit Singh Bedi},
url = {https://neurips.cc/virtual/2025/poster/117237},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Reinforcement learning with general utilities (RLGU) offers a unifying framework to capture several problems beyond standard expected returns, including imitation learning, pure exploration, and safe RL. Despite recent fundamental advances in the theoretical analysis of policy gradient (PG) for standard RL and recent efforts in RLGU, the understanding of PG methods and their scope of application in RLGU still remain limited. In this work, we establish global optimality guarantees of PG methods for RLGU in which the objective is a general concave utility function of the state-action occupancy measure. In the tabular setting, we provide global optimality results using a new proof technique building on recent theoretical developments on the convergence of PG methods for standard RL using gradient domination. Our proof technique opens avenues for analyzing policy parameterizations beyond the direct policy parameterization for RLGU. In addition, we provide global optimality results for large state action space settings beyond prior work which has mostly focused on the tabular setting. In this large scale setting, we adapt PG methods by approximating occupancy measures within a function approximation class using maximum likelihood estimation. Our sample complexity only scales with the dimension of our function approximation class rather than the size of the state action space.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
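For readers unfamiliar with the general-utility setting, the objective and the chain-rule identity behind policy gradients in RLGU can be sketched as follows; the notation is illustrative and may differ from the paper's:

\max_{\theta}\; F\bigl(\lambda^{\pi_\theta}\bigr),
\qquad
\lambda^{\pi_\theta}(s,a) \;=\; (1-\gamma)\,\mathbb{E}_{\pi_\theta}\Bigl[\sum_{t\ge 0}\gamma^{t}\,\mathbf{1}\{s_t=s,\,a_t=a\}\Bigr],

\nabla_\theta F\bigl(\lambda^{\pi_\theta}\bigr)
\;=\; \sum_{s,a}\frac{\partial F}{\partial \lambda(s,a)}\bigl(\lambda^{\pi_\theta}\bigr)\,\nabla_\theta \lambda^{\pi_\theta}(s,a),

i.e., the gradient coincides with a standard policy gradient computed for the fixed pseudo-reward r(s,a) = ∂F/∂λ(s,a) evaluated at the current occupancy measure; the usual expected return is recovered when F is linear in λ.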
Rai, Daking; Miller, Samuel; Moran, Kevin; Yao, Ziyu
Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones Conference
The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
@conference{Rai2025,
title = {Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones},
author = {Daking Rai and Samuel Miller and Kevin Moran and Ziyu Yao},
url = {https://neurips.cc/virtual/2025/poster/120187},
year = {2025},
date = {2025-11-30},
publisher = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
abstract = {Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M–7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
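As a toy illustration of the steering idea in the abstract above: if the next-token logits can be decomposed into additive per-component contributions (attention heads and FF neurons writing into the residual stream), steering amounts to boosting the components judged reliable before summing. Obtaining that decomposition is assumed and not shown; this is a hedged sketch, not the authors' RASteer implementation:

import torch

def boost_reliable_components(component_logits, reliable_ids, factor=2.0):
    """component_logits: (num_components, vocab_size) additive logit contributions.

    Scale up the contributions of components previously identified as reliable,
    then sum to obtain the steered next-token logits.
    """
    weights = torch.ones(component_logits.size(0))
    weights[reliable_ids] = factor
    return (weights.unsqueeze(-1) * component_logits).sum(dim=0)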