MAIL Research aims to develop Machine Learning Algorithms that Make Sense in constrained and large-scale, societal applications in Advertising, Healthcare, and Sustainability (Remote Sensing, Computing, Agriculture)…
[more about our research] [SAIL Research] [photos] [open office hour]
VinUni-Illinois Smart Health Center (VISHC) – VISHC is open to collaborating with all researchers and industry practitioners inside and outside of Vietnam to solve healthcare-related challenges with translational and innovative research, together. Please reach out via [email] for collaboration. [more about VISHC].
Center for Environmental Intelligence (CEI) – MAIL-Research is a member of CEI. Please reach out via [email] for collaboration on environmental monitoring. [more about CEI].
Selected Press Coverage: khoahocphattrien, Thanh Nien, VnExpress, BaoDauTu, DanTri, Vietnam.vn, Yahoo Finance, Benzinga, Macau Business, Taiwan News, TNGlobal, VinGroup…
Khoa D Doan
Assistant Professor -- College of Engineering & Computer Science, VinUniversity
Associate Director -- VinUni-Illinois Smart Health Center
- Immediate Positions: Postdoc (1-2), working on responsible (safe/secure) ML and mental-health NLP (co-advised with either (a) Nitesh Chawla or (b) Heng Ji and Dilek Hakkani-Tür, together with VinUni faculty); please also fill in the form here. Deadline: until filled!
- If you're interested in joining my team, please also fill in the form here.
- Other collaboration? Please reach out via email.
news
| [03/2026] | |
|---|---|
| [03/2026] | |
| [02/2026] | Accepted papers - [CVPR’26] on federated domain generalization and [CVPR’26-Findings] on personalized FL with hypernetworks. |
| [01/2026] | Accepted papers - [TMLR’26] (Featured) on a new continual learning theme, [EACL’26] on LLM watermarking (with minimal text quality degradation), and [ICLR’26] on 2D-3D fusion for predicting molecular properties. |
| [11/2025] | Will serve as Invited Area Chair for ICML 2026. |
| [11/2025] | Gave a talk at National Taiwan University on novel problem solving and LLMs. |
| [11/2025] | Accepted paper - [AAAI’26-a] on clean backdoor attacks with data distillation. |
| [09/2025] | Accepted papers - [NeurIPS’25-a] on fixing DPO/IPO’s overfitting, [NeurIPS’25-b] on parameter attribution and inference control in diffusion models, [NeurIPS’25-c] on efficient inference with token merging for 3D point clouds, and [NeurIPS-W’25-d] on clean backdoor attacks with data distillation. |
| [08/2025] | Will serve as Invited Area Chair for AISTATS and ICLR 2026. |
| [07/2025] | |
| [07/2025] | |
| [06/2025] | Accepted papers - [ICML’25-a] (Oral) on evaluating novel equation discovery of LLM-based symbolic regression methods and [ICML-W’25-b] on fragment-aware, structure-guided graph transformer. |
| [04/2025] | Gave a talk at ICLR’25 ML for Science on Low-resource Machine Learning and Opportunities in LMICs. |
| [04/2025] | Attended the Global AI Summit on Africa and the Grand Challenges AI Community Convening in Kigali, Rwanda. |
| [03/2025] | |
selected publications [full list]
-
TMLR FEATURED Retrospective Feature Estimation for Continual Learning. Transactions on Machine Learning Research 2026
The intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which interferes with remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches often retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored direction for CL called Retrospective Feature Estimation (RFE). RFE learns to reverse feature changes by aligning the features from the current trained DNN backward to the feature space of the old task, where performing predictions is easier. This retrospective process utilizes a chain of small feature mapping networks called retrospector modules. Empirical experiments on several CL benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods, motivating further research into retrospective mechanisms as a principled alternative for mitigating catastrophic forgetting in CL. Code is available at: https://github.com/mail-research/retrospective-feature-estimation
@article{nguyen2026irl, title = {Retrospective Feature Estimation for Continual Learning}, author = {Nguyen, Nghia D and Nguyen, Hieu T and Li, Ang and Pham, Hoang V and Nguyen, Viet Anh and Doan, Khoa D}, year = {2026}, bibtex_show = {true}, abbr = {TMLR}, journal = {Transactions on Machine Learning Research}, pdf = {https://openreview.net/pdf?id=9NnhVME4Q6}, code = {https://github.com/mail-research/retrospective-feature-estimation}, selected = {true}, note = {FEATURED}, submissions = {ICML'24 -- NeurIPS'24 -- ICLR'25 -- ICCV'25 -- WACV'26 -- TMLR'26} }ICML'24 -- NeurIPS'24 -- ICLR'25 -- ICCV'25 -- WACV'26 -- TMLR'26
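For readers curious how a retrospector module might look in code, below is a minimal, illustrative PyTorch sketch (not the released implementation linked above): a small mapping network trained to pull current-backbone features back toward those of a frozen snapshot from an earlier task. The module size, the plain MSE alignment loss, and all tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Retrospector(nn.Module):
    """Small mapping network that sends current-backbone features
    back toward the feature space of an earlier task (toy version)."""
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, feat_current: torch.Tensor) -> torch.Tensor:
        return self.net(feat_current)

# Hypothetical training step: align mapped current features with the
# features produced by a frozen copy of the backbone from the old task.
retro = Retrospector(dim=512)
opt = torch.optim.Adam(retro.parameters(), lr=1e-3)
feat_current = torch.randn(32, 512)   # features from the current backbone
feat_old = torch.randn(32, 512)       # features from the frozen old backbone
loss = nn.functional.mse_loss(retro(feat_current), feat_old)
opt.zero_grad(); loss.backward(); opt.step()
```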
-
EACL SpARK: An Embarrassingly Simple Sparse Watermarking in LLMs with Enhanced Text Quality. In Findings of the European Chapter of the Association for Computational Linguistics 2026
With the widespread adoption of Large Language Models (LLMs), concerns about potential misuse have emerged. To this end, watermarking has been adapted to LLM, enabling a simple and effective way to detect and monitor generated text. However, while the existing methods can differentiate between watermarked and unwatermarked text with high accuracy, they often face a trade-off between the quality of the generated text and the effectiveness of the watermarking process. In this work, we present a novel type of LLM watermark, Sparse WatermARK (or SpARK), which aims to mitigate this trade-off by applying watermarks to a small subset of generated tokens distributed across the text. To demonstrate this type of watermark, we introduce two novel variants, SpARK-P and SpARK-R, which achieve sparsity by anchoring watermarked tokens to words that have specific Part-of-Speech (POS) tags and specific hash values w.r.t a pseudorandom hash function, respectively. Our experimental results demonstrate that the proposed watermarking schemes, albeit embarrassingly simple, are incredibly effective, achieving high detectability while generating text that outperforms previous LLM watermarking methods in quality across various tasks. SpARK further advances the watermarking capability for LLMs while maintaining their generated text quality.
@inproceedings{cao2026sparsellmwatermark, title = {SpARK: An Embarrassingly Simple Sparse Watermarking in LLMs with Enhanced Text Quality}, author = {Hoang, Cao-Duy and Le, Hung T. Q. and Chu, Rui and Li, Ping and Zhao, Weijie and Lao, Yingjie and Doan, Khoa D}, year = {2026}, bibtex_show = {true}, abbr = {EACL}, booktitle = {Findings of the European Chapter of the Association for Computational Linguistics}, code = {https://github.com/mail-research/sparse-llm-watermarking}, selected = {true}, submissions = {NeurIPS'24 -- ICLR'25 -- COLM'25 -- ARR'05-25 -- ARR'10-25 (EACL)} }NeurIPS'24 -- ICLR'25 -- COLM'25 -- ARR'05-25 -- ARR'10-25 (EACL)
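Below is a toy sketch of the hash-anchored flavor of sparse watermarking described in the abstract (closest in spirit to SpARK-R): only positions whose previous token hashes under a threshold receive the usual green-list logit bias, so most tokens are generated unchanged. The hash function, sparsity level, green-list fraction, and bias strength are illustrative assumptions, not the paper's settings.

```python
import hashlib
import numpy as np

VOCAB_SIZE = 50_000
GREEN_FRACTION = 0.5   # fraction of vocabulary favored at watermarked positions
SPARSITY = 0.2         # fraction of positions that carry the watermark
BIAS = 4.0             # logit boost added to green tokens

def _hash(token_id: int, salt: str = "") -> int:
    return int(hashlib.sha256(f"{salt}:{token_id}".encode()).hexdigest(), 16)

def is_watermark_position(prev_token_id: int) -> bool:
    """Anchor the watermark to positions whose previous token hashes low."""
    return (_hash(prev_token_id, "anchor") % 1_000) < SPARSITY * 1_000

def green_list(prev_token_id: int) -> np.ndarray:
    """Pseudorandom green list seeded by the previous token."""
    rng = np.random.default_rng(_hash(prev_token_id, "green") % (2**32))
    return rng.random(VOCAB_SIZE) < GREEN_FRACTION

def watermark_logits(logits: np.ndarray, prev_token_id: int) -> np.ndarray:
    """Bias logits only at sparse, anchored positions."""
    if not is_watermark_position(prev_token_id):
        return logits
    return logits + BIAS * green_list(prev_token_id)
```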
-
NeurIPS How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need? Tuan A Tran, Duy MH Nguyen, Chau H Tran, and others. In Advances in Neural Information Processing Systems 2025
Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present an efficient token merging strategy that drastically reduces the token count by up to 90–95\% while preserving competitive performance. Our approach estimates token importance by leveraging spatial structures within the 3D point cloud, enabling aggressive token reduction with minimal degradation in accuracy. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. Our ongoing work will release code and detailed benchmarks to support reproducibility and further system-level exploration of efficient foundation models for 3D data.
@inproceedings{nguyen2025merging3d, title = {How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?}, author = {Tran, Tuan A and Nguyen, Duy MH and Tran, Chau H and others}, booktitle = {Advances in Neural Information Processing Systems}, year = {2025}, bibtex_show = {true}, abbr = {NeurIPS}, selected = {true}, submissions = {NeurIPS'25} }NeurIPS'25
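The following is a rough, self-contained sketch of the kind of spatially informed token reduction the abstract describes, assuming PyTorch: tokens are scored by a crude local-density proxy, a small fraction is kept, and the rest are merged into their nearest kept token. The scoring rule, neighbor count, and keep ratio are placeholders rather than the paper's method.

```python
import torch

def reduce_tokens(tokens: torch.Tensor, xyz: torch.Tensor, keep_ratio: float = 0.1):
    """Toy token reduction: score tokens by mean distance to their nearest
    neighbors (a crude spatial-importance proxy), keep the top-scoring ones,
    and merge every remaining token into its nearest kept token by averaging.

    tokens: (N, D) token features; xyz: (N, 3) token coordinates.
    """
    n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    dist = torch.cdist(xyz, xyz)                          # (N, N) pairwise distances
    score = dist.topk(8, dim=1, largest=False).values.mean(dim=1)
    keep = score.topk(n_keep).indices                     # indices of kept tokens
    nearest = dist[:, keep].argmin(dim=1)                 # nearest kept token per token
    merged = torch.zeros(n_keep, d)
    counts = torch.zeros(n_keep)
    merged.index_add_(0, nearest, tokens)
    counts.index_add_(0, nearest, torch.ones(n))
    return merged / counts.clamp(min=1).unsqueeze(1), xyz[keep]

feats, coords = reduce_tokens(torch.randn(1024, 256), torch.rand(1024, 3))
```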
-
NeurIPS Mitigating Reward Over-optimization in Direct Alignment Algorithms with Adaptive Importance Sampling. Phuc M Nguyen, Ngoc-Hieu Nguyen, Binh T Nguyen, and Khoa D Doan. In Advances in Neural Information Processing Systems 2025
Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs. This approach, called IS-DAAs, multiplies the DAA objective with an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem.
@inproceedings{nguyen2025ais, title = {Mitigating Reward Over-optimization in Direct Alignment Algorithms with Adaptive Importance Sampling}, author = {Nguyen, Phuc M and Nguyen, Ngoc-Hieu and Nguyen, Binh T and Doan, Khoa D}, booktitle = {Advances in Neural Information Processing Systems}, year = {2025}, bibtex_show = {true}, abbr = {NeurIPS}, pdf = {https://arxiv.org/abs/2506.08681}, selected = {true}, code = {https://github.com/mail-research/AIS-Sampling4DAAs}, submissions = {ICLR'25 -- COLM'25 (withdraw-missing title) -- NeurIPS'25} }ICLR'25 -- COLM'25 (withdraw-missing title) -- NeurIPS'25
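A minimal sketch of the central idea as stated in the abstract: reweight a DPO-style loss by a clipped importance ratio between the reference and current policies. The way the ratio is estimated here (from chosen-response log-probabilities) and all hyperparameters are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def is_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, clip_max=5.0):
    """DPO loss reweighted by a clipped importance ratio pi_ref / pi_theta,
    estimated here from the chosen-response log-probabilities (toy version).

    logp_*: summed log-probs of the chosen (w) / rejected (l) responses under
    the policy; ref_logp_*: the same under the frozen reference policy.
    """
    dpo = -F.logsigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
    ratio = torch.exp(ref_logp_w - logp_w).detach()   # importance weight
    ratio = ratio.clamp(max=clip_max)                 # clip to control variance
    return (ratio * dpo).mean()

loss = is_dpo_loss(torch.randn(8), torch.randn(8),
                   torch.randn(8), torch.randn(8))
```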
-
NeurIPS Unveiling Concept Attribution in Diffusion Models. Quang H Nguyen, Phan Hoang, and Khoa D Doan. In Advances in Neural Information Processing Systems 2025
Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains black-box; little do we know about the role of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize layers storing knowledge in generative models without showing how those layers contribute to the target concept. In this work, we approach the model interpretability problem from a more general perspective and pose a question: "How do model components work jointly to demonstrate knowledge?". We adapt component attribution to decompose diffusion models, unveiling how a component contributes to a concept. Our framework allows effective model editing; in particular, we can erase a concept from diffusion models by removing positive components while retaining knowledge of other concepts. Surprisingly, we also show there exist components that contribute negatively to a concept, which has not been discovered in the knowledge localization approach. Experimental results confirm the role of positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models. Our code is available at https://github.com/mail-research/CAD-attribution4diffusion
@inproceedings{nguyen2025cad, title = {Unveiling Concept Attribution in Diffusion Models}, author = {Nguyen, Quang H and Hoang, Phan and Doan, Khoa D}, booktitle = {Advances in Neural Information Processing Systems}, year = {2025}, bibtex_show = {true}, abbr = {NeurIPS}, pdf = {https://arxiv.org/abs/2412.02542}, selected = {true}, code = {https://github.com/mail-research/CAD-attribution4diffusion}, submissions = {ICLR'25 -- CVPR'25 -- ICCV'25 -- NeurIPS'25} }ICLR'25 -- CVPR'25 -- ICCV'25 -- NeurIPS'25
-
ICML ORAL LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models. Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, and Chandan K Reddy. In International Conference on Machine Learning 2025
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
@inproceedings{shojaee2025llmsrbench, title = {LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models}, author = {Shojaee, Parshin and Nguyen, Ngoc-Hieu and Meidani, Kazem and Farimani, Amir Barati and Doan, Khoa D and Reddy, Chandan K}, booktitle = {International Conference on Machine Learning}, year = {2025}, note = {ORAL}, bibtex_show = {true}, abbr = {ICML}, pdf = {https://arxiv.org/abs/2504.10415}, selected = {true}, code = {https://github.com/deep-symbolic-mathematics/llm-srbench}, data = {https://huggingface.co/datasets/nnheui/llm-srbench}, submissions = {ICML'25} }ICML'25
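As a flavor of what "symbolic accuracy" can mean in this setting, here is a toy equivalence check using SymPy: two candidate equations count as a match if their difference simplifies to zero. The benchmark's actual evaluation protocol is more involved; this is only an illustrative stand-in.

```python
import sympy as sp

def symbolically_equivalent(expr_a: str, expr_b: str) -> bool:
    """Toy stand-in for 'symbolic accuracy': two discovered equations match
    if their difference simplifies to zero."""
    return sp.simplify(sp.sympify(expr_a) - sp.sympify(expr_b)) == 0

print(symbolically_equivalent("x**2 + 2*x + 1", "(x + 1)**2"))  # True
```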
-
ICLR Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks. Quang H Nguyen, Ngoc-Hieu Nguyen, The-Anh Ta, Thanh Nguyen-Tang, Kok-Seng Wong, Hoang Thanh-Tung, and Khoa D Doan. In The Twelfth International Conference on Learning Representations 2025
Deep neural networks are vulnerable to backdoor attacks, a type of adversarial attack that poisons the training data to manipulate the behavior of models trained on such data. Clean-label attacks are a more stealthy form of backdoor attacks that can perform the attack without changing the labels of poisoned data. Early works on clean-label attacks added triggers to a random subset of the training set, ignoring the fact that samples contribute unequally to the attack's success. This results in high poisoning rates and low attack success rates. To alleviate the problem, several supervised learning-based sample selection strategies have been proposed. However, these methods assume access to the entire labeled training set and require training, which is expensive and may not always be practical. This work studies a new and more practical (but also more challenging) threat model where the attacker only provides data for the target class (e.g., in face recognition systems) and has no knowledge of the victim model or any other classes in the training set. We study different strategies for selectively poisoning a small set of training samples in the target class to boost the attack success rate in this setting. Our threat model poses a serious threat in training machine learning models with third-party datasets, since the attack can be performed effectively with limited information. Experiments on benchmark datasets illustrate the effectiveness of our strategies in improving clean-label backdoor attacks.
@inproceedings{nguyen2024wicked, title = {Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks}, author = {Nguyen, Quang H and Nguyen, Ngoc-Hieu and Ta, The-Anh and Nguyen-Tang, Thanh and Wong, Kok-Seng and Thanh-Tung, Hoang and Doan, Khoa D}, booktitle = {The Twelfth International Conference on Learning Representations}, year = {2025}, bibtex_show = {true}, abbr = {ICLR}, pdf = {https://openreview.net/forum?id=1Z3C49JQVf}, selected = {true}, code = {https://github.com/mail-research/wicked-oddities-backdoor}, submissions = {NeurIPS-W'23 -- ICML'24 -- NeurIPS'24 -- ICLR'25} }NeurIPS-W'23 -- ICML'24 -- NeurIPS'24 -- ICLR'25
-
ECCV ORAL Flatness-aware Sequential Learning Generates Resilient Backdoors. Hoang V Pham, The-Anh Ta, Anh Tran, and Khoa D Doan. In European Conference on Computer Vision 2024
Recently, backdoor attacks have become an emerging threat to the security of machine learning models. From the adversary's perspective, the implanted backdoors should be resistant to defensive algorithms, but some recently proposed fine-tuning defenses can remove these backdoors with notable efficacy. This is mainly due to the catastrophic forgetting (CF) property of deep neural networks. This paper counters CF of backdoors by leveraging continual learning (CL) techniques. We begin by investigating the connectivity between a backdoored and fine-tuned model in the loss landscape. Our analysis confirms that fine-tuning defenses, especially the more advanced ones, can easily push a poisoned model out of the backdoor regions, making it forget all about the backdoors. Based on this finding, we re-formulate backdoor training through the lens of CL and propose a novel framework, named Sequential Backdoor Learning (SBL), that can generate resilient backdoors. This framework separates the backdoor poisoning process into two tasks: the first task learns a backdoored model, while the second task, based on the CL principles, moves it to a backdoored region resistant to fine-tuning. We additionally propose to seek flatter backdoor regions via a sharpness-aware minimizer in the framework, further strengthening the durability of the implanted backdoor. Finally, we demonstrate the effectiveness of our method through extensive empirical experiments on several benchmark datasets in the backdoor domain. The source code is available at https://github.com/mail-research/SBL-resilient-backdoors
@inproceedings{pham2024SBL, title = {Flatness-aware Sequential Learning Generates Resilient Backdoors}, author = {Pham, Hoang V and Ta, The-Anh and Tran, Anh and Doan, Khoa D}, booktitle = {European Conference on Computer Vision}, year = {2024}, note = {ORAL}, code = {https://github.com/mail-research/SBL-resilient-backdoors}, bibtex_show = {true}, abbr = {ECCV}, pdf = {https://www.arxiv.org/abs/2407.14738}, slides = {https://docs.google.com/presentation/d/1YEyQDSBardXdHCv-qBmxZnBwnXMAtRKZfWsawknowFs/edit?usp=sharing}, submissions = {CVPR'24 -- ECCV'24}, selected = {true} }CVPR'24 -- ECCV'24
-
ECCV Data Poisoning Quantization Backdoor Attack. Tran Huynh, Anh Tran, Khoa D Doan, and Tung Pham. In European Conference on Computer Vision 2024
Deep learning (DL) models are often large and require a lot of computing power. Hence, model quantization is frequently used to reduce their size and complexity, making them more suitable for deployment on edge devices or achieving real-time performance. It has been previously shown that standard quantization frameworks can be exploited to activate the backdoor in a DL model. This means that an attacker could create a hijacked model that appears normal and free from backdoors (even when examined by state-of-the-art defenses), but when it is quantized, the backdoor is activated, and the attacker can control the model’s output. Existing backdoor attack methods on quantization models require full access to the victim model, which might not hold in practice. In this work, we focus on designing a novel quantization backdoor based on data poisoning, which requires zero knowledge of the target model. The key component is a trigger pattern generator, which is trained together with a surrogate model in an alternating manner. The attack’s effectiveness is tested on multiple benchmark datasets, including CIFAR10, CelebA, and ImageNet10, as well as state-of-the-art backdoor defenses.
@inproceedings{huynh2024quantizedbd, title = {Data Poisoning Quantization Backdoor Attack}, author = {Huynh, Tran and Tran, Anh and Doan, Khoa D and Pham, Tung}, booktitle = {European Conference on Computer Vision}, year = {2024}, bibtex_show = {true}, abbr = {ECCV}, pdf = {https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/11142.pdf}, submissions = {CVPR'24 -- ECCV'24}, selected = {true} }CVPR'24 -- ECCV'24
-
PREPRINT MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs. 2024
The rapid progress in machine learning (ML) has brought forth many large language models (LLMs) that excel in various tasks and areas. These LLMs come with different abilities and costs in terms of computation or pricing. Since the demand for each query can vary, e.g., because of the queried domain or its complexity, defaulting to one LLM in an application is not usually the best choice, whether it is the biggest, priciest, or even the one with the best average test performance. Consequently, picking the right LLM that is both accurate and cost-effective for an application remains a challenge. In this paper, we introduce MetaLLM, a framework that dynamically and intelligently routes each query to the optimal LLM (among several available LLMs) for classification tasks, achieving significantly improved accuracy and cost-effectiveness. By framing the selection problem as a multi-armed bandit, MetaLLM balances prediction accuracy and cost efficiency under uncertainty. Our experiments, conducted on popular LLM platforms such as OpenAI's GPT models, Amazon's Titan, Anthropic's Claude, and Meta's LLaMa, showcase MetaLLM's efficacy in real-world scenarios, laying the groundwork for future extensions beyond classification tasks.
@article{nguyen2024metallm, title = {MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs}, author = {Nguyen, Quang H and Hoang, Cao-Duy and Decugis, Juliette and Manchanda, Saurav and Chawla, Nitesh V and Doan, Khoa D}, year = {2024}, bibtex_show = {true}, abbr = {PREPRINT}, pdf = {https://arxiv.org/abs/2407.10834}, code = {https://github.com/mail-research/MetaLLM-wrapper/}, selected = {true} } -
UAI Cold-start Recommendation by Personalized Embedding Region Elicitation. In The Conference on Uncertainty in Artificial Intelligence 2023
Rating elicitation is a success element for recommender systems to perform well at cold-starting, in which the systems need to recommend items to a newly arrived user with no prior knowledge about the user's preference. Existing elicitation methods employ a fixed set of items to learn the user's preference and then infer the users' preferences on the remaining items. Using a fixed seed set can limit the performance of the recommendation system since the seed set is unlikely optimal for all new users with potentially diverse preferences. This paper addresses this challenge using a 2-phase, personalized elicitation scheme. First, the elicitation scheme asks users to rate a small set of popular items in a ``burn-in'' phase. Second, it sequentially asks the user to rate adaptive items to refine the preference and the user's representation. Throughout the process, the system represents the user's embedding value not by a point estimate but by a region estimate. The value of information obtained by asking the user's rating on an item is quantified by the distance from the region center embedding space that contains with high confidence the true embedding value of the user. Finally, the recommendations are successively generated by considering the preference region of the user. We show that each subproblem in the elicitation scheme can be efficiently implemented. Further, we empirically demonstrate the effectiveness of the proposed method against existing rating-elicitation methods on several prominent datasets.
@inproceedings{nguyen2024cold, title = {Cold-start Recommendation by Personalized Embedding Region Elicitation}, author = {Nguyen, Hieu Trung and Nguyen, Duy and Doan, Khoa D and Nguyen, Viet Anh}, booktitle = {The Conference on Uncertainty in Artificial Intelligence}, year = {2023}, bibtex_show = {true}, abbr = {UAI}, pdf = {https://arxiv.org/pdf/2406.00973}, submissions = {RecSys'23 -- CIKM'23 -- AAAI'24 -- UAI'24}, selected = {true} }RecSys'23 -- CIKM'23 -- AAAI'24 -- UAI'24
-
PREPRINT Synthesizing Physical Backdoor Datasets: An Automated Framework Leveraging Deep Generative Models. Sze Jue Yang, Chinh D La, Quang H Nguyen, Eugene Bagdasaryan, Kok-Seng Wong, Anh T Tran, Chee Seng Chan, and Khoa D Doan. 2024
Backdoor attacks, representing an emerging threat to the integrity of deep neural networks, have garnered significant attention due to their ability to compromise deep learning systems clandestinely. While numerous backdoor attacks occur within the digital realm, their practical implementation in real-world prediction systems remains limited and vulnerable to disturbances in the physical world. Consequently, this limitation has given rise to the development of physical backdoor attacks, where trigger objects manifest as physical entities within the real world. However, creating the requisite dataset to train or evaluate a physical backdoor model is a daunting task, limiting the backdoor researchers and practitioners from studying such physical attack scenarios. This paper unleashes a recipe that empowers backdoor researchers to effortlessly create a malicious, physical backdoor dataset based on advances in generative modeling. Particularly, this recipe involves 3 automatic modules: suggesting the suitable physical triggers, generating the poisoned candidate samples (either by synthesizing new samples or editing existing clean samples), and finally refining for the most plausible ones. As such, it effectively mitigates the perceived complexity associated with creating a physical backdoor dataset, transforming it from a daunting task into an attainable objective. Extensive experiment results show that datasets created by our “recipe” enable adversaries to achieve an impressive attack success rate on real physical world data and exhibit similar properties compared to previous physical backdoor attack studies. This paper offers researchers a valuable toolkit for studies of physical backdoors, all within the confines of
@article{yang2023synthesizing, title = {Synthesizing Physical Backdoor Datasets: An Automated Framework Leveraging Deep Generative Models}, author = {Yang, Sze Jue and La, Chinh D and Nguyen, Quang H and Bagdasaryan, Eugene and Wong, Kok-Seng and Tran, Anh T and Chan, Chee Seng and Doan, Khoa D}, year = {2024}, bibtex_show = {true}, selected = {true}, abbr = {PREPRINT}, code = {https://github.com/mail-research/synthetic-physical-backdoor-datasets}, pdf = {https://arxiv.org/abs/2312.03419} } -
ACL-Findings Fooling the Textual Fooler via Randomizing Latent Representations. In Findings of the Association for Computational Linguistics 2024
Despite outstanding performance in a variety of NLP tasks, recent studies have revealed that NLP models are vulnerable to adversarial attacks that slightly perturb the input to cause the models to misbehave. Among these attacks, adversarial word-level perturbations are well-studied and effective attack strategies. Since these attacks work in black-box settings, they do not require access to the model architecture or model parameters and thus can be detrimental to existing NLP applications. To perform an attack, the adversary queries the victim model many times to determine the most important words in an input text and to replace these words with their corresponding synonyms. In this work, we propose a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example in these query-based black-box attacks; that is to fool the textual fooler. This defense, named AdvFooler, works by randomizing the latent representation of the input at inference time. Different from existing defenses, AdvFooler does not necessitate additional computational overhead during training nor relies on assumptions about the potential adversarial perturbation set while having a negligible impact on the model's accuracy. Our theoretical and empirical analyses highlight the significance of robustness resulting from confusing the adversary via randomizing the latent space, as well as the impact of randomization on clean accuracy. Finally, we empirically demonstrate near state-of-the-art robustness of AdvFooler against representative adversarial word-level attacks on two benchmark datasets.
@inproceedings{hoang2023advfooler, title = {Fooling the Textual Fooler via Randomizing Latent Representations}, author = {Hoang, Cao-Duy and Nguyen, Quang H and Manchanda, Saurav and Peng, Minlong and Wong, Kok-Seng and Doan, Khoa D}, year = {2024}, abbr = {ACL-Findings}, bibtex_show = {true}, booktitle = {Findings of the Association for Computational Linguistics}, pdf = {https://arxiv.org/abs/2310.01452}, code = {https://github.com/mail-research/AdvFooler-text-defender}, submissions = {EMNLP'24 -- ICLR'24 -- ACL'24}, selected = {true} }EMNLP'24 -- ICLR'24 -- ACL'24
-
ICLR Understanding the Robustness of Randomized Feature Defense Against Query-Based Adversarial Attacks. In The Twelfth International Conference on Learning Representations 2024
Recent works have shown that deep neural networks are vulnerable to adversarial examples that find samples close to the original image but can make the model misclassify. Even with access only to the model's output, an attacker can employ black-box attacks to generate such adversarial examples. In this work, we propose a simple and lightweight defense against black-box attacks by adding random noise to hidden features at intermediate layers of the model at inference time. Our theoretical analysis confirms that this method effectively enhances the model's resilience against both score-based and decision-based black-box attacks. Importantly, our defense does not necessitate adversarial training and has minimal impact on accuracy, rendering it applicable to any pre-trained model. Our analysis also reveals the significance of selectively adding noise to different parts of the model based on the gradient of the adversarial objective function, which can be varied during the attack. We demonstrate the robustness of our defense against multiple black-box attacks through extensive empirical experiments involving diverse models with various architectures.
@inproceedings{nguyen2023randomized, title = {Understanding the Robustness of Randomized Feature Defense Against Query-Based Adversarial Attacks}, author = {Nguyen, Quang H and Lao, Yingjie and Pham, Tung and Wong, Kok-Seng and Doan, Khoa D}, booktitle = {The Twelfth International Conference on Learning Representations}, year = {2024}, bibtex_show = {true}, abbr = {ICLR}, pdf = {https://openreview.net/forum?id=vZ6r9GMT1n}, selected = {true}, code = {https://github.com/mail-research/randomized_defenses}, submissions = {NeurIPS'23 -- ICLR'24} }NeurIPS'23 -- ICLR'24
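A minimal sketch of the inference-time idea described above, assuming PyTorch: forward hooks add Gaussian noise to the outputs of selected intermediate layers, so every query sees slightly perturbed features. The layer choice and noise scale are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

def add_feature_noise(model: nn.Module, layer_names, sigma: float = 0.05):
    """Register forward hooks that add Gaussian noise to the outputs of the
    named intermediate layers at inference time (toy randomized defense)."""
    modules = dict(model.named_modules())
    handles = []
    for name in layer_names:
        def hook(_module, _inputs, output, s=sigma):
            return output + s * torch.randn_like(output)
        handles.append(modules[name].register_forward_hook(hook))
    return handles  # call h.remove() on each handle to disable the defense

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
handles = add_feature_noise(model, layer_names=["0", "1"])
logits = model(torch.randn(4, 32))  # each query now sees slightly noisy features
```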
-
NeurIPS Iba: Towards irreversible backdoor attacks in federated learning. Advances in Neural Information Processing Systems 2024
Federated learning (FL) is a distributed learning approach that enables machine learning models to be trained on decentralized data without compromising end devices' personal, potentially sensitive data. However, the distributed nature and uninvestigated data intuitively introduce new security vulnerabilities, including backdoor attacks. In this scenario, an adversary implants backdoor functionality into the global model during training, which can be activated to cause the desired misbehaviors for any input with a specific adversarial pattern. Despite having remarkable success in triggering and distorting model behavior, prior backdoor attacks in FL often hold impractical assumptions, limited imperceptibility, and durability. Specifically, the adversary needs to control a sufficiently large fraction of clients or know the data distribution of other honest clients. In many cases, the trigger inserted is often visually apparent, and the backdoor effect is quickly diluted if the adversary is removed from the training process. To address these limitations, we propose a novel backdoor attack framework in FL, the Irreversible Backdoor Attack (IBA), that jointly learns the optimal and visually stealthy trigger and then gradually implants the backdoor into a global model. This approach allows the adversary to execute a backdoor attack that can evade both human and machine inspections. Additionally, we enhance the efficiency and durability of the proposed attack by selectively poisoning the model's parameters that are least likely updated by the main task's learning process and constraining the poisoned model update to the vicinity of the global model. Finally, we evaluate the proposed attack framework on several benchmark datasets, including MNIST, CIFAR-10, and Tiny ImageNet, and achieved high success rates while simultaneously bypassing existing backdoor defenses and achieving a more durable backdoor effect compared to other backdoor attacks. Overall, IBA offers a more effective, stealthy, and durable approach to backdoor attacks in FL.
@article{nguyen2024iba, title = {Iba: Towards irreversible backdoor attacks in federated learning}, author = {Nguyen, Thuy Dung and Nguyen, Tuan M and Tran, Anh T and Doan, Khoa D and Wong, Kok-Seng}, journal = {Advances in Neural Information Processing Systems}, volume = {36}, year = {2024}, submissions = {NeurIPS'23}, bibtex_show = {true}, abbr = {NeurIPS}, pdf = {https://proceedings.neurips.cc/paper_files/paper/2023/hash/d0c6bc641a56bebee9d985b937307367-Abstract-Conference.html}, code = {https://github.com/sail-research/iba}, selected = {true} }NeurIPS'23
-
EAAI Backdoor attacks and defenses in federated learning: Survey, challenges and future research directions. Engineering Applications of Artificial Intelligence 2024
Federated learning (FL) is an approach within the realm of machine learning (ML) that allows the use of distributed data without compromising personal privacy. In FL, it becomes evident that the training data among participants frequently exhibit heterogeneous distribution characteristics. This inherent heterogeneity poses a substantial challenge for the orchestration server as it strives to assess the reliability of each local model update. Due to this challenge, FL becomes susceptible to various potential risks, with the ominous backdoor attack standing out as one of the most menacing threats. Backdoor attacks involve the insertion of malicious functionality into a targeted model through poisoned updates from malicious clients. These attacks can cause the global model to misbehave on specific inputs while appearing normal in other instances. Although the backdoor attacks received significant attention for their potential impact on practical deep learning applications, their exploration within the realm of FL remains limited. This survey seeks to address this gap by offering an all-encompassing examination of prevailing backdoor attack tactics and defenses in the context of FL. We include an exhaustive analysis of diverse approaches to provide a comprehensive understanding of this intricate landscape. Furthermore, we also discuss the challenges and potential future directions for attacks and defenses in the context of FL.
@article{nguyen2024backdoor, title = {Backdoor attacks and defenses in federated learning: Survey, challenges and future research directions}, author = {Nguyen, Thuy Dung and Nguyen, Tuan M and Le Nguyen, Phi and Pham, Hieu H and Doan, Khoa D and Wong, Kok-Seng}, journal = {Engineering Applications of Artificial Intelligence}, volume = {127}, pages = {107166}, year = {2024}, publisher = {Elsevier}, submissions = {EAAI'24}, bibtex_show = {true}, abbr = {EAAI}, pdf = {https://www.sciencedirect.com/science/article/pii/S0952197623013507}, selected = {true} }EAAI'24
-
SIGIR Asymmetric Hashing for Fast Ranking via Neural Network Measures. In 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023
Fast item ranking is an important task in recommender systems. In previous works, graph-based Approximate Nearest Neighbor (ANN) approaches have demonstrated good performance on item ranking tasks with generic searching/matching measures (including complex measures such as neural network measures). However, since these ANN approaches must go through the neural measures several times during ranking, the computation is not practical if the neural measure is a large network. On the other hand, fast item ranking using existing hashing-based approaches, such as Locality Sensitive Hashing (LSH), only works with a limited set of measures, such as cosine and Euclidean distance, but not with general search measures such as neural networks. Given an arbitrary searching measure, previous learning-to-hash approaches are also not suitable to solve the fast item ranking problem since they can take a significant amount of time and computation to train the hash functions to approximate the searching measure due to a large number of possible training pairs in this problem. Hashing approaches, however, are attractive because they provide a principal and efficient way to retrieve candidate items. In this paper, we propose a simple and effective learning-to-hash approach for the fast item ranking problem that can be used to efficiently approximate any type of measure, including neural network measures. Specifically, we solve this problem with an asymmetric hashing framework based on discrete inner product fitting. We learn a pair of related hash functions that map heterogeneous objects (e.g., users and items) into a common discrete space where the inner product of their binary codes reveals their true similarity defined via the original searching measure. The fast ranking problem is reduced to an ANN search via this asymmetric hashing scheme. Then, we propose a sampling strategy to efficiently select relevant and contrastive samples to train the hashing model. We empirically validate the proposed method against the existing state-of-the-art fast item ranking methods in several combinations of non-linear searching functions and prominent datasets.
@inproceedings{doan2023flora, bibtex_show = {true}, abbr = {SIGIR}, title = {Asymmetric Hashing for Fast Ranking via Neural Network Measures}, author = {Doan, Khoa D and Tan, Shulong and Zhao, Weijie and Li, Ping}, booktitle = {46th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year = {2023}, submissions = {SIGIR'21 -- RecSys'21 -- WSDM'22 -- WWW'22 -- SIGIR'22 -- VLDB'23 -- CIKM'22 -- WWW'23 -- SIGIR'23}, pdf = {https://dl.acm.org/doi/abs/10.1145/3539618.3591640}, selected = {true} }SIGIR'21 -- RecSys'21 -- WSDM'22 -- WWW'22 -- SIGIR'22 -- VLDB'23 -- CIKM'22 -- WWW'23 -- SIGIR'23
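To illustrate the retrieval-time side of the asymmetric hashing idea (training of the two hash functions is not shown), here is a toy NumPy sketch in which user and item binary codes are compared by inner product as a cheap stand-in for the expensive neural matching measure; the code length and ranking depth are arbitrary.

```python
import numpy as np

def binary_inner_product_rank(user_code: np.ndarray, item_codes: np.ndarray, k: int = 5):
    """Rank items for a user by the inner product of their binary codes,
    which stands in for the expensive neural matching measure at query time.

    user_code: (B,) in {-1, +1}; item_codes: (N, B) in {-1, +1}.
    """
    scores = item_codes @ user_code          # higher = more similar
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
user = rng.choice([-1, 1], size=64)
items = rng.choice([-1, 1], size=(10_000, 64))
top_items = binary_inner_product_rank(user, items, k=10)
```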
-
AAAI Defending backdoor attacks on vision transformer via patch processing. Khoa D Doan, Yingjie Lao, and Ping Li. In AAAI Conference on Artificial Intelligence 2023
Vision Transformers (ViTs) have a radically different architecture with significantly less inductive bias than Convolutional Neural Networks. Along with the improvement in performance, security and robustness of ViTs are also of great importance to study. In contrast to many recent works that exploit the robustness of ViTs against adversarial examples, this paper investigates a representative causative attack, i.e., backdoor. We first examine the vulnerability of ViTs against various backdoor attacks and find that ViTs are also quite vulnerable to existing attacks. However, we observe that the clean-data accuracy and backdoor attack success rate of ViTs respond distinctively to patch transformations before the positional encoding. Then, based on this finding, we propose an effective method for ViTs to defend both patch-based and blending-based trigger backdoor attacks via patch processing. The performances are evaluated on several benchmark datasets, including CIFAR10, GTSRB, and TinyImageNet, which show the proposed defense is very successful in mitigating backdoor attacks for ViTs. To the best of our knowledge, this paper presents the first defensive strategy that utilizes a unique characteristic of ViTs against backdoor attacks.
@inproceedings{doan2023bdvitt, abbr = {AAAI}, title = {Defending backdoor attacks on vision transformer via patch processing}, author = {Doan, Khoa D and Lao, Yingjie and Li, Ping}, booktitle = {AAAI Conference on Artificial Intelligence}, year = {2023}, submissions = {CVPR'22 -- ICCV'22 -- AAAI'23}, bibtex_show = {true}, selected = {true}, teaser = {doan2023bdvits.png}, pdf = {https://ojs.aaai.org/index.php/AAAI/article/view/25125} }CVPR'22 -- ICCV'22 -- AAAI'23
-
NeurIPS Marksman Backdoor: Backdoor Attacks with Arbitrary Target Class. Khoa D Doan, Yingjie Lao, and Ping Li. In Thirty-Sixth Conference on Neural Information Processing Systems 2022
In recent years, machine learning models have been shown to be vulnerable to backdoor attacks. Under such attacks, an adversary embeds a stealthy backdoor into the trained model such that the compromised models will behave normally on clean inputs but will misclassify according to the adversary's control on maliciously constructed input with a trigger. While these existing attacks are very effective, the adversary's capability is limited: given an input, these attacks can only cause the model to misclassify toward a single pre-defined or target class. In contrast, this paper exploits a novel backdoor attack with a much more powerful payload, denoted as Marksman, where the adversary can arbitrarily choose which target class the model will misclassify given any input during inference. To achieve this goal, we propose to represent the trigger function as a class-conditional generative model and to inject the backdoor in a constrained optimization framework, where the trigger function learns to generate an optimal trigger pattern to attack any target class at will while simultaneously embedding this generative backdoor into the trained model. Given the learned trigger-generation function, during inference, the adversary can specify an arbitrary backdoor attack target class, and an appropriate trigger causing the model to classify toward this target class is created accordingly. We show empirically that the proposed framework achieves high attack performance (e.g., 100% attack success rates in several experiments) while preserving the clean-data performance in several benchmark datasets, including MNIST, CIFAR10, GTSRB, and TinyImageNet. The proposed Marksman backdoor attack can also easily bypass existing backdoor defenses that were originally designed against backdoor attacks with a single target class. Our work takes another significant step toward understanding the extensive risks of backdoor attacks in practice.
@inproceedings{doan2022marksman, bibtex_show = {true}, abbr = {NeurIPS}, title = {Marksman Backdoor: Backdoor Attacks with Arbitrary Target Class}, author = {Doan, Khoa D and Lao, Yingjie and Li, Ping}, booktitle = {Thirty-Sixth Conference on Neural Information Processing Systems}, year = {2022}, submissions = {NeurIPS'22}, teaser = {doan2022marksman.png}, pdf = {https://openreview.net/forum?id=i-k6J4VkCDq}, code = {https://github.com/khoadoan106/backdoor_attacks}, slides = {https://nips.cc/media/neurips-2022/Slides/52924.pdf}, selected = {true} }NeurIPS'22
-
CVPR One Loss for Quantization: Deep Hashing with Discrete Wasserstein Distributional Matching. In Conference on Computer Vision and Pattern Recognition 2022
Image hashing is a principled approximate nearest neighbor approach to find similar items to a query in a large collection of images. Hashing aims to learn a binary-output function that maps an image to a binary vector. For optimal retrieval performance, producing balanced hash codes with low-quantization error to bridge the gap between the learning stage's continuous relaxation and the inference stage's discrete quantization is important. However, in the existing deep supervised hashing methods, coding balance and low-quantization error are difficult to achieve and involve several losses. We argue that this is because the existing quantization approaches in these methods are heuristically constructed and not effective to achieve these objectives. This paper considers an alternative approach to learning the quantization constraints. The task of learning balanced codes with low quantization error is re-formulated as matching the learned distribution of the continuous codes to a pre-defined discrete, uniform distribution. This is equivalent to minimizing the distance between two distributions. We then propose a computationally efficient distributional distance by leveraging the discrete property of the hash functions. This distributional distance is a valid distance and enjoys lower time and sample complexities. The proposed single-loss quantization objective can be integrated into any existing supervised hashing method to improve code balance and quantization error. Experiments confirm that the proposed approach substantially improves the performance of several representative hashing methods.
@inproceedings{doan2022hswd, bibtex_show = {true}, abbr = {CVPR}, title = {One Loss for Quantization: Deep Hashing with Discrete Wasserstein Distributional Matching}, author = {Doan, Khoa D and Yang, Peng and Li, Ping}, booktitle = {Conference on Computer Vision and Pattern Recognition}, year = {2022}, submissions = {CVPR'22}, slides = {doan2022hswd-slides.pdf}, code = {https://github.com/khoadoan106/single_loss_quantization}, pdf = {https://openreview.net/pdf?id=uaqweIZ-9_k}, teaser = {doan2022hswd.png}, selected = {true} }CVPR'22
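As a rough illustration of the single-loss idea, the toy penalty below matches each bit of the continuous codes to a balanced {-1, +1} sample via a sorting-based 1D Wasserstein-1 distance; the paper's actual distributional distance and training setup differ, so treat this only as a sketch of the spirit of the approach.

```python
import torch

def quantization_loss(codes: torch.Tensor) -> torch.Tensor:
    """Toy distribution-matching quantization penalty: for each bit, match the
    sorted continuous outputs to a balanced {-1, +1} target (1D Wasserstein-1
    computed by sorting). Encourages binary, balanced codes with one loss."""
    n, _ = codes.shape
    target = torch.cat([-torch.ones(n // 2), torch.ones(n - n // 2)])  # balanced bits
    sorted_codes, _ = torch.sort(codes, dim=0)                         # (n, bits)
    return (sorted_codes - target.unsqueeze(1)).abs().mean()

loss = quantization_loss(torch.tanh(torch.randn(128, 32)))
```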
-
NeurIPS Backdoor Attack with Imperceptible Input and Latent Modification. Khoa D Doan, Yingjie Lao, and Ping Li. In Thirty-Fifth Conference on Neural Information Processing Systems 2021
Recent studies have shown that deep neural networks (DNN) are vulnerable to various adversarial attacks. In particular, an adversary can inject a stealthy backdoor into a model such that the compromised model will behave normally without the presence of the trigger. Techniques for generating backdoor images that are visually imperceptible from clean images have also been developed recently, which further enhance the stealthiness of the backdoor attacks from the input space. Along with the development of attacks, defense against backdoor attacks is also evolving. Many existing countermeasures found that backdoor tends to leave tangible footprints in the latent or feature space, which can be utilized to mitigate backdoor attacks. In this paper, we extend the concept of imperceptible backdoor from the input space to the latent representation, which significantly improves the effectiveness against the existing defense mechanisms, especially those relying on the distinguishability between clean inputs and backdoor inputs in latent space. In the proposed framework, the trigger function will learn to manipulate the input by injecting imperceptible input noise while matching the latent representations of the clean and manipulated inputs via a Wasserstein-based regularization of the corresponding empirical distributions. We formulate such an objective as a non-convex and constrained optimization problem and solve the problem with an efficient stochastic alternating optimization procedure. We name the proposed backdoor attack as Wasserstein Backdoor (WB), which achieves a high attack success rate while being stealthy from both the input and latent spaces, as tested in several benchmark datasets, including MNIST, CIFAR10, GTSRB, and TinyImagenet.
@inproceedings{doan2021wb, bibtex_show = {true}, abbr = {NeurIPS}, title = {Backdoor Attack with Imperceptible Input and Latent Modification}, author = {Doan, Khoa D and Lao, Yingjie and Li, Ping}, booktitle = {Thirty-Fifth Conference on Neural Information Processing Systems}, year = {2021}, submissions = {NeurIPS'21}, pdf = {https://proceedings.neurips.cc/paper/2021/file/9d99197e2ebf03fc388d09f1e94af89b-Paper.pdf}, code = {https://github.com/khoadoan106/backdoor_attacks}, video = {https://recorder-v3.slideslive.com/?share=51522&s=8af881c0-56e8-451e-865f-adb1e90e5471}, slides = {doan2021wb-slides.pdf}, teaser = {doan2021wb.png}, selected = {true} }NeurIPS'21
-
ICCV LIRA: Learnable, Imperceptible and Robust Backdoor Attacks. In International Conference on Computer Vision 2021
Recently, machine learning models have demonstrated to be vulnerable to backdoor attacks, primarily due to the lack of transparency in black-box models such as deep neural networks. A third-party model can be poisoned such that it works adequately in normal conditions but behaves maliciously on samples with specific trigger patterns. However, the trigger injection function is manually defined in most existing backdoor attack methods, e.g., placing a small patch of pixels on an image or slightly deforming the image before poisoning the model. This results in a two-stage approach with a sub-optimal attack success rate and a lack of complete stealthiness under human inspection. In this paper, we propose a novel and stealthy backdoor attack framework, LIRA, which jointly learns the optimal, stealthy trigger injection function and poisons the model. We formulate such an objective as a non-convex, constrained optimization problem. Under this optimization framework, the trigger generator function will learn to manipulate the input with imperceptible noise to preserve the model performance on the clean data and maximize the attack success rate on the poisoned data. Then, we solve this challenging optimization problem with an efficient, two-stage stochastic optimization procedure. Finally, the proposed attack framework achieves 100% success rates in several benchmark datasets, including MNIST, CIFAR10, GTSRB, and T-ImageNet, while simultaneously bypassing existing backdoor defense methods and human inspection.
@inproceedings{doan2021lira, bibtex_show = {true}, abbr = {ICCV}, title = {LIRA: Learnable, Imperceptible and Robust Backdoor Attacks}, author = {Doan, Khoa D and Lao, Yingjie and Zhao, Weijie and Li, Ping}, booktitle = {International Conference on Computer Vision}, submissions = {ICCV'21}, code = {https://github.com/khoadoan106/backdoor_attacks}, slides = {https://github.com/sunbelbd/invisible_backdoor_attacks/raw/master/resources/ICCV2021-LIRA-Slides.pdf}, poster = {https://github.com/sunbelbd/invisible_backdoor_attacks/raw/master/resources/ICCV2021-LIRA-Poster.pdf}, pdf = {https://openaccess.thecvf.com/content/ICCV2021/papers/Doan_LIRA_Learnable_Imperceptible_and_Robust_Backdoor_Attacks_ICCV_2021_paper.pdf}, year = {2021}, teaser = {doan2021lira.png}, selected = {true} }ICCV'21
-
SIGIR Interpretable Graph Similarity Computation via Differentiable Optimal Alignment of Node Embeddings. Khoa D Doan, Saurav Manchanda, Suchismit Mahapatra, and Chandan K Reddy. In 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
Computing graph similarity is an important task in many graph-related applications such as retrieval in graph databases or graph clustering. While numerous measures have been proposed to capture the similarity between a pair of graphs, Graph Edit Distance (GED) and Maximum Common Subgraphs (MCS) are the two widely used measures in practice. GED and MCS are domain-agnostic measures of structural similarity between the graphs and define the similarity as a function of pairwise alignment of different entities (such as nodes, edges, and subgraphs) in the two graphs. The explicit explainability offered by the pairwise alignment provides transparency and justification of the similarity score, thus, GED and MCS have important practical applications. However, their exact computations are known to be NP-hard. While recently proposed neural-network based approximations have been shown to accurately compute these similarity scores, they have limited ability in providing comprehensive explanations compared to classical combinatorial algorithms, e.g., Beam search. This paper aims at efficiently approximating these domain-agnostic similarity measures through a neural network, and simultaneously learning the alignments (i.e., explanations) similar to those of classical intractable methods. Specifically, we formulate the similarity between a pair of graphs as the minimal "transformation" cost from one graph to another in the learnable node-embedding space. We show that, if node embedding is able to capture its neighborhood context closely, our proposed similarity function closely approximates both the alignment and the similarity score of classical methods. Furthermore, we also propose an efficient differentiable computation of our proposed objective for model training. Empirically, we demonstrate that the proposed method achieves up to 50%-100% reduction in the Mean Squared Error for the graph similarity approximation task and up to 20% improvement in the retrieval evaluation metrics for the graph retrieval task. The source code is available at https://github.com/khoadoan/GraphOTSim.
@inproceedings{doan2021interpretable, bibtex_show = {true}, abbr = {SIGIR}, url = {https://doi.org/10.1145/3404835.3462960}, code = {https://github.com/khoadoan/GraphOTSim}, pdf = {doan2021interpretable.pdf}, slides = {https://github.com/khoadoan/GraphOTSim/raw/main/resources/SIGIR21-fp0937-slides.pdf}, video = {https://www.youtube.com/watch?v=IWxxsuFPsgs&t=1s}, teaser = {doan2021interpretable.png}, selected = {true}, submissions = {AAAI'21 -- WWW'21 -- SIGIR'21}, author = {Doan, Khoa D and Manchanda, Saurav and Mahapatra, Suchismit and Reddy, Chandan K}, title = {Interpretable Graph Similarity Computation via Differentiable Optimal Alignment of Node Embeddings}, year = {2021}, isbn = {9781450380379}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, doi = {10.1145/3404835.3462960}, booktitle = {44th International ACM SIGIR Conference on Research and Development in Information Retrieval}, pages = {665–674}, numpages = {10}, keywords = {similarity search, model interpretability, graph similarity, GCN}, location = {Virtual Event, Canada}, series = {SIGIR '21} }AAAI'21 -- WWW'21 -- SIGIR'21
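A toy sketch of the "transformation cost" view of graph similarity described above: node embeddings of two graphs are aligned one-to-one and the total alignment cost is returned. Here the non-differentiable Hungarian assignment (SciPy) and a unit padding cost stand in for the paper's differentiable optimal alignment; embedding dimensions and sizes are arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def alignment_cost(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Toy 'transformation cost' between two graphs in node-embedding space:
    cost of the optimal one-to-one alignment of their node embeddings."""
    n = max(len(emb_a), len(emb_b))
    cost = np.ones((n, n))                                   # padding cost for unmatched nodes
    d = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)
    cost[:len(emb_a), :len(emb_b)] = d
    rows, cols = linear_sum_assignment(cost)                 # optimal assignment
    return float(cost[rows, cols].sum())

sim_cost = alignment_cost(np.random.rand(5, 16), np.random.rand(7, 16))
```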
-
WWW Efficient Implicit Unsupervised Text Hashing Using Adversarial Autoencoder. Khoa D Doan and Chandan K Reddy. In Proceedings of The Web Conference 2020
Searching for documents with semantically similar content is a fundamental problem in the information retrieval domain with various challenges, primarily, in terms of efficiency and effectiveness. Despite the promise of modeling structured dependencies in documents, several existing text hashing methods lack an efficient mechanism to incorporate such vital information. Additionally, the desired characteristics of an ideal hash function, such as robustness to noise, low quantization error and bit balance/uncorrelation, are not effectively learned with existing methods. This is because of the requirement to either tune additional hyper-parameters or optimize these heuristically and explicitly constructed cost functions. In this paper, we propose a Denoising Adversarial Binary Autoencoder (DABA) model which presents a novel representation learning framework that captures structured representation of text documents in the learned hash function. Also, adversarial training provides an alternative direction to implicitly learn a hash function that captures all the desired characteristics of an ideal hash function. Essentially, DABA adopts a novel single-optimization adversarial training procedure that minimizes the Wasserstein distance in its primal domain to regularize the encoder’s output of either a recurrent neural network or a convolutional autoencoder. We empirically demonstrate the effectiveness of our proposed method in capturing the intrinsic semantic manifold of the related documents. The proposed method outperforms the current state-of-the-art shallow and deep unsupervised hashing methods for the document retrieval task on several prominent document collections.
@inproceedings{doan2020efficient, bibtex_show = {true}, abbr = {WWW}, url = {https://doi.org/10.1145/3366423.3380150}, html = {https://dl.acm.org/doi/abs/10.1145/3366423.3380150}, pdf = {https://people.cs.vt.edu/~reddy/papers/WWW20a.pdf}, teaser = {doan2020efficient.png}, selected = {true}, author = {Doan, Khoa D and Reddy, Chandan K}, title = {Efficient Implicit Unsupervised Text Hashing Using Adversarial Autoencoder}, year = {2020}, isbn = {9781450370233}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, doi = {10.1145/3366423.3380150}, booktitle = {Proceedings of The Web Conference}, pages = {684–694}, numpages = {11}, keywords = {autoencoder, Hashing, adversarial training, deep learning.}, location = {Taipei, Taiwan}, series = {WWW '20} } -
arXiv Gradient boosting neural networks: Grownet. Sarkhan Badirli, Xuanqing Liu, Zhengming Xing, Avradeep Bhowmik, Khoa D Doan, and Sathiya K Selvaraj. arXiv preprint arXiv:2002.07971, 2020
A novel gradient boosting framework is proposed where shallow neural networks are employed as ``weak learners''. General loss functions are considered under this unified framework with specific examples presented for classification, regression, and learning to rank. A fully corrective step is incorporated to remedy the pitfall of greedy function approximation of classic gradient boosting decision tree. The proposed model rendered outperforming results against state-of-the-art boosting methods in all three tasks on multiple datasets. An ablation study is performed to shed light on the effect of each model components and model hyperparameters.
@article{badirli2020gradient, bibtex_show = {true}, abbr = {arXiv}, url = {https://arxiv.org/abs/2002.07971}, code = {https://github.com/sbadirli/GrowNet}, pdf = {https://arxiv.org/pdf/2002.07971.pdf}, teaser = {badirli2020gradient.png}, selected = {true}, title = {Gradient boosting neural networks: Grownet}, author = {Badirli, Sarkhan and Liu, Xuanqing and Xing, Zhengming and Bhowmik, Avradeep and Doan, Khoa D and Selvaraj, Sathiya K}, journal = {arXiv preprint arXiv:2002.07971}, year = {2020} } -
CIKM Adversarial Factorization Autoencoder for Look-Alike Modeling. Khoa D Doan, Pranjul Yadav, and Chandan K Reddy. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management 2019
Digital advertising is performed in multiple ways, for e.g., contextual, display-based and search-based advertising. Across these avenues, the primary goal of the advertiser is to maximize the return on investment. To realize this, the advertiser often aims to target the advertisements towards a targeted set of audience as this set has a high likelihood to respond positively towards the advertisements. One such form of tailored and personalized, targeted advertising is known as look-alike modeling, where the advertiser provides a set of seed users and expects the machine learning model to identify a new set of users such that the newly identified set is similar to the seed-set with respect to the online purchasing activity. Existing look-alike modeling techniques (i.e., similarity-based and regression-based) suffer from serious limitations due to the implicit constraints induced during modeling. In addition, the high-dimensional and sparse nature of the advertising data increases the complexity. To overcome these limitations, in this paper, we propose a novel Adversarial Factorization Autoencoder that can efficiently learn a binary mapping from sparse, high-dimensional data to a binary address space through the use of an adversarial training procedure. We demonstrate the effectiveness of our proposed approach on a dataset obtained from a real-world setting and also systematically compare the performance of our proposed approach with existing look-alike modeling baselines.
@inproceedings{doan2019adversarial, bibtex_show = {true}, abbr = {CIKM}, pdf = {https://dmkd.cs.vt.edu/papers/CIKM19.pdf}, teaser = {doan2019adversarial.png}, selected = {true}, author = {Doan, Khoa D and Yadav, Pranjul and Reddy, Chandan K}, title = {Adversarial Factorization Autoencoder for Look-Alike Modeling}, year = {2019}, isbn = {9781450369763}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3357384.3357807}, doi = {10.1145/3357384.3357807}, booktitle = {Proceedings of the 28th ACM International Conference on Information and Knowledge Management}, pages = {2803–2812}, numpages = {10}, keywords = {deep learning, autoencoder, hashing, factorization, adversarial training, look-alike modeling}, location = {Beijing, China}, series = {CIKM '19} }
Open Office Hour
I will occasionally hold group open office hours (fully ONLINE) for *anyone*. Feel free to sign up to connect, chat, or ask any questions.
When I was a student, I was clueless sometimes (if not most of the time) and had no idea how to get help. I hope that, through this modest effort, I can share some experience with you and address questions you may have, drawing on my experience working in both industry and academia, on both applied and research projects, and on studying abroad in the US. I encourage you to converse in English.
This effort is inspired by ML Collective