Publications

(* indicates equal contribution)

Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR: This paper proposes SANTA, a post-training scheme that exposes video MLLMs’ hallucination tendencies via contrastive negative captions and aligns regional objects and relation-guided actions with their corresponding visual-temporal phrases.

TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Wei-Yuan Cheng*, Kai-Po Chang*, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR: This paper proposes TA-Prompting, a two-stage post-training scheme for Video-LLMs that leverages temporal anchors to improve event localization and dense video captioning.

EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction

Neural Information Processing Systems (NeurIPS) 2025
Hsi-Che Lin, Yu-Chu Yu, Kai-Po Chang, Yu-Chiang Frank Wang

TL;DR: This paper proposes EMLoC, an emulator-based LoRA fine-tuning method that lets users fine-tune large models at the same memory cost as inference (e.g., fine-tuning a 38B model requires only 24 GB of GPU memory).

VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2025
Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR: This paper introduces VideoMage, the first framework to enable multi-subject and motion customization of text-to-video diffusion models.

Serial Lifelong Editing via Mixture of Knowledge Experts

Association for Computational Linguistics (ACL) 2025
YuJu Cheng, Yu-Chu Yu, Kai-Po Chang, Yu-Chiang Frank Wang

TL;DR: This paper proposes a Mixture-of-Knowledge-Experts framework with an activation-guided routing mechanism for serial lifelong knowledge editing in LLMs.

Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

European Conference on Computer Vision (ECCV) 2024
Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen, Kai-Po Chang, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR: This paper proposes Select-and-Distill, a dual-teacher knowledge distillation method that preserves zero-shot ability while mitigating forgetting in continual learning of vision-language models.

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

European Conference on Computer Vision (ECCV) 2024
Chi-Pin Huang*, Kai-Po Chang*, Chung-Ting Tsai, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR: This paper introduces Receler for erasing concepts from pre-trained diffusion models, exhibiting both locality (i.e., not affecting non-target concepts) and robustness (i.e., withstanding paraphrased prompts and adversarial attacks).

RAPPER: Reinforced Rationale-Prompted Paradigm for Natural Language Explanation in Visual Question Answering

International Conference on Learning Representations (ICLR) 2024
Kai-Po Chang, Chi-Pin Huang, Wei-Yuan Cheng, Fu-En Yang, Chien-Yi Wang, Yung-Hsuan Lai, Yu-Chiang Frank Wang

TL;DR: This paper introduces RAPPER, a two-stage training paradigm for VLMs that mitigates implausibility and hallucination issues in generating natural language explanations (NLEs) through reinforced language-based feedback.