TL;DR: This paper proposes SANTA, a post-training scheme that exposes video MLLMs' hallucination tendencies via contrasted negative captions and aligns regional objects and relation-guided actions with visual-temporal phrases.
TL;DR: This paper proposes TA-Prompting, a two-stage Video-LLM post-training scheme that introduces temporal anchors to improve event localization and dense video captioning.
TL;DR: This paper proposes EMLoC, an emulator-based LoRA fine-tuning method that lets users train large models at the same memory cost as inference (e.g., fine-tuning a 38B model on a single 24 GB GPU).
TL;DR: This paper introduces VideoMage, the first framework to enable multi-subject and motion customization of text-to-video diffusion models.
TL;DR: This paper proposes a Mixture-of-Knowledge-Experts framework with an activation-guided routing mechanism for serial lifelong knowledge editing in LLMs.
TL;DR: This paper proposes Select-and-Distill, a dual-teacher distillation method that preserves zero-shot ability while reducing forgetting in continual VLM learning.
TL;DR: This paper introduces Receler for erasing concepts from pre-trained diffusion models, exhibiting both locality (i.e., without affecting non-target concepts) and robustness (i.e., against paraphrased and adversarial prompts).
TL;DR: This paper introduces RAPPER, a two-stage training paradigm for VLMs that mitigates implausibility and hallucination in generated natural language explanations (NLEs) through reinforced language-based feedback.