TL;DR: This paper proposes SANTA, a post-training scheme that exposes video MLLMs' hallucination tendencies via contrasted negative captions and aligns regional objects and relation-guided actions with visual-temporal phrases.
TL;DR: This paper proposes TA-Prompting, a two-stage Video-LLM post-training scheme that introduces temporal anchors to improve event localization and dense video captioning.
TL;DR: This paper proposes EMLoC, an emulator-based LoRA fine-tuning method that lets users train large models at the same memory cost as inference (e.g., fine-tuning a 38B model on a single 24 GB GPU).
TL;DR: This paper introduces VideoMage, the first framework to enable multi-subject and motion customization of text-to-video diffusion models.
TL;DR: This paper proposes a Mixture-of-Knowledge-Experts framework with an activation-guided routing mechanism for serial lifelong knowledge editing in LLMs.
TL;DR: This paper proposes Select-and-Distill, a dual-teacher distillation method that preserves zero-shot ability while reducing forgetting in continual VLM learning.
TL;DR: This paper introduces Receler for erasing concepts from pre-trained diffusion models, exhibiting both locality (i.e., without affecting non-target concepts) and robustness (i.e., against paraphrased and adversarial prompts).
TL;DR: This paper introduces RAPPER, a two-stage training paradigm for VLMs that mitigates implausibility and hallucination in generated natural language explanations (NLEs) through reinforced language-based feedback.