Kaisi Guan

I am a master's student at Gaoling School of Artificial Intelligence (GSAI), Renmin University of China (RUC), advised by Prof. Ruihua Song. Prior to this, I received my bachelor’s degree from GSAI in 2024.

My research centers on omni-modal understanding and generation: building systems that perceive the world across modalities and, in turn, learn to generate it across modalities.

Main Research Interests:

Omni Generation: Building controllable, high-fidelity generative models for video and audio, exploring how their modeling paradigms converge toward a unified framework, along with post-training methods for better alignment and controllability.
Omni Understanding: Investigating vision–language–audio interplay and building omni-modal models with stronger understanding of video and audio.

I will graduate in 2027 and am seeking job opportunities in video / audio / image generation. Feel free to contact me at guankaisi@ruc.edu.cn.

News

June, 2026	Our work Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction was accepted by ECCV 2026!
June, 2026	Our work ChronusOmni: Improving Time Awareness of Omni Large Language Models was accepted by ECCV 2026!
April, 2026	Our work HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation was accepted by ICME 2026!
June, 2025	Our work ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering was accepted by ICCV 2025!
Sep, 2024	Our work BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain was accepted by EMNLP 2024!

Experiences

2026.4 - Present	Weixin @ Tencent. Advised by Wenjing Wang	Intern
2025.1 - 2025.10	AIML @ Apple. Advised by Jeff Lai and Kieran Liu	Intern
2023.1 - 2023.7	ModelBest	Intern
2024.9 - Present	AIMind Lab @ Gaoling School of Artificial Intelligence. Advised by Ruihua Song	Master's Student
2020.9 - 2024.6	Gaoling School of Artificial Intelligence, RUC	Undergraduate Student

Selected Publications

arXiv

Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, Meng Cao

arXiv preprint arXiv:2510.03117 2026

arXiv PDF Project Page
ICCV 2025

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, and Ruihua Song

International Conference on Computer Vision, ICCV 2025, Oct 2025

arXiv PDF Project Page
EMNLP 2024

BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain

Kaisi Guan, Qian Cao, Yuchong Sun, Xiting Wang, and Ruihua Song

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

arXiv PDF Project Page
arXiv

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

Xin Cheng, Xihua Wang, Ying Ba, Yuyue Wang, Kaisi Guan, Yinbo Wang, Wenpu Li, Ruihua Song

arXiv preprint arXiv:2605.12179 2026

arXiv PDF Project Page
ICME 2026

HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

Bingzi Zhang, Kaisi Guan, Ruihua Song

IEEE International Conference on Multimedia and Expo (ICME 2026).

arXiv PDF
arXiv

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

arXiv preprint arXiv:2509.24773 2026

arXiv PDF Project Page
arXiv

ChronusOmni: Improving Time Awareness of Omni Large Language Models

Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru

arXiv preprint arXiv:2512.09841 2026

arXiv PDF