Kaisi Guan (关开思)
I am a Master's student at Gaoling School of Artificial Intelligence (GSAI), Renmin University of China (RUC), advised by Prof. Ruihua Song. Prior to this, I received my bachelor’s degree from GSAI in 2024.
My research centers on omni understanding and generation: building systems that perceive the world across modalities and learn to generate multimodal content in turn.
Main Research Interests:
- Omni Generation: Building controllable, high-fidelity generative models for video and audio, exploring how their modeling paradigms converge toward a unified framework, along with post-training methods for better alignment and controllability.
- Omni Understanding: Investigating vision–language–audio interplay and building omni-modal models with stronger understanding of video and audio.
I will graduate in 2027 and am seeking job opportunities in video / audio / image generation. Feel free to contact me at guankaisi@ruc.edu.cn.
News
| June, 2026 | Our work Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction was accepted by ECCV 2026! |
|---|---|
| June, 2026 | Our work ChronusOmni: Improving Time Awareness of Omni Large Language was accepted by ECCV 2026! |
| April, 2026 | Our work HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation was accepted by ICME 2026! |
| June, 2025 | Our work ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering was accepted by ICCV 2025! |
| Sep, 2024 | Our work BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain was accepted by EMNLP 2024! |
Experiences
| 2026.4 - Present | Weixin @ Tencent. Advised by Wenjing Wang | Intern |
|---|---|---|
| 2025.1 - 2025.10 | AIML @ Apple. Advised by Jeff Lai and Kieran Liu | Intern |
| 2023.1 - 2023.7 | ModelBest | Intern |
| 2024.9 - Present | AIMind Lab @ Gaoling School of Artificial Intelligence. Advised by Ruihua Song | Master's Student |
| 2020.9 - 2024.6 | Gaoling School of Artificial Intelligence, RUC | Undergraduate Student |
Selected Publications
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction. arXiv preprint arXiv:2510.03117, 2026
- ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering. International Conference on Computer Vision (ICCV 2025), Oct 2025
- SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning. arXiv preprint arXiv:2605.12179, 2026
- VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning. arXiv preprint arXiv:2509.24773, 2026