Keynote

Exploration of Song Generation and Evaluation Frameworks
Abstract
Music, as a fundamental component of human culture, embodies emotional expression and creative innovation. From the classical era to the contemporary period, its forms and modes of creation have continually evolved. In recent years, the rapid advancement of artificial intelligence has profoundly transformed the field of music generation, with generative models offering powerful capabilities for the automatic composition of high-quality musical works. This talk focuses on song generation and evaluation, examining methodologies for music creation and aesthetic assessment based on generative modeling. We first introduce DiffRhythm, an end-to-end song generation framework that integrates melody, lyrics, and vocal synthesis within a unified architecture. We then present the construction of SongEval, a music aesthetics evaluation dataset that provides a reliable foundation for assessing the artistic quality and perceptual appeal of generated songs. Building on these components, we further propose DiffRhythm+, an enhanced generation framework that refines musicality and expressive creativity through improved model design and evaluation feedback mechanisms. Overall, the talk highlights recent explorations of generative modeling and evaluation technologies in the domain of music generation, aiming to contribute new perspectives and methodological insights for future research in automatic song generation and assessment.
Biography
Lei Xie is a Professor at the School of Computer Science, Northwestern Polytechnical University (NPU), Xi’an, China, where he leads the Audio, Speech and Language Processing Laboratory (ASLP@NPU). Prior to joining NPU, he held research positions at Vrije Universiteit Brussel (VUB), City University of Hong Kong, and The Chinese University of Hong Kong. Professor Xie has authored more than 400 peer-reviewed papers in leading journals and conferences on speech and audio processing, which have collectively received over 15,000 citations according to Google Scholar. In 2024, he was recognized on the Stanford University and Elsevier list of the world’s most highly cited scientists. His current research interests span a broad range of topics in speech and language processing, multimedia, and human–computer interaction. He serves as a Senior Area Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing and IEEE Signal Processing Letters. He is also the Vice Chair of the ISCA Special Interest Group on Chinese Spoken Language Processing (ISCA-CSLP) and has previously served as a member of the IEEE Speech and Language Technical Committee (SLTC).