The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
- https://www.zhangxueyao.com/
Stars
Code for the ICML 2020 paper "CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information"
[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Inference and training library for high-quality TTS models.
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Multi-lingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
The official GitHub page for the survey paper "Foundation Models for Music: A Survey".
A library for time-domain speech data augmentation
Diffusion Model for Voice Conversion
PolySinger: Singing-Voice to Singing-Voice Translation From English to Japanese
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. An AI foley artist that adds vivid, synchronized sound effects to your silent videos 😝
AI Audio Datasets (AI-ADS) 🎵, including Speech, Music, and Sound Effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio a…
Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793
This is the GitHub page for publicly available emotional speech data.
Public Code for Neural Codec Language Models for Disentangled and Textless Voice Conversion (Interspeech 2024)
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
A generative speech model for daily dialogue.
Pitch Estimating Neural Networks (PENN)
Code for SpeechTokenizer, presented in "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models". Samples are presented on the accompanying demo page.
State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.
Lumina-T2X is a unified framework for Text to Any Modality Generation
Paper list of misinformation research using (multi-modal) large language models, i.e., (M)LLMs.
An extremely fast Python linter and code formatter, written in Rust.
Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"