3loi/NaturalVoices

NaturalVoices Dataset & Pipeline

NaturalVoices introduces a novel data-sourcing pipeline alongside the release of a new natural speech dataset for voice conversion (VC). The pipeline leverages proven, high-performance techniques to extract detailed information, such as automatic speech recognition (ASR) transcripts, speaker diarization, and signal-to-noise ratio (SNR), from raw podcast data. Using the pipeline, we create a large-scale, spontaneous, expressive, and emotionally rich speech dataset tailored for VC applications. Objective and subjective evaluations demonstrate the effectiveness of our pipeline for providing natural and expressive data for VC.

Pipeline Architecture:

[Pipeline architecture diagram]

The image above illustrates our data-sourcing pipeline and its various modules.

For an overview of the audio segments, visit the Pages [website].


Downloading the audios

The audio files are zipped and uploaded in batches. Each zip file is around 40 GB and can be unzipped individually, so please ensure you have sufficient free storage and be patient, as the download may take some time.

The audios will be saved in the audios_zipped folder in the working directory. To automatically download all the zipped files, please run the following command:

$ bash download_audios.sh

If you wish to manually download a file, please visit this [website].
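Once the batches have finished downloading, they can be extracted in one pass. A minimal sketch, assuming the zips land in audios_zipped/ as described above and that the target directory name audios/ is our own choice, not prescribed by the release:

```shell
#!/usr/bin/env bash
# Extract every downloaded batch zip into audios/ (target directory name is an assumption).
set -euo pipefail
shopt -s nullglob   # make the loop a no-op if no zips have been downloaded yet

mkdir -p audios
for z in audios_zipped/*.zip; do
    echo "Extracting $z ..."
    unzip -q -n "$z" -d audios   # -n: never overwrite files already extracted
done
```

Each batch extracts independently, so a partially downloaded collection can still be unzipped and used.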


Downloading the meta-data

The meta-data contains the output of running Faster-Whisper and PyAnnote (diarization, voice activity detection, and speaker overlap), as well as all_data.json, which contains the utterance-level predictions.

To download the meta-data, run the following command:

$ bash download_meta.sh

If you wish to manually download a file, please visit this [website].


File Structure

After downloading all the files, you should have the following file structure:

NaturalVoices
	vad
		MSP-PODCAST_0001
		...
	pyannote
		MSP-PODCAST_0001
		...
	faster-whisper
		MSP-PODCAST_0001
		...
	all_data.json

For an example of how to open and display the meta-data, please see the example_code file. In summary: each file inside the directories is a pickle file that can be loaded in Python using the following code:

import pickle

def load_pickle(file_path):
    """Load one meta-data file (e.g. vad/MSP-PODCAST_0001) from disk."""
    with open(file_path, 'rb') as f:
        data = pickle.load(f)
    return data
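The utterance-level predictions in all_data.json are plain JSON, so they can be inspected without the pickle helper. A minimal sketch; the keys it prints are whatever the file actually contains, as we make no assumption here about its schema:

```python
import json
import os

def load_all_data(path="all_data.json"):
    """Load the utterance-level predictions shipped with the meta-data."""
    with open(path, "r") as f:
        return json.load(f)

if __name__ == "__main__":
    if os.path.exists("all_data.json"):
        data = load_all_data()
        # Peek at a few top-level keys to see how utterances are indexed.
        for key in list(data)[:5]:
            print(key)
```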

Running the pipeline

The code used to generate the labels is located in pipeline_code. We used three main steps to generate NaturalVoices.

Before running the pipeline code, please update config.py with the correct paths (output_path, vad_output_path, etc.) for each output folder, as well as the "auth_key" for PyAnnote/Hugging Face.
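A sketch of what such a config.py might look like. Only output_path, vad_output_path, and auth_key are named above; every other field and value here is an assumption and should be checked against the actual file:

```python
# config.py -- sketch only; fields beyond output_path, vad_output_path,
# and auth_key are assumptions, not the repository's actual schema.

# Folders that each pipeline stage writes into.
output_path = "/data/NaturalVoices/output"
vad_output_path = "/data/NaturalVoices/vad"
pyannote_output_path = "/data/NaturalVoices/pyannote"  # assumed name

# Hugging Face access token used by the gated pyannote models.
auth_key = "hf_xxxxxxxxxxxxxxxxxxxx"  # replace with your own token
```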

  1. Run the podcast-level code
  2. Create the utterances
    • This step uses the segments from Whisper to define the utterances
    • See generate_utt
  3. Run the utterance-level code

TODO

  • Upload 16 kHz raw audio
  • Upload ASR output (Faster-Whisper)
  • Upload diarization output (PyAnnote)
  • Upload voice activity detection output (PyAnnote)
  • Upload speaker overlap output (PyAnnote)
  • Upload gender & age info
  • Upload signal-to-noise ratio (SNR)
  • Upload categorical and attribute-based emotion predictions
  • Upload sound event predictions
  • Upload the pipeline code

To cite this work, please use the following BibTeX entry:

@InProceedings{Salman_2024,
            author={A. N. Salman and Z. Du and S. S. Chandra and I. R. Ulgen and C. Busso and B. Sisman},
            title={Towards Naturalistic Voice Conversion: {NaturalVoices} Dataset with an Automatic Processing Pipeline},
            booktitle={Interspeech 2024},
            year={2024},
            month={September},
            address={Kos Island, Greece},
}
