Tidy Tunes is an easy-to-use pipeline for mining high-quality audio data for speech generation models. To do so, it chains multiple open source models while minimizing dependencies.
Specifically, the pipeline includes (see also https://www.arxiv.org/pdf/2409.03283):
- voice source segmentation
- segmentation based on voice activity detection
- speaker segmentation
- rolloff frequency filtering
- filtering based on PESQ of denoised audio
- DNSMOS filtering
- spoken language identification filtering
- accent / dialect classification
- background music detection
- dual ASR verification
- single-model ASR transcription
Each filtering component can run in filtering mode (keep/drop segments based on a threshold) or annotation mode (attach computed metadata to segments and save it alongside the audio). Both modes can be combined in a single pipeline step.
It provides two commands, one for downloading audios from sources like YouTube, and one for processing the downloaded audio efficiently.
You can install Tidy Tunes directly from the GitHub repository using pip:
pip install git+https://github.com/meaningTeam/tidy-tunes.gitIf you wish, clone the repository or install additional dependencies:
git clone git@github.com:meaningTeam/tidy-tunes.git && cd tidy-tunes
pip install -e .[dev]The CLI offers two commands, one for downloading audio from videos from the internet (YouTube), the other one for processing audio files with a stack of data-cleaning models.
To download the audios, use:
tidytunes download-youtube [OPTIONS] [VIDEO_IDS]...
Options:
-d, --download_dir PATH Ouput directory
-k, --proxy-api-key TEXT API key for the proxy server. [required]
-e, --proxy-endpoint TEXT Proxy endpoint address.
-w, --max-workers INTEGER Number of parallel download threads.
-l, --subtitle-language TEXT Language code for substitle language.
You can place your secret proxy API_KEY into ~/.tidytunes and it will be auto-discovered by the script, for example:
API_KEY=Wk1yM8sY9M4S7rT3g5Xq2L6bV0NfJHp
To process the downloaded audios, use:
tidytunes process-audios [OPTIONS] [AUDIO_PATHS]...
Options:
-c, --config PATH Path to the YAML config file defining pipeline components and parameters. [required]
-o, --out PATH Destination directory to save processed
files. [required]
-d, --device [cpu|cuda] Device on which to run the pipeline.
-w, --overwrite Overwrite processed files.
You can find the default config at configs/full_en.yaml. If needed, modify or disable the pipeline components and their parameters.
Each pipeline step in the YAML config supports the following keys:
| Key | Description |
|---|---|
name |
Name of the pipeline component (required). |
params |
Dict of keyword arguments passed to the component function. |
condition |
A Python lambda that receives the per-segment score and returns True to keep the segment. |
annotate |
A string key under which the component's output is stored in the segment's annotations. |
The interplay of condition and annotate determines the behaviour:
condition |
annotate |
Behaviour |
|---|---|---|
| absent | absent | Segmenting — the component splits/trims audio (e.g. VAD, diarization). |
| present | absent | Filtering — segments that fail the condition are dropped. |
| absent | present | Annotation only — the score is stored, all segments pass through. |
| present | present | Annotate + filter — the score is stored, then segments are filtered. |
When annotations are present, they are saved as a JSON sidecar file (.json) next to each output audio segment. Some components provide rich annotations beyond a single scalar — for example, language ID and accent classification store the full probability distribution, and ASR comparison stores both transcripts alongside the WER.
See configs/annotate_en.yaml for an example config that showcases all annotation modes.
Tidy Tunes can also be used directly in Python. Example usage:
from tidytunes.utils import Audio, trim_audios, partition
from tidytunes.pipeline_components import find_segments_with_single_speaker, get_dnsmos
device = "cuda"
path = "path/to/my_audio.flac"
output_dir = "processed/"
audio_segments = [Audio.from_file(path)]
timestamps = find_segments_with_single_speaker(audio_segments, min_duration=3.2, device=device)
audio_segments = trim_audios(audio_segments, timestamps)
mos = get_dnsmos(audio_segments, device=device)
audio_segments, poor_quality_audio_segments = partition(audio_segments, by=mos > 3.3)
for segment in zipaudio_segments:
segment.play()
segment.to_file(output_dir)If used, please cite it as follows:
@misc{tidytunes,
author = {Tomas Nekvinda and Jan Vainer and Vojtech Srdecny},
title = {Tidy Tunes: An Easy-to-Use Pipeline for Mining High-Quality Audio Data},
year = {2025},
howpublished = {\url{https://github.com/meaningTeam/tidy-tunes}},
note = {Version 1.1.0}
}
Tidy Tunes is released under the MIT License. See LICENSE for details.