Yuncong Yu • SWE/ML

Offline Multi‑Model TTS Generation Pipeline

GPU‑accelerated, multiprocessing pipeline for high‑throughput speech synthesis. Designed for reproducibility, offline operation, and robust logging.

  • Total samples (run): 250,000+
  • GPU: RTX 4090
  • Models integrated: 8+
  • Dataset size: 100 GB+

Scope: integration of multiple open‑source TTS models for offline, high‑throughput inference; focus on orchestration, batching, logging, and GPU utilization. Models are third‑party.

Pipeline In Action

Pipeline overview

ElevenLabs API Integration

In addition to offline pipelines, I built a lightweight client around the ElevenLabs API for rapid dataset generation and sound effect prototyping. This tool lets me batch-generate audio samples from text or CSV inputs, track metadata in logs, and export results in standard formats.

  • Supports batch input via .csv or .tsv
  • Logs generation status (success/failure, filenames, durations)
  • Configurable voice/model settings with simple CLI options
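
A minimal sketch of the batch loop, assuming the documented ElevenLabs REST endpoint; the CSV columns (text, output) and the VOICE_ID placeholder are hypothetical, not the tool's actual schema.

    import csv
    import os
    import sys

    import requests

    API_KEY = os.environ["ELEVENLABS_API_KEY"]
    VOICE_ID = "VOICE_ID_HERE"  # placeholder: pick a voice from your account
    URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

    def synthesize(text: str, out_path: str) -> bool:
        """POST one prompt to the TTS endpoint and save the returned audio."""
        resp = requests.post(
            URL,
            headers={"xi-api-key": API_KEY},
            json={"text": text, "model_id": "eleven_multilingual_v2"},
            timeout=120,
        )
        if resp.status_code != 200:
            return False
        with open(out_path, "wb") as f:
            f.write(resp.content)
        return True

    # usage: python batch_tts.py prompts.csv  (columns: text, output)
    with open(sys.argv[1], newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ok = synthesize(row["text"], row["output"])
            print(f"{row['output']}\t{'ok' if ok else 'failed'}")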

Audio Samples

  • Sample 1: Dia model, style‑conditioned generation
  • Sample 2: ElevenLabs API, sound effect ("Subway leaving station")
  • Sample 3: Meta Audiocraft, music generation

Technologies Used

Python · PyTorch · Multiprocessing · CUDA · Hugging Face (offline cache) · TSV/CSV I/O · tqdm · BigVGAN / Higgs Audio 3B

Pipeline Diagram

Supported Models

Each script creates a new TTS model generator.

  • IndexTTS: cloning/style; ref text + ref audio
  • Chatterbox: style‑conditioned generation
  • Dia: dialogue synthesis; multi‑turn
  • CSM‑1B: compact model; faster batches
  • Kokoro: lightweight; quick previews

Models are third‑party; this project focuses on orchestration & batch generation.

Problem & Motivation

Generating large volumes of high‑quality speech is slow, expensive, and hard to reproduce at scale. This project builds a robust, offline‑first pipeline capable of synthesizing tens of thousands of samples while keeping GPU utilization high and logs auditable.

Where It’s Useful

  • Dataset creation & augmentation
  • Voice cloning and style‑conditioned synthesis
  • Evaluation of deepfake detection on non‑speech sounds

Technical Highlights

Offline‑first execution

Models and tokenizers from local HF cache; deterministic runs; no external API.
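
A minimal sketch of the offline loading pattern, assuming models already sit in the local Hugging Face cache; HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and local_files_only are standard transformers/huggingface_hub switches, and "some/tts-model" is a placeholder ID.

    import os

    # Refuse network access: the HF stack must resolve everything from
    # the local cache or fail fast.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"

    import torch
    from transformers import AutoModel, AutoTokenizer

    torch.manual_seed(1234)  # fixed seed so repeated runs are deterministic

    # local_files_only guards against downloads even if the env vars are unset
    tok = AutoTokenizer.from_pretrained("some/tts-model", local_files_only=True)
    model = AutoModel.from_pretrained("some/tts-model", local_files_only=True)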

Multiprocessing at scale

Workers load models once; shared queues; back‑pressure; graceful failure + retries.
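
A minimal sketch of that worker loop; load_model and synthesize are stubs standing in for the real model constructors and inference calls.

    import multiprocessing as mp

    def load_model():
        return "model"  # stub: each worker builds its TTS model here, once

    def synthesize(model, task):
        return f"{task['sample_id']}.wav"  # stub for the real inference call

    def worker(task_q, result_q):
        model = load_model()                  # one model load per process
        while True:
            task = task_q.get()
            if task is None:                  # poison pill: clean shutdown
                break
            for attempt in range(3):          # graceful failure + retries
                try:
                    result_q.put((task["sample_id"], synthesize(model, task)))
                    break
                except RuntimeError:
                    if attempt == 2:
                        result_q.put((task["sample_id"], "failed"))

    if __name__ == "__main__":
        task_q = mp.Queue(maxsize=64)         # bounded queue = back-pressure
        result_q = mp.Queue()
        procs = [mp.Process(target=worker, args=(task_q, result_q))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for i in range(256):
            task_q.put({"sample_id": f"cv_en_{i:08d}"})  # blocks when full
        for _ in procs:
            task_q.put(None)
        for p in procs:
            p.join()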

GPU device affinity

CUDA device pinning via env; per‑worker memory control; safe KV‑cache handling.
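
A minimal sketch of the pinning pattern, assuming one worker per GPU; WORKER_GPU is a hypothetical env var set by the parent process, and the per‑worker cap uses PyTorch's set_per_process_memory_fraction.

    import os

    # Must happen before torch initializes CUDA: the worker then sees a
    # single device, addressed as cuda:0 regardless of physical index.
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("WORKER_GPU", "0")

    import torch

    if torch.cuda.is_available():
        # Cap this process at 90% of the visible card so co-located
        # workers cannot OOM each other.
        torch.cuda.set_per_process_memory_fraction(0.9, device=0)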

TSV/CSV‑driven batches

Ref text/audio + target text; per‑sample metadata; idempotent resume.
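
A minimal sketch of the idempotent‑resume check, assuming a tab‑separated manifest with sample_id and text columns and a hypothetical outputs/ directory: rows whose WAV already exists are skipped, so an interrupted run can simply be restarted.

    import csv
    from pathlib import Path

    OUT_DIR = Path("outputs")  # hypothetical output directory

    def pending_tasks(manifest: str):
        """Yield only manifest rows whose WAV does not exist yet."""
        with open(manifest, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                wav = OUT_DIR / f"{row['sample_id']}.wav"
                if wav.exists():   # finished on an earlier run: skip
                    continue
                yield row

    for task in pending_tasks("batch.tsv"):
        print(task["sample_id"], task["text"][:40])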

Robust logging

Per‑sample CSV log, tqdm live progress, aggregate stats and error reports.
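
A minimal sketch of the per‑sample CSV log with a tqdm progress bar; the synthesis call is stubbed out, and the column set mirrors the sample records table below.

    import csv
    import time

    from tqdm import tqdm

    tasks = [{"sample_id": f"cv_en_{i:08d}"} for i in range(100)]  # demo rows

    ok = 0
    with open("run_log.csv", "w", newline="", encoding="utf-8") as f:
        log = csv.DictWriter(f, fieldnames=["sample_id", "status", "latency_ms"])
        log.writeheader()
        for task in tqdm(tasks, desc="synthesizing"):
            t0 = time.perf_counter()
            status = "ok"          # stub: the real synthesis call goes here
            log.writerow({"sample_id": task["sample_id"],
                          "status": status,
                          "latency_ms": int((time.perf_counter() - t0) * 1000)})
            ok += status == "ok"

    print(f"aggregate: {ok} ok, {len(tasks) - ok} failed")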

Post‑processing to WAV

Audio‑ID stitching, normalization, optional filters, consistent filenames.
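
A minimal sketch of the stitch‑and‑normalize step, assuming chunk files named <audio_id>_<n>.wav at a uniform sample rate; soundfile and numpy stand in for whatever the pipeline actually uses.

    from pathlib import Path

    import numpy as np
    import soundfile as sf

    def stitch(audio_id: str, src_dir: Path, out_dir: Path) -> Path:
        """Concatenate one ID's numbered chunks, peak-normalize, write WAV."""
        # assumes zero-padded part indices so lexical order == temporal order
        parts = sorted(src_dir.glob(f"{audio_id}_*.wav"))
        chunks, rate = [], None
        for p in parts:
            data, sr = sf.read(p, dtype="float32")
            rate = rate or sr              # assumes a uniform sample rate
            chunks.append(data)
        audio = np.concatenate(chunks)
        peak = float(np.abs(audio).max())
        if peak > 0:
            audio = 0.95 * audio / peak    # normalize with a little headroom
        out = out_dir / f"{audio_id}.wav"  # consistent naming scheme
        sf.write(out, audio, rate)
        return out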

Files & Artifacts

Sample Records

sample_id        text                        ref_audio        status    latency_ms
cv_en_39586712   The quick brown fox…        speaker_12.wav   ok        712
cv_en_40953339   Hello from the pipeline…    speaker_02.wav   ok        689
cv_en_39586770   KV cache edge case…         speaker_45.wav   skipped   0

Records above are illustrative; each run writes the full per‑sample log to CSV.