Yuncong Yu • SWE/ML

Offline Multi‑Model TTS Generation Pipeline

GPU‑accelerated, multiprocessing pipeline for high‑throughput speech synthesis. Designed for reproducibility, offline operation, and robust logging.

  • Total samples (run): 250,000+
  • GPU: RTX 4090
  • Models integrated: 8+
  • Dataset size: 100 GB+

Scope: integration of multiple open‑source TTS models for offline, high‑throughput inference; focus on orchestration, batching, logging, and GPU utilization. Models are third‑party.

Pipeline In Action

Pipeline overview

ElevenLabs API Integration

In addition to offline pipelines, I built a lightweight client around the ElevenLabs API for rapid dataset generation and sound effect prototyping. This tool lets me batch-generate audio samples from text or CSV inputs, track metadata in logs, and export results in standard formats.

  • Supports batch input via .csv or .tsv
  • Logs generation status (success/failure, filenames, durations)
  • Configurable voice/model settings with simple CLI options
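
A minimal sketch of the batch loop, assuming the documented ElevenLabs REST endpoint; the CSV columns (text, output) and the VOICE_ID placeholder are hypothetical, not the tool's actual schema.

    import csv
    import os
    import sys

    import requests

    API_KEY = os.environ["ELEVENLABS_API_KEY"]
    VOICE_ID = "VOICE_ID_HERE"  # placeholder: pick a voice from your account
    URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

    def synthesize(text: str, out_path: str) -> bool:
        """POST one prompt to the TTS endpoint and save the returned audio."""
        resp = requests.post(
            URL,
            headers={"xi-api-key": API_KEY},
            json={"text": text, "model_id": "eleven_multilingual_v2"},
            timeout=120,
        )
        if resp.status_code != 200:
            return False
        with open(out_path, "wb") as f:
            f.write(resp.content)
        return True

    # usage: python batch_tts.py prompts.csv  (columns: text, output)
    with open(sys.argv[1], newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ok = synthesize(row["text"], row["output"])
            print(f"{row['output']}\t{'ok' if ok else 'failed'}")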

Audio Samples

  • Sample 1: Dia model, style‑conditioned generation
  • Sample 2: ElevenLabs API, sound effect ("Subway leaving station")
  • Sample 3: Meta Audiocraft, music generation

Technologies Used

Python · PyTorch · Multiprocessing · CUDA · Hugging Face (offline cache) · TSV/CSV I/O · tqdm · BigVGAN / Higgs Audio 3B

Pipeline Diagram

Supported Models

Each script creates a new TTS model generator.

  • IndexTTS: cloning/style; ref text + ref audio
  • Chatterbox: style‑conditioned generation
  • Dia: dialogue synthesis; multi‑turn
  • CSM‑1B: compact model; faster batches
  • Kokoro: lightweight; quick previews

Models are third‑party; this project focuses on orchestration & batch generation.

Problem & Motivation

Generating large volumes of high‑quality speech is slow, expensive, and hard to reproduce at scale. This project builds a robust, offline‑first pipeline capable of synthesizing tens of thousands of samples while keeping GPU utilization high and logs auditable.

Where It’s Useful

  • Dataset creation & augmentation
  • Voice cloning and style‑conditioned synthesis
  • Evaluation of deepfake detection on non‑speech sounds

Technical Highlights

Offline‑first execution

Models and tokenizers from local HF cache; deterministic runs; no external API.
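
A minimal sketch of the offline loading pattern, assuming models already sit in the local Hugging Face cache; HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and local_files_only are standard transformers/huggingface_hub switches, and "some/tts-model" is a placeholder ID.

    import os

    # Refuse network access: the HF stack must resolve everything from
    # the local cache or fail fast.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"

    import torch
    from transformers import AutoModel, AutoTokenizer

    torch.manual_seed(1234)  # fixed seed so repeated runs are deterministic

    # local_files_only guards against downloads even if the env vars are unset
    tok = AutoTokenizer.from_pretrained("some/tts-model", local_files_only=True)
    model = AutoModel.from_pretrained("some/tts-model", local_files_only=True)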

Multiprocessing at scale

Workers load models once; shared queues; back‑pressure; graceful failure + retries.
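
A minimal sketch of that worker loop; load_model and synthesize are stubs standing in for the real model constructors and inference calls.

    import multiprocessing as mp

    def load_model():
        return "model"  # stub: each worker builds its TTS model here, once

    def synthesize(model, task):
        return f"{task['sample_id']}.wav"  # stub for the real inference call

    def worker(task_q, result_q):
        model = load_model()                  # one model load per process
        while True:
            task = task_q.get()
            if task is None:                  # poison pill: clean shutdown
                break
            for attempt in range(3):          # graceful failure + retries
                try:
                    result_q.put((task["sample_id"], synthesize(model, task)))
                    break
                except RuntimeError:
                    if attempt == 2:
                        result_q.put((task["sample_id"], "failed"))

    if __name__ == "__main__":
        task_q = mp.Queue(maxsize=64)         # bounded queue = back-pressure
        result_q = mp.Queue()
        procs = [mp.Process(target=worker, args=(task_q, result_q))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for i in range(256):
            task_q.put({"sample_id": f"cv_en_{i:08d}"})  # blocks when full
        for _ in procs:
            task_q.put(None)
        for p in procs:
            p.join()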

GPU device affinity

CUDA device pinning via env; per‑worker memory control; safe KV‑cache handling.
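
A minimal sketch of the pinning pattern, assuming one worker per GPU; WORKER_GPU is a hypothetical env var set by the parent process, and the per‑worker cap uses PyTorch's set_per_process_memory_fraction.

    import os

    # Must happen before torch initializes CUDA: the worker then sees a
    # single device, addressed as cuda:0 regardless of physical index.
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("WORKER_GPU", "0")

    import torch

    if torch.cuda.is_available():
        # Cap this process at 90% of the visible card so co-located
        # workers cannot OOM each other.
        torch.cuda.set_per_process_memory_fraction(0.9, device=0)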

TSV/CSV‑driven batches

Ref text/audio + target text; per‑sample metadata; idempotent resume.
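
A minimal sketch of the idempotent‑resume check, assuming a tab‑separated manifest with sample_id and text columns and a hypothetical outputs/ directory: rows whose WAV already exists are skipped, so an interrupted run can simply be restarted.

    import csv
    from pathlib import Path

    OUT_DIR = Path("outputs")  # hypothetical output directory

    def pending_tasks(manifest: str):
        """Yield only manifest rows whose WAV does not exist yet."""
        with open(manifest, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                wav = OUT_DIR / f"{row['sample_id']}.wav"
                if wav.exists():   # finished on an earlier run: skip
                    continue
                yield row

    for task in pending_tasks("batch.tsv"):
        print(task["sample_id"], task["text"][:40])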

Robust logging

Per‑sample CSV log, tqdm live progress, aggregate stats and error reports.
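
A minimal sketch of the per‑sample CSV log with a tqdm progress bar; the synthesis call is stubbed out, and the column set mirrors the sample records table below.

    import csv
    import time

    from tqdm import tqdm

    tasks = [{"sample_id": f"cv_en_{i:08d}"} for i in range(100)]  # demo rows

    ok = 0
    with open("run_log.csv", "w", newline="", encoding="utf-8") as f:
        log = csv.DictWriter(f, fieldnames=["sample_id", "status", "latency_ms"])
        log.writeheader()
        for task in tqdm(tasks, desc="synthesizing"):
            t0 = time.perf_counter()
            status = "ok"          # stub: the real synthesis call goes here
            log.writerow({"sample_id": task["sample_id"],
                          "status": status,
                          "latency_ms": int((time.perf_counter() - t0) * 1000)})
            ok += status == "ok"

    print(f"aggregate: {ok} ok, {len(tasks) - ok} failed")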

Post‑processing to WAV

Audio‑ID stitching, normalization, optional filters, consistent filenames.
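
A minimal sketch of the stitch‑and‑normalize step, assuming chunk files named <audio_id>_<n>.wav at a uniform sample rate; soundfile and numpy stand in for whatever the pipeline actually uses.

    from pathlib import Path

    import numpy as np
    import soundfile as sf

    def stitch(audio_id: str, src_dir: Path, out_dir: Path) -> Path:
        """Concatenate one ID's numbered chunks, peak-normalize, write WAV."""
        # assumes zero-padded part indices so lexical order == temporal order
        parts = sorted(src_dir.glob(f"{audio_id}_*.wav"))
        chunks, rate = [], None
        for p in parts:
            data, sr = sf.read(p, dtype="float32")
            rate = rate or sr              # assumes a uniform sample rate
            chunks.append(data)
        audio = np.concatenate(chunks)
        peak = float(np.abs(audio).max())
        if peak > 0:
            audio = 0.95 * audio / peak    # normalize with a little headroom
        out = out_dir / f"{audio_id}.wav"  # consistent naming scheme
        sf.write(out, audio, rate)
        return out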

Files & Artifacts

Sample Records

sample_id        text                        ref_audio        status    latency_ms
cv_en_39586712   The quick brown fox…        speaker_12.wav   ok        712
cv_en_40953339   Hello from the pipeline…    speaker_02.wav   ok        689
cv_en_39586770   KV cache edge case…         speaker_45.wav   skipped   0

Records above are illustrative; each run writes the full per‑sample log to CSV.