Manav volunteered with a Lucknow school for visually impaired students. The school's textbooks weren't available as audiobooks in the Hindi-Urdu mix its students use. He used Coqui TTS with the IndicTTS Hindi voices to generate 80 hours of textbook audio over a weekend.
A YouTube creator then offered him ₹15,000 to clone a famous actress's voice from public interviews so the creator could "make her say whatever I want." Manav refused. The lesson covers both the technology and why he said no.
Modern TTS typically works in two stages:
- Text → Mel-spectrogram: A model (Tacotron2, FastSpeech2, or VITS encoder) converts text to a 2D representation of frequency content over time.
- Spectrogram → Audio waveform: A vocoder (HiFi-GAN, WaveRNN, or VITS decoder) converts the spectrogram to actual sound samples (for example, audio at 22,050 Hz).
VITS combines both stages into a single end-to-end model, which simplifies the pipeline and tends to improve quality. Many current open-source TTS systems build on it.
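To make the intermediate representation concrete, here is a minimal NumPy sketch of a spectrogram. It computes a plain magnitude spectrogram (not a mel-scaled one) of a synthetic tone standing in for speech. The window length, hop length, and function name are illustrative assumptions, not values taken from any particular TTS model.

```python
import numpy as np

SAMPLE_RATE = 22050  # samples per second, as in the text above
N_FFT = 1024         # assumed analysis window length (samples)
HOP = 256            # assumed hop between frames (samples)


def magnitude_spectrogram(wave: np.ndarray) -> np.ndarray:
    """Return a (frequency_bins, frames) magnitude spectrogram."""
    n_frames = 1 + (len(wave) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack(
        [wave[i * HOP : i * HOP + N_FFT] * window for i in range(n_frames)]
    )
    # rfft keeps the N_FFT // 2 + 1 non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T


# One second of a 440 Hz tone stands in for synthesised speech.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * SAMPLE_RATE / N_FFT
print(spec.shape, round(peak_hz, 1))
```

The first stage of a TTS model predicts an array of this general shape from text; the vocoder then performs the much harder inverse step of turning it back into samples.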
| System | Quality (MOS) | Latency | Indian Languages |
|---|---|---|---|
| Coqui XTTS-v2 | 4.2 / 5 | 200ms | 17 langs incl. Hindi |
| IndicTTS (AI4Bharat) | 4.0 / 5 | 150ms | 13 Indian langs |
| Bark | 4.1 / 5 | 5–8s | Multi-lingual |
| Google Cloud TTS | 4.4 / 5 | 100ms | 14 Indian langs |
| ElevenLabs | 4.6 / 5 | 250ms | Limited Indian |
```python
!pip install -q TTS

import os

import numpy as np
from TTS.api import TTS

# XTTS v2: multilingual, supports Hindi
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)


def text_to_audiobook(chapter_path: str, output_path: str, lang: str = "hi"):
    """Convert one chapter file to a WAV audiobook segment."""
    with open(chapter_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Chunk by sentence (split on the danda) to avoid memory blow-up on long text
    sentences = [s.strip() for s in text.split("।") if s.strip()]

    audio_segments = []
    for i, sentence in enumerate(sentences):
        wav = tts.tts(
            text=sentence,
            language=lang,
            speaker_wav="reference_voice.wav",  # 6-second clip of approved voice
        )
        audio_segments.append(wav)
        if i % 10 == 0:
            print(f"  {i}/{len(sentences)} sentences synthesised")

    # Concatenate and save
    full_audio = np.concatenate(audio_segments)
    tts.synthesizer.save_wav(full_audio, output_path)


# Generate the whole textbook, one WAV file per .txt chapter
os.makedirs("./audiobook", exist_ok=True)
for chapter in sorted(os.listdir("./hindi_textbook/")):
    if not chapter.endswith(".txt"):
        continue  # skip anything that is not a chapter text file
    text_to_audiobook(
        chapter_path=f"./hindi_textbook/{chapter}",
        output_path=f"./audiobook/{chapter.replace('.txt', '.wav')}",
        lang="hi",
    )
```
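Splitting only on the danda (।) leaves questions, exclamations, and very long sentences unhandled. One possible refinement, sketched here with the standard library only, splits on several terminators and caps the chunk length. The `split_for_tts` name and the `max_chars` limit are assumptions for illustration, not XTTS requirements, and the pattern only splits where whitespace follows the terminator.

```python
import re

# Split after a danda, double danda, or Latin . ? ! when whitespace follows.
_TERMINATORS = re.compile(r"(?<=[।॥.?!])\s+")


def split_for_tts(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-sized chunks no longer than max_chars."""
    chunks = []
    for sentence in _TERMINATORS.split(text.strip()):
        sentence = sentence.strip()
        if not sentence:
            continue
        # Break over-long sentences at word boundaries.
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


sample = "यह पहला वाक्य है। क्या यह दूसरा है? हाँ!"
print(split_for_tts(sample))
```

Unlike `text.split("।")`, this keeps the terminator attached to each sentence, which may help the model produce more natural sentence-final intonation. Abbreviations followed by a space (such as "Dr. ") would still be split, so real textbook text may need extra rules.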
The same XTTS model that read Manav's textbooks can clone a voice from a 6-second sample. That was the offer he received: clone a celebrity's voice without her consent. Why he refused:
- Personality rights: Indian courts (Anil Kapoor v. Simply Life India, 2023) recognise a celebrity's right to control their voice and likeness. Cloning without consent invites injunctions and damages.
- IT Rules 2021 + 2023 amendment: Synthetic media impersonating real people must be clearly labelled. Defamation via deepfake is criminal under IPC §499/500 (replaced since July 2024 by §356 of the Bharatiya Nyaya Sanhita).
- DPDPA 2023: Voice is biometric/personal data. Processing it (training a clone) without consent violates the Act.
The rules he works by for any synthetic voice:
- Only synthesise voices with explicit, written consent of the speaker.
- Never imitate a real, named person without their permission.
- Always announce "This is an AI-generated voice" at the start of synthesised audio.
- Add an inaudible watermark (e.g., AudioSeal) for provenance.
- Reject any request that involves making someone "say something they didn't say".
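Some of these rules can be enforced in code before any synthesis call is made. The sketch below is one hypothetical way to do that: the `ConsentRecord` fields, the `prepare_script` helper, and the placeholder speaker details are all illustrative assumptions, not part of Coqui TTS or of any legal standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

DISCLOSURE = "This is an AI-generated voice."


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of a speaker's written consent."""
    speaker_name: str
    signed_on: date
    permitted_use: str  # e.g. "school audiobooks"


def prepare_script(text: str, purpose: str, consent: Optional[ConsentRecord]) -> str:
    """Refuse synthesis without matching consent; prepend the disclosure."""
    if consent is None:
        raise PermissionError("No written consent on file for this voice.")
    if consent.permitted_use != purpose:
        raise PermissionError(
            f"Consent covers '{consent.permitted_use}', not '{purpose}'."
        )
    return f"{DISCLOSURE} {text}"


# Placeholder values for illustration only.
teacher = ConsentRecord("Reference teacher", date(2024, 1, 1), "school audiobooks")
print(prepare_script("अध्याय एक।", "school audiobooks", teacher))
```

Calling `prepare_script` with `consent=None`, or with a purpose the consent form does not cover, raises `PermissionError` before any audio is generated.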
For the audiobook project, a teacher at the school recorded 30 minutes of speech and signed a consent form. Manav used that recording as the reference voice. The students and parents knew the voice was synthetic and consented to its use.
The other side of TTS is detection. Open-source detection and provenance tools available in 2026:
- AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks. Among the strongest open-source detectors on the ASVspoof benchmarks.
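- RawNet3: Lightweight raw-waveform model that runs on CPU. It was originally published for speaker verification; its predecessor RawNet2 is the variant commonly used as an anti-spoofing baseline.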
- AudioSeal (Meta): Watermark embedded in TTS output, recoverable even after re-recording.
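```python
# SpeechBrain >= 1.0; older releases exposed this class under speechbrain.pretrained
from speechbrain.inference.speaker import SpeakerRecognition

verification = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

# Compare suspect audio to known real samples of the speaker.
# Note: this is speaker verification, not a dedicated deepfake detector;
# a good clone can still score as the same speaker.
score, prediction = verification.verify_files(
    "real_speaker_sample.wav", "suspect_audio.wav"
)

# score is a cosine similarity between speaker embeddings, not a probability
print(f"Speaker similarity score: {score.item():.3f}")
print(f"Judged same speaker: {bool(prediction.item())}")
```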
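The score returned by `verify_files` is a cosine similarity between two speaker embeddings, compared against a threshold. A small NumPy sketch of that comparison follows. The 4-dimensional vectors are made-up illustration values (real embeddings are much larger), and the 0.25 threshold is an assumption chosen for the example.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.25) -> bool:
    """Decide 'same speaker' when similarity exceeds the assumed threshold."""
    return cosine_similarity(a, b) > threshold


# Made-up embeddings for illustration only.
real = np.array([0.9, 0.1, 0.3, 0.2])
clone = np.array([0.85, 0.15, 0.35, 0.2])    # close to the real voice
stranger = np.array([-0.2, 0.9, -0.4, 0.1])  # a different speaker

print(same_speaker(real, clone), same_speaker(real, stranger))
```

The toy clone passes the check, which is the limitation to keep in mind: a verification score says how similar two voices sound, not whether either one is synthetic. Spotting a convincing clone needs an anti-spoofing model or a watermark check.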
Detection is an arms race. The right defence is provenance (watermarks + signed metadata) plus media literacy, not detection alone.
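As a rough illustration of the "signed metadata" half of provenance, the sketch below hashes the audio bytes and signs a small manifest with HMAC-SHA256 from Python's standard library. The manifest fields, the function names, and the shared secret key are assumptions for illustration; a production system would more likely use public-key signatures and an established provenance standard.

```python
import hashlib
import hmac
import json


def sign_audio(audio_bytes: bytes, generator: str, key: bytes) -> dict:
    """Build a manifest for the audio and attach an HMAC-SHA256 signature."""
    manifest = {
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "generator": generator,
        "synthetic": True,
    }
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return manifest


def verify_audio(audio_bytes: bytes, manifest: dict, key: bytes) -> bool:
    """Check that the audio matches the manifest and the signature is valid."""
    claimed = {k: v for k, v in manifest.items() if k != "signature"}
    if claimed.get("sha256") != hashlib.sha256(audio_bytes).hexdigest():
        return False
    payload = json.dumps(claimed, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, manifest.get("signature", ""))


key = b"demo-key-not-for-production"
audio = b"\x00\x01fake-wav-bytes"  # stand-in for real WAV file contents
manifest = sign_audio(audio, "xtts_v2", key)
print(verify_audio(audio, manifest, key), verify_audio(audio + b"x", manifest, key))
```

Any change to the audio bytes or to the manifest makes verification fail. Unlike a watermark, though, this metadata travels beside the audio and is lost if the file is re-encoded or re-recorded, which is why the two approaches complement each other.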