Lesson 06 — Speech AI: TTS & Voice Cloning Ethics | Class 12

Story

Manav's Audiobook Project (and the Offer He Refused)

👨‍🎤 Manav · Lucknow · Age 17

Manav volunteered with a Lucknow school for visually impaired students. The school's textbooks weren't available as audiobooks in the Hindi-Urdu mix the students speak. He used Coqui TTS with the XTTS-v2 model to generate 80 hours of textbook audio over a weekend.

A YouTube creator then offered him ₹15,000 to clone a famous actress's voice from public interviews so the creator could "make her say whatever I want." Manav refused. The lesson covers both the technology and why he said no.

Tech

How Modern TTS Works

Modern TTS is two stages:

Text → Mel-spectrogram: A model (Tacotron2, FastSpeech2, or VITS encoder) converts text to a 2D representation of frequency over time.
Spectrogram → Audio waveform: A vocoder (HiFi-GAN, WaveRNN, or VITS decoder) converts the spectrogram to actual sound samples (commonly 22,050 Hz or 24,000 Hz audio).

VITS combines both stages into a single end-to-end model, which simplifies the pipeline and typically improves quality. Many current open-source TTS systems build on it.
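
To make the two stages concrete, the sketch below tracks only the array sizes that flow through a two-stage pipeline. The values `hop_length=256` and `n_mels=80` are common defaults assumed here for illustration, not settings taken from any of the models named above, and the frame count assumes centred framing as in typical STFT implementations.

```python
def mel_shape(duration_s: float, sample_rate: int = 22050,
              hop_length: int = 256, n_mels: int = 80) -> tuple:
    """Shape (n_mels, frames) of the spectrogram that stage 1 would produce."""
    n_samples = int(duration_s * sample_rate)
    frames = n_samples // hop_length + 1  # one frame per hop, plus the first
    return (n_mels, frames)


def vocoder_output_samples(frames: int, hop_length: int = 256) -> int:
    """Stage 2 upsamples every spectrogram frame to hop_length audio samples."""
    return frames * hop_length


shape = mel_shape(2.0)  # two seconds of speech
print("mel-spectrogram shape:", shape)
print("waveform samples:", vocoder_output_samples(shape[1]))
```

The point of the arithmetic is that the vocoder has to produce hundreds of audio samples for each spectrogram frame, which is why vocoder speed dominates latency in two-stage systems.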

System	Quality (MOS)	Latency	Indian Languages
Coqui XTTS-v2	4.2 / 5	200ms	17 langs incl. Hindi
IndicTTS (AI4Bharat)	4.0 / 5	150ms	13 Indian langs
Bark	4.1 / 5	5–8s	Multi-lingual
Google Cloud TTS	4.4 / 5	100ms	14 Indian langs
ElevenLabs	4.6 / 5	250ms	Limited Indian
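
The Quality (MOS) column is a mean opinion score: listeners rate audio samples from 1 (bad) to 5 (excellent) and the ratings are averaged. A minimal sketch of that calculation follows; the ratings are made up purely for illustration and the 1.96 factor assumes a normal approximation for the 95% interval.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must be between 1 and 5")
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return m, half_width


ratings = [4, 5, 4, 3, 5, 4, 4, 5]  # hypothetical listener scores
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, so small MOS differences between systems in a table like this should be read with caution.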

Code

Generate Hindi Audiobook with Coqui XTTS

!pip install -q TTS

from TTS.api import TTS
import os
import re

import numpy as np
import torch

# XTTS v2: multilingual, supports Hindi. Use the GPU only if one is available.
tts = TTS(
    model_name="tts_models/multilingual/multi-dataset/xtts_v2",
    gpu=torch.cuda.is_available(),
)

def text_to_audiobook(chapter_path: str, output_path: str, lang: str = "hi"):
    """Convert one chapter text file to a WAV audiobook segment."""
    with open(chapter_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Chunk by sentence to avoid memory blow-up on long text.
    # The lookbehind splits after each danda, so the punctuation is kept.
    sentences = [s.strip() for s in re.split(r"(?<=।)", text) if s.strip()]
    if not sentences:
        return  # nothing to synthesise in an empty file

    audio_segments = []
    for i, sentence in enumerate(sentences):
        wav = tts.tts(
            text=sentence,
            language=lang,
            speaker_wav="reference_voice.wav",  # clip of the consented reference voice
        )
        audio_segments.append(wav)
        if i % 10 == 0:
            print(f"  {i}/{len(sentences)} sentences synthesised")

    # Concatenate and save
    full_audio = np.concatenate(audio_segments)
    tts.synthesizer.save_wav(full_audio, output_path)

# Generate the whole textbook, one WAV file per .txt chapter
os.makedirs("./audiobook", exist_ok=True)
for chapter in sorted(os.listdir("./hindi_textbook/")):
    if not chapter.endswith(".txt"):
        continue  # skip anything that is not a chapter text file
    text_to_audiobook(
        chapter_path=f"./hindi_textbook/{chapter}",
        output_path=f"./audiobook/{chapter.replace('.txt', '.wav')}",
        lang="hi",
    )
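
The sentence split in the script can be pulled out into a small helper that is easy to test without loading a model. The sketch below keeps the danda attached to its sentence and also breaks any sentence longer than `max_chars` at a word boundary. The name `chunk_hindi_text` and the default `max_chars=200` are illustrative assumptions, not a documented XTTS limit.

```python
import re


def chunk_hindi_text(text: str, max_chars: int = 200) -> list:
    """Split after each danda (kept with its sentence), then cap chunk length."""
    sentences = [s.strip() for s in re.split(r"(?<=।)", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > max_chars and current:
                chunks.append(current)  # close the chunk before it overflows
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


sample = "यह पहला वाक्य है। यह दूसरा वाक्य है।"
for chunk in chunk_hindi_text(sample):
    print(chunk)
```

Each returned chunk can then be passed to the synthesis call one at a time, exactly as the loop in the script does with its sentences.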

Ethics

The Voice-Cloning Refusal

The same XTTS model that read Manav's textbooks can clone any voice from a 6-second sample. That's the offer he received — clone a celebrity voice without consent. Why he refused:

Legal risks (India):

Personality rights: Indian courts (for example, the Delhi High Court in Anil Kapoor v. Simply Life India, 2023) recognise a celebrity's right to control their voice and likeness. Cloning without consent invites injunctions and damages.
IT Rules 2021 + 2023 amendment: Synthetic media impersonating real people must be clearly labelled. Defamation via deepfake is a criminal offence under IPC §499/500 (now Section 356 of the Bharatiya Nyaya Sanhita, which replaced the IPC in July 2024).
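DPDPA 2023: A recording of an identifiable person's voice is personal data (and can serve as a biometric identifier). Processing it, for example to train a clone, generally requires that person's consent under the Act.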

Manav's principles for voice synthesis:

Only synthesise voices with explicit, written consent of the speaker.
Never imitate a real, named person without their permission.
Always announce "This is an AI-generated voice" at the start of synthesised audio.
Add an inaudible watermark (e.g., AudioSeal) for provenance.
Reject any request that involves making someone "say something they didn't say".
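
These principles can be enforced in code rather than left to memory. The sketch below is one hypothetical way to gate synthesis on a consent record and prepend the spoken disclosure. `ConsentRecord`, `prepare_request`, and the field names are invented for illustration and are not part of Coqui TTS or any other library.

```python
from dataclasses import dataclass
from datetime import date

DISCLOSURE = "This is an AI-generated voice."


@dataclass
class ConsentRecord:
    speaker_name: str
    written_consent: bool  # a signed form is on file
    valid_until: date      # consent can expire or be withdrawn


def prepare_request(text: str, consent, today: date) -> str:
    """Return the text to synthesise, or raise if consent is missing."""
    if consent is None or not consent.written_consent:
        raise PermissionError("No written consent from the speaker on file")
    if today > consent.valid_until:
        raise PermissionError("Consent has expired; ask the speaker again")
    return f"{DISCLOSURE} {text}"


teacher = ConsentRecord("Reference teacher", True, date(2030, 1, 1))
print(prepare_request("Chapter one.", teacher, today=date(2026, 1, 1)))
```

Raising an error instead of silently skipping the check means a request like the YouTube creator's fails before any audio is generated.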

For the audiobook project, a teacher at the school recorded 30 minutes of speech and signed a consent form. Manav used that recording as the reference voice. The students and parents knew the voice was synthetic and consented to its use.

Outcome: 80 hours of Hindi-Urdu audiobooks for 200+ visually impaired students. The teacher whose voice was used became a local celebrity. The YouTube creator went elsewhere — and Manav's principles became the school's official AI policy.

Detection

Detecting Voice Deepfakes

The other side of TTS is detection. Open-source detection and provenance tools in use as of 2026:

AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks. A strong open-source detector and a common baseline in anti-spoofing research.
RawNet3: Raw-waveform network, originally proposed for speaker verification, that can serve as a lightweight detector backbone and runs on CPU.
AudioSeal (Meta): Inaudible watermark embedded in TTS output, designed to remain detectable after common edits such as compression and added noise.

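# Speaker verification, not spoof detection: the score measures how similar
# two voices sound, so a good clone can still score high.
from speechbrain.inference.speaker import SpeakerRecognition  # SpeechBrain >= 1.0

verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Compare suspect audio to known real samples of the speaker
score, prediction = verification.verify_files(
    "real_speaker_sample.wav", "suspect_audio.wav"
)
# score is a cosine similarity, not a probability
print(f"Speaker similarity: {score.item():.3f} (same speaker: {bool(prediction.item())})")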
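
The score returned by a verification model of this kind is typically a cosine similarity between two speaker embeddings, compared against a threshold. The sketch below shows that comparison on toy three-dimensional vectors. Real embeddings have far more dimensions, and the `threshold=0.25` default is an assumption for illustration that would need tuning on real data.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.25):
    """Accept the pair as one speaker if similarity exceeds the threshold."""
    return cosine_similarity(emb_a, emb_b) > threshold


known = [1.0, 0.0, 0.0]      # toy embedding of a genuine sample
suspect = [0.0, 1.0, 0.0]    # toy embedding of a suspect clip
print(cosine_similarity(known, suspect), same_speaker(known, suspect))
```

A high similarity only says the two clips sound like the same voice. It cannot tell a genuine recording from a good clone, which is why dedicated anti-spoofing models exist.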

Detection is an arms race. The right defence is provenance (watermarks + signed metadata) plus media literacy, not detection alone.
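
Signed metadata can be illustrated with the standard library alone. The sketch below binds a small manifest to the exact audio bytes with an HMAC-SHA256 tag, so any change to the audio or the metadata is detectable. The function names and manifest fields are hypothetical, and the shared secret key is a simplification: production provenance systems such as C2PA use public-key signatures instead.

```python
import hashlib
import hmac
import json


def sign_manifest(audio_bytes: bytes, metadata: dict, key: bytes) -> dict:
    """Attach the audio hash and an HMAC-SHA256 signature to the metadata."""
    manifest = dict(metadata, sha256=hashlib.sha256(audio_bytes).hexdigest())
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return manifest


def verify_manifest(audio_bytes: bytes, manifest: dict, key: bytes) -> bool:
    """Check that neither the audio nor the metadata changed after signing."""
    claimed = dict(manifest)
    signature = claimed.pop("signature", "")
    if claimed.get("sha256") != hashlib.sha256(audio_bytes).hexdigest():
        return False  # audio was altered after signing
    payload = json.dumps(claimed, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)


key = b"demo-key-not-for-production"
audio = b"\x00\x01fake-wav-bytes"  # stand-in for real WAV file contents
manifest = sign_manifest(audio, {"generator": "XTTS-v2", "synthetic": True}, key)
print(verify_manifest(audio, manifest, key))          # untouched audio
print(verify_manifest(audio + b"x", manifest, key))   # tampered audio
```

Unlike a watermark, a manifest like this travels beside the file rather than inside the sound, so it is lost if someone re-encodes or re-records the audio. That is why the two are used together.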

📝 Check Your Understanding (8 Questions)

1. What are the two stages of modern TTS?

a) Recording and editing

b) Text → Mel-spectrogram (acoustic model like Tacotron2/FastSpeech2/VITS-encoder), then spectrogram → audio waveform (vocoder like HiFi-GAN/WaveRNN/VITS-decoder)

c) Translation and synthesis

d) Compression and encryption

2. Why does Manav refuse the celebrity voice-cloning offer?

a) The technology is too difficult and would take months

b) It violates Indian personality rights (Anil Kapoor v. Simply Life India, 2023), the IT Rules 2021 synthetic-media labelling requirement, the DPDPA 2023 (voice is biometric data), and basic ethics — making someone 'say what they didn't say' is impersonation

c) The Coqui licence forbids any voice cloning

d) The offered payment is too low

3. Why does the school audiobook project not raise the same ethical issues?

a) School projects are exempt from voice-cloning law

b) The teacher whose voice was used gave written, informed consent; students and parents knew the voice was synthetic; and the use is for an accessibility purpose, not impersonation — these conditions make the use ethical and legal

c) Audiobooks are exempt from the IT Rules synthetic media provisions

d) The teacher does not have personality rights because she is not a celebrity

4. What does VITS do differently from a Tacotron2 + HiFi-GAN pipeline?

a) VITS only supports English

b) VITS combines acoustic model and vocoder into a single end-to-end network trained jointly with a variational objective — simpler pipeline, fewer cascading errors, and typically higher quality

c) VITS uses transformer-only architecture without convolutions

d) VITS requires 10× more training data than Tacotron2

5. Why is detection alone an insufficient defence against voice deepfakes?

a) Detection models are illegal to deploy in India

b) Detection is an arms race — every new detector quickly faces a generator that defeats it; the right defence stack is provenance (watermarks + signed metadata) + clear labelling + media literacy education, with detection as one layer

c) Detection requires more GPU memory than generation

d) Detection only works on English audio

6. What is AudioSeal and why does it matter?

a) It is a hardware device that physically marks audio recordings

b) An open-source watermarking method by Meta that embeds an inaudible watermark in TTS output; the watermark is designed to survive common edits such as MP3 compression, providing provenance evidence that the audio was AI-generated

c) It encrypts audio so only authorised users can play it

d) It is the legal name of India's anti-deepfake law

7. Why does Manav chunk the textbook by sentence (split on '।') before synthesis?

a) The Coqui API rejects inputs longer than 100 characters

b) Long text inputs cause GPU OOM and quality degradation in TTS models; chunking at sentence boundaries (the Devanagari full stop '।') gives natural prosody breaks and fits each chunk in memory

c) It allows the audio to be played in parallel

d) It is required to convert Devanagari to ASCII

8. What single principle from Manav's checklist should every student building voice-AI systems adopt?

a) Always use the highest-quality voice available

b) Only synthesise voices with explicit, written consent from the speaker — and never imitate a real, named person without their permission. This single rule prevents almost every harmful voice-AI use case

c) Always use a paid TTS service rather than open-source

d) Always upload generated audio to a public server for transparency
