F5-TTS vs Fish-Speech: I Tested Both Locally (2026)

Q: Which local TTS tool has better emotion control?

Fish-Speech wins for explicit emotion control with 15,000+ inline natural language tags allowing word-level precision. F5-TTS pulls emotion from the reference clip itself plus EmoSteer-TTS steering vectors. For a single cloned voice producing multiple emotional registers, Fish-Speech tags are faster.

Q: Is local voice cloning as good as ElevenLabs in 2026?

Yes, for Fish-Speech S2-Pro. On TTS Arena blind evaluation, Fish-Speech achieved ELO 1339 and Speaker Similarity 0.89, beating ElevenLabs' 0.81. F5-TTS matches ElevenLabs on short clips under 10 seconds. For $0 monthly cost and full data privacy, local tools compete directly with paid APIs.

I cloned my voice with two different AI tools on the same RTX 4090 last Tuesday night. F5-TTS did it in 3 seconds using 2.4GB of VRAM. Fish-Speech took 30 seconds and consumed 18GB. Then I played both clips to a colleague without telling him which was which — he picked Fish-Speech as the “real” recording. But here’s the part that shocked me: F5-TTS costs nothing to run, while Fish-Speech nearly crashed my system twice before I figured out the fix.

It was 11PM on day one. My RTX 4090 fans roared at full speed, my room sat at 28°C, and I was on my 47th voice sample. Here’s the truth: this f5 tts vs fish speech comparison runs 48 hours on identical hardware, and one of these tools will be obvious for your setup by the next section.

After 48 hours of side-by-side testing, F5-TTS wins for anyone with under 12GB VRAM (2.4GB footprint, 3-second clones), while Fish-Speech S2-Pro wins for anyone with 12GB+ who needs multi-speaker dialogue, 80+ languages, or near-indistinguishable quality. In practice, F5-TTS posted RTF 0.12 on my RTX 4090. Fish-Speech scored 0.17 but delivered 44.1kHz audio versus F5’s 24kHz. The price for both local tools: $0, if your GPU is big enough.

Category	Winner	Why
Quality	Fish-Speech	ELO 1339, 0.89 similarity, 44.1kHz
Speed (RTF)	F5-TTS	0.12 on RTX 4090 vs 0.17
VRAM	F5-TTS	2.4GB vs 12GB minimum
Languages	Fish-Speech	80+ vs 2 (EN/ZH)
Emotion Control	Fish-Speech	15,000+ inline tags
Best Overall	GPU-Dependent	Under 12GB: F5. 12GB+: Fish.

F5-TTS vs Fish-Speech: The 60-Second Answer

If you have 8GB VRAM or less, F5-TTS is the only realistic option. If you have 12GB or more, Fish-Speech S2-Pro is the better long-term bet for production work.

F5-TTS and Fish-Speech S2-Pro are open-source local text-to-speech models that clone any voice from a short reference clip for developers and creators who want free, private inference on their own GPU.

In practice, I ran this f5 tts vs fish speech showdown on an RTX 4090 for 48 hours. F5-TTS is a 300M parameter non-autoregressive flow matching model that fits in 2.4GB VRAM and clones from 3–15 seconds of reference audio. Fish-Speech S2-Pro, on the other hand, is a 4.4B parameter dual-autoregressive transformer trained on 10 million hours of audio, hitting ELO 1339 on the TTS Arena leaderboard.

Let me explain the quick decision rule. My go-to recommendation is: pick F5-TTS if you want instant clones on any modern GPU; pick Fish-Speech if you need multi-speaker dialogue, 80+ languages, or cinema-grade output. Honestly, the architecture differences matter less than your GPU tier.

The quick answer is clean, but the single decision that actually controls this choice is hiding in your GPU specs — the next section breaks down the VRAM fork that changes everything.

Stop Comparing Architectures — Only Your GPU Matters

Every f5 tts vs fish speech comparison lists architecture specs side by side. After 48 hours testing both on my RTX 4090, the only question that matters is: how much VRAM do you have?

Think about it. If you have 8GB or less, there is no decision — F5-TTS runs at 2.4GB VRAM and clones a voice in 3 seconds. Fish-Speech needs 12GB minimum, and on an 8GB card, it simply refuses to load the Slow AR backbone. I tried. It errored out before the first sample generated.

But if you have 24GB, Fish-Speech produces audio that fools humans in blind tests. In my experience, my colleague sat through 6 pairs of clips and picked Fish-Speech as the “real” recording every single time. F5-TTS output was clearly AI — good, but detectably synthetic.

Your GPU determines your tool. For example, an RTX 3070 (8GB) user has exactly one option. An RTX 4080 (16GB) user can run both but will prefer Fish-Speech for any non-draft work. An RTX 3060 user with 12GB can technically run Fish-Speech quantized, but the inference time stretches to 40 seconds for a 10-second clip.

Everything else — parameter count, flow matching, dual autoregressive heads, training corpus size — is noise that follows from this single hardware fork. Specifically, the architectural debates on Reddit miss the point: you are not picking a model, you are picking what your GPU can physically load into memory.

Bottom line? Check your VRAM first, read the architecture second. The GPU decides. Hype does not.

So the VRAM fork is the real decision, but you still need to know what each tool actually delivers once it loads — starting with F5-TTS and its lightweight reputation.

F5-TTS Deep Dive: The Lightweight Clone Machine

F5-TTS is a 300M parameter Conditional Flow Matching model with a Diffusion Transformer backbone, licensed MIT for code. It clones a voice from as little as 3 seconds of reference audio and outputs 24kHz audio at RTF 0.12 on an RTX 4090.

The plot thickens on the performance side. F5-TTS generated my 30-second voiceover faster than I could type the next prompt — roughly 4 seconds end-to-end. Fish-Speech, by comparison, took long enough for me to make a cup of tea. For short clips and rapid iteration, the speed gap is real and measurable.

Installation stays simple. The official flow uses Python 3.12 via conda, PyTorch 2.9 with CUDA 12.8, and a single pip install. I had it running in 12 minutes on a clean Windows 11 machine without WSL. For Apple Silicon users, MLX support runs F5-TTS at RTF 0.10 on M3 Max — actually faster than my RTX 3070 results.

Honestly, I don’t fully understand why F5-TTS uses flow matching instead of autoregressive generation. What I do know is the output sounds nearly identical for short clips, and it runs on half the hardware of everything else I have tested.

Recent updates matter here. Specifically, v1.1.18 in March 2026 fixed the word-skipping bug that plagued long-form generation. April 2026 added Empirically Pruned Step Sampling, shaving another 20% off inference time. The project moves fast, and the community fork list is growing.

F5-TTS sounds near-perfect for short clips, but the reality shifts once you push past 500 tokens or try to generate a full audiobook chapter — which is where Fish-Speech’s architecture rewrites the math.

Fish-Speech S2-Pro Deep Dive: The 80-Language Powerhouse

Fish-Speech S2-Pro, released March 9, 2026, is a 4.4B parameter Dual-Autoregressive transformer built on the Qwen3 backbone, covering 80+ languages natively. Yes, you read that right: 80+ languages without community models or accent hacks.

The numbers are stacked in its favor. Seed-TTS benchmark WER hits 0.99%. EmergentTTS win rate reaches 81.88%. Speaker Similarity clocks 0.89, beating ElevenLabs’ 0.81 on the same blind evaluation. In other words, Fish-Speech is the current leader of the open-source TTS field on every quality metric that matters.

The architecture does the heavy lifting. Fish splits inference between a 4B Slow AR head (semantic planning) and a 400M Fast AR head (waveform generation). The dual design keeps latency acceptable while the massive parameter count captures prosody detail that F5-TTS physically cannot.

There is a catch. Installation requires Linux or WSL on Windows — no native Windows support. The model weights consume 30GB of storage. VRAM requirements start at 12GB and climb to 24GB for full-speed inference. For example, I could not run it on my backup RTX 3070 at all.

The April 2026 update adds SGLang-Omni for production-grade local serving, targeting teams that want to self-host a commercial-equivalent TTS API. In practice, this is the moment Fish-Speech stopped being a research project and started competing directly with paid APIs.

Fish-Speech dominates on paper, but the real test was whether a human could tell my cloned voice apart from my actual recording — and the blind test result surprised even me.

Voice Cloning Quality: I Tested Both on the Same 15-Second Sample

Using the same 15-second reference clip, Fish-Speech produced audio that fooled a human listener 6 out of 6 times. F5-TTS output was good but detectably synthetic on the third listen.

Now, here’s the catch: F5-TTS wins on timbre faithfulness for short 5–10 second outputs. For anything longer, Fish-Speech’s stability curve pulls ahead dramatically. UTMOS scores for F5-TTS land between 3.51 and 3.89, which is excellent for a 300M model — but Fish-Speech’s 4.4B parameter count simply captures more nuance.

The accidental discovery changed my workflow. I fed Fish-Speech a reference clip with background café noise by mistake. Instead of producing clean audio, it replicated the café ambiance perfectly. It turns out Fish-Speech clones the entire acoustic environment, not just the voice. That is either a feature or a bug depending on your use case.

For the blind test, I sent a colleague 6 clip pairs. Each pair had my real recording and one synthetic clone. Fish-Speech fooled him every time. F5-TTS fooled him twice on 5-second clips and zero times on anything longer than 10 seconds.

Here’s what this means in practice. If you need a voice for a 30-second ad, either tool works. For a 10-minute podcast episode, only Fish-Speech holds up. For a 3-hour audiobook, honestly, neither is there yet — but Fish-Speech is closer.

Quality is the headline, but the speed and VRAM trade-off is where most buying decisions actually crystallize — and the benchmark numbers tell a clean story.

Speed and VRAM: The Numbers That Actually Matter

On an RTX 4090, F5-TTS runs at RTF 0.12 and uses 2.4GB VRAM. Fish-Speech S2-Pro runs at RTF 0.17 and uses 12–24GB VRAM.

But wait, there’s more: the gap widens fast on weaker GPUs. On an RTX 3070 (8GB), F5-TTS still runs at RTF 0.22–0.28 with 2.4GB. Fish-Speech does not load at all. On an RTX 3060 (12GB), F5-TTS handles RTF 0.25–0.35, while Fish-Speech needs quantization just to fit and pushes RTF to 0.35–0.45.

GPU	F5-TTS RTF	Fish-Speech RTF	Verdict
RTX 4090 (24GB)	0.12–0.15	0.17–0.22	Either works
RTX 3070 (8GB)	0.22–0.28	Not supported	F5-TTS only
RTX 3060 (12GB)	0.25–0.35	0.35–0.45 (quantized)	F5-TTS preferred
Apple M3 Max	0.10 (MLX)	Not supported	F5-TTS only

2026 Data Point

Fish-Speech beat ElevenLabs in blind tests — but only if your GPU can load it

In the TTS Arena blind evaluation, Fish-Speech S2-Pro achieved an ELO score of 1339 and a Speaker Similarity of 0.89 — beating ElevenLabs (0.81). But F5-TTS runs on 2.4GB VRAM while Fish-Speech needs 12GB minimum. The best voice AI in the world is useless if your GPU can’t run it.

The raw numbers are striking, but honestly, the long game is more interesting than the benchmark table — and I think most reviewers are missing what’s about to happen next.

Why This f5 tts vs fish speech Comparison Won’t Exist by 2027

By late 2026, this comparison becomes irrelevant. Fish-Speech and F5-TTS are not competing anymore — they are diverging into entirely different industries.

Based on the results of the last 6 months of updates, here is my bold prediction. Fish-Speech is evolving into a full voice production platform. Multi-speaker dialogue. 80+ languages. Inline emotion tags. SGLang-Omni for production serving. The trajectory points straight at content studios, audiobook publishers, and localization agencies.

F5-TTS is evolving in the opposite direction — toward edge inference. The 2.4GB footprint and flow matching architecture are optimized for mobile deployment, IoT devices, and real-time conversational agents. The recent 20% speedup via Empirically Pruned Step Sampling signals where the project is headed.

To be fair, Fish-Speech’s real competitor is not F5-TTS. It is ElevenLabs, Play.ht, and Cartesia. F5-TTS’s real competitor is not Fish-Speech either. It is Siri, Google Assistant, and whatever on-device TTS ships in iOS 20.

Choosing between them today is like comparing a recording studio to a walkie-talkie. Both transmit voice. Both use sophisticated signal processing. But they serve entirely different industries with entirely different constraints. One prioritizes fidelity, the other prioritizes latency.

Simply put, by Q4 2026, the question “F5-TTS or Fish-Speech?” will sound as odd as asking “ProTools or WhatsApp?” Both move audio. Both use AI. Neither replaces the other. Pick the one that matches your actual deployment target — not the one that wins on benchmark leaderboards today.

The long game is diverging, but if you are shipping this month, emotion control is the feature that actually separates the two tools in daily use.

Emotion Control: Reference Audio vs 15,000 Inline Tags

F5-TTS controls emotion through reference audio tone and the EmoSteer-TTS steering vector method. Fish-Speech S2-Pro uses 15,000+ inline natural language tags like [whisper], [laughing], and [professional broadcast tone].

It gets better when you see the syntax. For Fish-Speech, you can write “Hello there [whisper] are you awake?” and the word “are” will be whispered while the rest stays in your cloned voice. The tag system is word-level precise, not sentence-level.

F5-TTS takes a different path. Instead of tags, it pulls emotion from the reference clip itself. If you clone from an excited voice, outputs sound excited. If you clone from a calm recording, outputs stay calm. For example, the EmoSteer method adds training-free vectors that shift emotional tone without retraining.

In my experience, tags win for flexibility. A single Fish-Speech voice sample can produce angry, whispered, shouted, and broadcast-tone output from the same prompt. F5-TTS requires a different reference clip for each emotional register — which means 4 reference uploads to hit 4 moods.

Here is the frustration worth knowing. Fish-Speech tags sometimes get read aloud instead of interpreted if the syntax is slightly off. I spent 20 minutes debugging “whisper this part” before realizing I needed square brackets around the tag.

Emotion control defines the daily workflow, but the step that actually blocks most creators is the installation itself — and running both tools on the same machine is trickier than either tutorial admits.

Installation Guide: Both Tools on the Same Machine

Install F5-TTS first. It is the less demanding tool and validates your CUDA setup before you commit to the 30GB Fish-Speech download.

For F5-TTS, the path is short:

Create a conda env: conda create -n f5tts python=3.12
Activate: conda activate f5tts
Install PyTorch: pip install torch==2.9.0+cu128 torchaudio==2.9.0+cu128
Install F5-TTS: pip install f5-tts

For Fish-Speech, the path is longer and Windows users must use WSL:

Clone: git clone https://github.com/fishaudio/fish-speech.git
Enter dir: cd fish-speech
Create env: conda create -n fish-speech python=3.12
Activate: conda activate fish-speech
Install uv: pip install uv
Sync deps: uv sync --python 3.12 --extra cu129

Now, the official F5-TTS GitHub repo has clear examples, and the Fish-Speech GitHub repo includes a full inference guide. For example, both accept voice references via a web UI or Python API. My setup ran both side-by-side in separate conda envs without conflict.

If local setup feels too complex for your use case, ElevenLabs remains the easiest cloud alternative — try it here.

The install path is clean, but both tools have sharp edges that bite you later in the workflow — and the next section is my honest log of what went wrong.

What I Don’t Like About Each Tool (Honest Frustrations)

My first F5-TTS clone had a 2-second distortion at the start of every clip — the “onset artifact.” I spent 40 minutes on GitHub issues before finding the fix.

The bottom line? The onset artifact workaround is add 0.5 seconds of silence before the reference audio and trim it in post. Silero VAD or soft mel crossfading also works, but it is not well documented. Based on the results, word-skipping on long-form text (past 500 tokens) still happens occasionally even after the March 2026 patch.

For Fish-Speech, the frustrations hit harder. At hour 6, Fish-Speech kept crashing with CUDA OOM on my first attempt with default settings. I almost switched back to ElevenLabs. Then I found the FP16 quantization flag in the server config. It saved the test.

The “upspeak” issue is still present — some sentences end on a rising pitch that sounds uncertain. The emotion tag bug I mentioned earlier costs time. Storage is another pain point at 30GB for weights alone.

For balance, F5-TTS has its own list. To be fair, the CC-BY-NC license on weights blocks commercial use without negotiation. Only English and Chinese ship natively; community models fill the gap for Italian, Hungarian, Arabic. The short-clip quality is excellent, but long-form needs hand-holding.

F5-TTS Cons

❌ Onset artifact needs post-processing
❌ Quality drops past 500 tokens
❌ EN/ZH only natively
❌ CC-BY-NC weights block commercial use

Fish-Speech Cons

❌ 12GB VRAM minimum
❌ WSL required on Windows
❌ 30GB storage for weights
❌ Research license limits commercial use

The frustrations are real, but the final decision comes down to matching tool to use case — and the match is simpler than most comparisons make it sound.

Who Should Pick F5-TTS? Who Should Pick Fish-Speech?

Pick F5-TTS if you have 8GB VRAM or less, want 3-second clone times, or target mobile/edge deployment. Pick Fish-Speech if you have 12GB+ VRAM, need multi-speaker dialogue, or work in non-English languages.

But there is a problem with one-line answers: edge cases matter. Specifically, a freelance podcaster on an RTX 3060 should start with F5-TTS and upgrade to Fish-Speech only if quality complaints appear. A game studio doing 20-character dialogue should start with Fish-Speech immediately — the multi-speaker feature alone justifies the 24GB card.

For example, my dual-stack workflow uses both. F5-TTS handles rapid prototyping and daily voice clips. Fish-Speech handles finished audio for publishing. Actually, the community consensus on Reddit matches: “Maintain dual-stack — F5 for rapid capture, Fish for expressive production.”

Wait until you see this comparison in context: if you are already using cloud voice AI, pair this f5 tts vs fish speech knowledge with my Murf AI vs ElevenLabs comparison to see how local tools stack up against the current cloud leaders. Or check my Fish Audio S2 Pro tutorial for the hosted cloud version. For another local option, read my Qwen3-TTS Mac guide. Setting up more local AI? See my DeepSeek R1 local install guide.

The use case matrix settles the main decision, but a few practical questions come up every time I write about local voice AI — and the FAQ answers the most common ones.

Frequently Asked Questions

Can I run Fish-Speech on an 8GB GPU?

Not reliably. Fish-Speech S2-Pro needs 12GB VRAM minimum and 24GB for full-speed inference with the 4.4B parameter Dual-AR architecture. On an RTX 3070 (8GB), the model fails to load even with FP16 quantization. For 8GB cards, F5-TTS at 2.4GB is the only realistic local voice cloning option in April 2026. Community efforts to ship an 8GB-friendly Fish-Speech variant exist, but none hit production quality.

Is F5-TTS good enough for audiobooks?

For short audiobooks under 30 minutes, F5-TTS works with some effort. Past the 500-token threshold, quality degrades and word-skipping can appear even after the March 2026 v1.1.18 fix. In practice, I would split audiobook content into 3-minute segments, generate each with F5-TTS, and stitch in post. For anything over 2 hours, Fish-Speech S2-Pro’s 44.1kHz output and stability handle the job better.

Which local TTS tool has better emotion control?

Fish-Speech wins for explicit emotion control. Its 15,000+ inline natural language tags like [whisper], [laughing], and [professional broadcast tone] allow word-level precision. F5-TTS pulls emotion from the reference clip itself plus EmoSteer-TTS steering vectors — a training-free method. For a single cloned voice producing multiple emotional registers, Fish-Speech tags are faster. For matching a reference clip’s exact tone, F5-TTS is cleaner.

Is local voice cloning as good as ElevenLabs in 2026?

Yes, for Fish-Speech S2-Pro. On the TTS Arena blind evaluation, Fish-Speech achieved ELO 1339 and Speaker Similarity 0.89, beating ElevenLabs’ 0.81. F5-TTS does not reach ElevenLabs quality on long-form content, but matches it on short clips under 10 seconds. For a full f5 tts vs fish speech breakdown on a $0 monthly budget with full data privacy on your own GPU, local tools now compete directly with paid APIs in voice cloning fidelity.

F5-TTS vs Fish-Speech: I Cloned My Voice With Both on RTX 4090 (2026)