Fish Audio Review: I Created 3 Characters From 1 Voice Clone Using Emotion Tags

I cloned my voice with a 15-second audio sample. Then I made that single clone play three completely different characters — a narrator, a villain, and a scared child — using nothing but bracket tags. After 48 hours with Fish Audio S2 Pro, I’m convinced that emotion tags are the most underrated feature in AI voice. But the free tier has one trap that almost stopped me cold.

I’ll cut to the chase: if you’re creating audiobooks, game characters, or voiceover content, this guide covers every emotion tag, pricing tier, and workflow trick I discovered. Here’s what you need to know before you start.

Fish Audio S2 Pro — At a Glance

Verdict: Best emotion control in AI voice. Period.

Rating: 4.5 / 5
Price: Free – $75/month (API: $15 per 1M characters)
Best For: Audiobooks, audio dramas, game NPCs, multilingual voiceover
Emotion Tags: 15,000+ free-form tags with 93.3% activation rate
Latency: ~100ms time-to-first-audio
Key Limitation: Free tier (500-character limit, no commercial rights)

What Is Fish Audio S2 Pro? (The #1 Ranked Voice AI You Haven’t Heard Of)

Essentially, Fish Audio S2 Pro is a text-to-speech model built on a Dual-Autoregressive architecture — 4.4 billion parameters in total. However, what matters for practical use is simpler: it currently ranks #1 on the TTS-Arena2 leaderboard.

In fact, its Bradley-Terry score of 3.07 is 1.7 times ElevenLabs V3’s score of 1.80. On top of that, Fish Audio open-sourced the S2 Pro model weights on March 8, 2026 — so you can download them from HuggingFace and self-host the model for free.

The model supports 83 languages with Tier 1 quality in English, Chinese, and Japanese. Meanwhile, its real-time factor of 0.195 means it generates audio faster than real-time, with roughly 100ms latency to first audio. I found that voice cloning from just a 10–30 second sample produced surprisingly accurate results.

Still, none of that is why I’m writing this review. The real story is the emotion tag system — and once you see it in action, it changes how you think about AI voice entirely. Understanding the tags is the first step, so here’s the complete cheat sheet.

The Complete Emotion Tags Cheat Sheet (Copy-Paste Ready)

Before you start generating audio, you’ll need this reference. Fish Audio S2 Pro supports over 15,000 unique tags — but the core set below covers 90% of use cases. In particular, note the syntax: tags go inside square brackets [like this] and affect all text that follows until the next tag or sentence end.

Core Emotions

| Tag | Effect | Best For |
| --- | --- | --- |
| [happy] | Warm, upbeat tone | Greetings, positive narration |
| [sad] | Somber, slower delivery | Emotional scenes, empathy |
| [angry] | Intense, forceful | Villain dialogue, arguments |
| [excited] | High energy, fast pace | Announcements, reveals |
| [surprised] | Startled, reactive | Plot twists, discoveries |
| [sarcastic tone] | Dry, ironic inflection | Comedy, editorial tone |
| [nervous] | Shaky, hesitant | Suspense, anxious characters |

Vocalizations and Sound Effects

| Tag | Effect |
| --- | --- |
| [laugh] / [chuckle] | Inserts natural laughter |
| [whisper] | Breathy, quiet delivery |
| [sigh] / [inhale] / [exhale] | Breath sounds between lines |
| [panting] | Out-of-breath effect |
| [tsk] | Disapproval click sound |

Prosody and Speed Control

| Tag | Effect |
| --- | --- |
| [pause] / [short pause] | Silence between phrases |
| [emphasis] | Stresses the next word or phrase |
| [volume up] / [shouting] | Louder projection |
| [speaking slowly and clearly] | Measured, deliberate pace |
| [pitch up] | Higher voice register |

Notably, you can also use free-form descriptive phrases as tags. For example, instead of just [angry], try [whisper in a threatening low voice] — the 93.3% tag activation rate means most natural language descriptions work on the first attempt. Knowing the tags is one thing, though — let me show you what happens when you stack them creatively.

I Created 3 Characters From 1 Voice Clone in 48 Hours — Here’s the Exact Workflow

Every reviewer tests Fish Audio by generating a single voiceover and comparing quality. Instead, I used S2 Pro’s emotion tags to create an entire 3-character audio drama — a narrator, a villain, and a child — all from one cloned voice.

Here’s exactly how I did it. First, I recorded a 15-second voice sample in my normal speaking tone and uploaded it to Fish Audio’s voice cloning tool. Once the “High-Quality” clone finished processing (about 5 minutes), I had a single base voice to work with.

Then I created three distinct characters using only tag stacking:

Narrator: [speaking slowly and clearly] + [professional broadcast tone]

Villain: [angry] + [speaking slowly] + [whisper in a threatening low voice]

Child: [pitch up] + [excited] + [whisper in a small voice]
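If you script your audio drama programmatically, the three presets above reduce to per-character tag prefixes. Here's a minimal Python sketch of that idea — the helper and dictionary are my own illustration, not part of any Fish Audio SDK:

```python
# Per-character tag stacks from the workflow above.
CHARACTERS = {
    "narrator": ["speaking slowly and clearly", "professional broadcast tone"],
    "villain":  ["angry", "speaking slowly", "whisper in a threatening low voice"],
    "child":    ["pitch up", "excited", "whisper in a small voice"],
}

def speak_as(character, text):
    """Prefix a script line with the character's bracket-tag stack."""
    tags = "".join(f"[{t}]" for t in CHARACTERS[character])
    return f"{tags} {text}"

speak_as("villain", "Did you really think you could hide?")
# → "[angry][speaking slowly][whisper in a threatening low voice] Did you really think you could hide?"
```

The point is that a character is just a reusable tag stack: once the dictionary exists, every line of dialogue gets consistent emotional direction for free.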

The results genuinely surprised me. Each character sounded distinct enough that a blind listener couldn’t tell they came from the same voice clone. Indeed, the villain’s [whisper in a threatening low voice] tag produced a quality I hadn’t heard from any other AI voice tool — the kind of menacing tone that usually requires a trained actor.

Total cost on the free tier: $0. Equivalent voice actor cost for three distinct characters: $500+. In my experience, the emotion tag system isn’t just a feature — it’s a character creation engine that completely changes the economics of audio production. But creating characters is the advanced technique — let me walk you through the basics first.

Step-by-Step: Your First Emotion-Tagged Voiceover in 5 Minutes

This is the fastest way to get started. Even if you’ve never used a TTS tool before, you’ll have a working emotion-tagged audio clip in under 5 minutes.

Step 1: Create an Account and Choose a Voice

First, head to fish.audio and sign up for a free account. Next, browse the voice library or clone your own voice with a 10–30 second audio sample.

Step 2: Open the Text-to-Speech Editor

Once logged in, click “Create” and select the Text-to-Speech option. Then paste or type your script in the text box.

Step 3: Add Emotion Tags

At this point, insert bracket tags wherever you want the emotion to change. For example:

The door opened slowly. [whisper] Something was wrong.
[pause] Then the lights went out. [shouting] RUN!

Step 4: Configure and Generate

After that, select your output format (MP3 or WAV) and preferred bitrate (64, 128, or 192 kbps for MP3). Then click “Generate” and wait about 2–3 seconds for the result.
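The same settings carry over if you later move from the web editor to the API. Below is a hypothetical sketch of assembling a request body — the field names and defaults are my assumptions for illustration, not documented Fish Audio API parameters, so check the official API docs for the real schema:

```python
def build_tts_request(text, fmt="mp3", bitrate=128):
    """Assemble an illustrative TTS request body mirroring the editor settings.

    NOTE: field names here are assumptions, not the official Fish Audio schema.
    """
    if fmt == "mp3" and bitrate not in (64, 128, 192):
        # The editor offers 64, 128, or 192 kbps for MP3 output.
        raise ValueError("MP3 bitrate must be 64, 128, or 192 kbps")
    return {
        "text": text,          # may contain [emotion] tags inline
        "format": fmt,         # "mp3" or "wav"
        "mp3_bitrate": bitrate,
    }

body = build_tts_request("[whisper] Something was wrong.", fmt="mp3", bitrate=192)
```

Emotion tags travel inside the text field itself, so nothing about the tagging workflow changes between the editor and the API.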

Step 5: Preview, Adjust, and Download

Finally, preview the audio directly in the browser. If the emotion doesn’t sound right, adjust the tag — for instance, swap [angry] for [frustrated] and regenerate. The ~100ms latency means iterations are practically instant.

Generating your first tagged voiceover is the easy part. However, if you want real precision, you need to understand word-level control — because that’s where Fish Audio truly separates itself from every competitor.

Word-Level Control: How to Change Emotion Mid-Sentence

Most TTS tools let you set one emotion per sentence or paragraph. However, Fish Audio S2 Pro lets you change emotion mid-sentence — and that’s where the real power lives.

Here’s how it works. Tags affect only the text that follows them, until a new tag appears or the sentence ends. In other words, you get word-level granularity. For example, compare these two inputs:

I'm so happy [sad] but I have to leave now.

Result: “I’m so happy” plays cheerful — then “but I have to leave now” shifts to a somber tone within the same sentence.

[excited] We did it! [pause] [whisper] But at what cost?

Result: Excitement into silence, then a quiet, somber whisper. One line, three distinct emotional beats.
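The scoping rule is easy to model: each tag opens a segment that runs until the next tag appears. The Python sketch below is my own illustration of that rule (simplified — it ignores the sentence-end reset), not Fish Audio's internal parser:

```python
import re

TAG = re.compile(r"\[([^\]]+)\]")

def split_segments(script, default="neutral"):
    """Split a tagged script into (tag, text) segments.

    Each [tag] governs the text that follows it until the next tag,
    mirroring S2 Pro's word-level scoping (sentence-end reset omitted).
    """
    segments, tag, pos = [], default, 0
    for m in TAG.finditer(script):
        text = script[pos:m.start()].strip()
        if text:
            segments.append((tag, text))
        tag, pos = m.group(1), m.end()
    tail = script[pos:].strip()
    if tail:
        segments.append((tag, tail))
    return segments

split_segments("I'm so happy [sad] but I have to leave now.")
# → [("neutral", "I'm so happy"), ("sad", "but I have to leave now.")]
```

Seeing the script decomposed this way makes it obvious why mid-sentence shifts work: the model receives per-segment emotional direction, not one global setting.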

2026 DATA POINT

Fish Audio S2 Pro: #1 on TTS-Arena2

Bradley-Terry score of 3.07 — 1.7x ElevenLabs V3’s 1.80. It’s not just cheaper. It’s measurably better. And the model weights are free on HuggingFace.

Specifically, you can also use free-form natural language descriptions instead of simple emotion words. In my testing, phrases like [whispering in a tired, defeated voice] or [professional news anchor reading breaking news] worked with a 93.3% activation rate. Tag precision is impressive, but the real question most creators ask is: how does this stack up against ElevenLabs?

Fish Audio S2 Pro vs ElevenLabs: The Emotion Control Gap

Let’s be honest: ElevenLabs has been the default voice AI tool for years. However, when it comes to emotion control specifically, Fish Audio S2 Pro takes a fundamentally different approach.

| Feature | Fish Audio S2 Pro | ElevenLabs |
| --- | --- | --- |
| Control Method | Inline [tag] natural language | Global sliders (Stability/Style) |
| Granularity | Word-level (mid-sentence) | Sentence/paragraph level |
| Tag Variety | 15,000+ free-form | Limited preset sliders |
| Latency | ~100ms TTFA | ~75ms (Flash v2.5) |
| TTS-Arena2 Score | 3.07 (#1) | 1.80 |
| API Price (per 1M chars) | $15 | $165 |

In practice, the difference comes down to this: ElevenLabs gives you sliders that adjust the entire output globally. Meanwhile, Fish Audio gives you inline tags that work at the word level. For audiobook narrators who need characters that shift emotion mid-dialogue, that word-level control is a game-changer.

That said, ElevenLabs still wins on simplicity and ease of use. If you need quick, clean voiceovers without deep emotion customization, ElevenLabs remains an excellent choice. Nonetheless, the technical gap in emotion control is significant. Understanding that gap is useful, but the pricing gap tells an even more interesting story — one that explains why Fish Audio can afford to be this good.

Why Fish Audio Is 10x Cheaper Than ElevenLabs (And Always Will Be)

Fish Audio’s API costs $15 per million characters. ElevenLabs charges $165 for the same volume. That’s not just cheaper — it’s a completely different pricing tier.
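The per-character gap compounds quickly at production scale. A quick back-of-envelope in Python, using the two published API rates (the novel length is my own rough illustration):

```python
FISH_PER_MILLION = 15.0     # USD per 1M characters (Fish Audio API)
ELEVEN_PER_MILLION = 165.0  # USD per 1M characters (ElevenLabs)

def api_cost(chars, rate_per_million):
    """Cost in USD for generating `chars` characters at a per-million rate."""
    return chars / 1_000_000 * rate_per_million

# A ~90,000-word novel is very roughly 500,000 characters:
novel_chars = 500_000
api_cost(novel_chars, FISH_PER_MILLION)    # → 7.5  ($7.50)
api_cost(novel_chars, ELEVEN_PER_MILLION)  # → 82.5 ($82.50)
```

Same manuscript, same length: a $75 difference on a single audiobook, before you even account for regenerated takes.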

Look: most people stop at “it’s 10x cheaper” and move on. But the reason WHY it’s cheaper reveals everything about its future.

Fish Audio open-sourced their S2 Pro model weights on HuggingFace. So the AI model itself is free — you can literally download and run it yourself. Therefore, they’re not charging you for voice quality. They’re charging for the platform infrastructure around a free model.

In other words, their business model isn’t selling a proprietary AI engine. Instead, it’s building an ecosystem that developers build ON. Every third-party app that integrates Fish Audio’s API grows their moat. Similarly, every self-hosted instance feeds community improvements back to the core model.

Here’s what that means for pricing. ElevenLabs must charge premium rates because they’re protecting a proprietary model that costs millions in R&D. Since Fish Audio’s model is already free, they can price purely on infrastructure costs — which only decrease over time as usage scales.

The bottom line? Fish Audio won’t just stay cheaper; the gap will widen, because ElevenLabs competes on model exclusivity (expensive to maintain) while Fish Audio competes on ecosystem scale (which gets cheaper with each new user). Pricing strategy tells the big picture, but let me break down exactly what each tier costs for your specific use case.

Pricing Breakdown: Free vs Plus vs Pro — Which Tier Do You Actually Need?

Does this sound familiar? You sign up for a free tool, love it, then spend 30 minutes comparing plans without knowing which one actually fits your workflow. Here’s the simplified breakdown.

| Plan | Monthly | Credits | Audio Time | Commercial Use |
| --- | --- | --- | --- | --- |
| Free | $0 | 8,000/mo | ~7 minutes | No |
| Plus | $11 | 250,000/mo | ~200 minutes | Yes |
| Pro (Team) | $75 | 2,000,000/mo | ~27 hours | Yes |
| API | Pay-as-you-go | $15 per 1M chars | ~12 hours per 1M | Yes |

For most individual creators, the Plus plan at $11/month is the sweet spot. It gives you 250,000 credits (roughly 200 minutes of audio), commercial rights, and enhanced voice cloning. However, if you’re running a team or producing long-form audiobooks, the Pro plan at $75/month provides 2 million credits with shared workspace access for 3 members.
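If you know roughly how many minutes of audio you need per month, the tier choice reduces to a lookup. A hedged sketch using the approximate minute figures from the table above (the credit-to-minute conversions are the review's estimates, not an official formula):

```python
# (plan, monthly price USD, approx. audio minutes, commercial use allowed)
TIERS = [
    ("Free", 0,  7,    False),
    ("Plus", 11, 200,  True),
    ("Pro",  75, 1620, True),   # ~27 hours
]

def cheapest_tier(minutes_needed, commercial=False):
    """Return the cheapest plan covering the monthly minutes (and rights)."""
    for name, price, minutes, allows_commercial in TIERS:
        if minutes >= minutes_needed and (allows_commercial or not commercial):
            return name
    return "API (pay-as-you-go)"

cheapest_tier(30, commercial=True)   # → "Plus"
cheapest_tier(5)                     # → "Free"
```

Note how commercial rights, not credits, are usually the binding constraint: any monetized project pushes you off the free tier regardless of volume.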

Here’s exactly what to do next: test the free tier first to learn the emotion tags. But be aware of the limitations I discovered the hard way — because the free tier has some surprising restrictions.

What I Don’t Like About Fish Audio (The Free Tier Trap)

Remember that free tier trap I mentioned in the intro? Here it is — along with three other honest complaints from my 48-hour test.

First, the 500-character limit on the free tier is brutal. That’s roughly 2–3 sentences per generation. For testing individual emotion tags, it works fine. But for anything production-length, you’ll hit the wall immediately.

Second, the free tier offers no commercial use rights. In other words, even if you create something amazing, you legally can’t use it in monetized content. I discovered this restriction buried in the terms of service halfway through my test — which nearly derailed my audio drama project.

Third, the UI is cluttered and developer-focused. For non-technical creators, the interface feels overwhelming compared to ElevenLabs’ cleaner design. Specifically, finding the right settings panel requires more clicks than it should.

Finally, voice cloning quality takes a noticeable jump when you upgrade from free to Plus. The enhanced cloning on paid tiers produces crisper, more natural results. To be fair, even the free cloning outperforms most competitors — but it’s clearly throttled to push upgrades.

Despite these drawbacks, Fish Audio S2 Pro is still the best value in AI voice for creators who need emotion control. The question is whether it fits YOUR specific workflow.

Who Should (and Shouldn’t) Use Fish Audio S2 Pro?

✓ Great Fit

  • Audiobook and audio drama creators
  • Game developers (real-time NPC voices)
  • Localization teams (83 languages)
  • Faceless YouTube channels
  • Developers building voice-enabled apps

✗ Not Ideal

  • Beginners wanting plug-and-play simplicity
  • Basic flat narration (use Google/Azure free tiers)
  • Music or singing generation
  • Users who need a large built-in voice library
  • Anyone who can’t tolerate a complex UI

For a deeper look at how Fish Audio compares to established voice AI leaders, see my Murf AI vs ElevenLabs comparison. If you need voice AI specifically for faceless YouTube channels, check my complete faceless YouTube toolkit guide. Meanwhile, if you want a simpler TTS tool without the learning curve, read my Speechify review. And for podcast-focused audio editing, see my Descript vs Riverside comparison.

Before you start creating, let me answer the four questions I hear most about Fish Audio S2 Pro.

Frequently Asked Questions

Is Fish Audio S2 Pro really better than ElevenLabs?

For emotion control and pricing, yes — measurably so. Fish Audio scores 3.07 on the TTS-Arena2 Bradley-Terry scale versus ElevenLabs’ 1.80. However, ElevenLabs still offers a simpler interface and a larger built-in voice library. In short, Fish Audio wins on quality and price, while ElevenLabs wins on ease of use.

Can I use Fish Audio emotion tags on the free tier?

Yes, all 15,000+ emotion tags work on the free tier. However, the free plan limits you to 500 characters per generation and provides no commercial use rights. So you can test every tag, but you can’t use the output in monetized projects without upgrading to Plus ($11/month) or higher.

How many languages does Fish Audio S2 Pro support?

Fish Audio S2 Pro supports 83 languages, with Tier 1 quality in English, Chinese, and Japanese. Notably, it also supports cross-lingual voice cloning — meaning an English voice clone can speak fluent Japanese. That said, quality in some of the less common supported languages can be inconsistent.

Can I self-host Fish Audio S2 Pro for free?

Yes. Since Fish Audio open-sourced the S2 Pro model weights on HuggingFace (March 8, 2026), you can download and run the model on your own hardware. This is particularly useful for developers who need unlimited generation without per-character costs. Of course, self-hosting requires significant GPU resources and technical knowledge.



