Skip to content
WaifuStack
Go back

DeepSeek vs Claude vs Gemini for Roleplay: Real-World Benchmarks from Production

There are plenty of LLM benchmarks comparing models on coding and math. There are almost none comparing them on roleplay quality.

We run 5 different models in production across Suzune, each assigned to specific roles based on months of testing. This isn’t a synthetic benchmark — it’s what actually happens when real users interact with AI characters for hours at a time.

Table of contents

Open Table of contents

The Models We Use (And Why)

ModelRole in SuzuneWhy This Model
DeepSeek V3.2Primary chatBest cost/quality/freedom ratio
Claude Haiku 4.5Quality rewrite + fallbackBest prose quality
Gemini 2.5 FlashNPC directionCreative, cheap, NSFW-tolerant
Gemini 2.0 FlashNPC rewriteCheapest per token
GLM-5Scene descriptionsBest at atmosphere and world-building

No single model is “the best.” Each excels at something different.


DeepSeek V3.2: The Workhorse

Role: Primary model for all character conversations

Strengths

Weaknesses

The Quirks You Need to Know

DeepSeek V3.2 has some unique behaviors that require engineering workarounds:

1. Tokenization Glitches

DS3.2 sometimes splits Japanese words incorrectly:

社長 (president) → 社long
部長 (department head) → 部long

We have a cleanup function that catches these:

text = text.replace("社long", "社長")
text = text.replace("部long", "部長")

2. Tool Calls as Plain Text

DS3.2 sometimes outputs function calls as plain text instead of structured tool calls:

generate_image{"expression": "smiling", "scene": "café"}

We built a parser that detects and extracts these, converting them to proper tool calls.

3. NSFW Self-Censorship When Tools Are Active

Interesting one: DS3.2 is more likely to self-censor NSFW content when tool definitions are present in the prompt. Our workaround: if we detect empty responses in NSFW context, we retry without tool definitions.

4. Repetition Loops

Under certain conditions, DS3.2 gets stuck repeating short phrases (e.g., the same word 50 times). We truncate any phrase repeated more than 3 consecutive times.

Verdict

DeepSeek V3.2 is the best overall choice for NSFW roleplay — not because it’s the highest quality, but because it’s the only model that combines decent quality, NSFW freedom, and affordable pricing. Every other model requires compromises on at least one of these axes.


Claude Haiku 4.5: The Editor

Role: Quality rewrite pass (polishes DS3.2 drafts) + fallback for non-NSFW

Strengths

Weaknesses

How We Use It

Claude Haiku is NOT our primary model. It’s our quality editor:

User message → DeepSeek V3.2 (draft, uncensored)

              Claude Haiku (rewrite for quality)

              Censorship check:
                ├── Rewrite OK → use polished version
                └── Rewrite censored → use original DS3.2 draft

The censorship detection looks for:

When the rewrite pipeline works (non-explicit scenes), the quality improvement is noticeable — better word choice, more natural rhythm, stronger character voice. For explicit scenes, we skip it entirely and serve the DS3.2 draft.

Cost Optimization: Prompt Caching

Using Anthropic’s native API (not via OpenRouter), we enable prompt caching for the system prompt. Since the character persona rarely changes, cached tokens cost 1/10th of uncached. This makes the rewrite pass much cheaper per message.

Verdict

Claude Haiku is the best prose writer in our stack, but its NSFW restrictions make it unsuitable as a primary model. As a quality layer on top of DS3.2, it’s worth the extra cost for characters where voice quality matters most.


Gemini 2.5 Flash: The Director

Role: NPC concept generation and scene direction

Strengths

Weaknesses

How We Use It

Gemini excels at high-level creative direction rather than line-by-line dialogue:

We don’t use it for primary character chat because it’s less reliable at maintaining character voice, but for creative tasks where variety matters more than consistency, it’s excellent.

Verdict

The best “creative consultant” in our stack. We wouldn’t use it as a primary chat model, but for NPC generation and scene setup, it’s the best balance of creativity, NSFW tolerance, and cost.


GLM-5: The World Builder

Role: Scene descriptions and atmosphere

Strengths

Weaknesses

How We Use It

GLM-5 handles “RP room scenarios” — the immersive scene descriptions that set the stage:

Rain drums against the floor-to-ceiling windows of the 40th floor office. The city lights blur into watercolor smudges below. A single desk lamp casts a warm pool of light across scattered documents…

For pure atmosphere, GLM-5 outperforms every other model in our stack.

Verdict

A specialist. Not for everything, but unmatched for environmental descriptions and world-building.


The Real Comparison: Side by Side

Here’s the same prompt sent to each model (a character greeting the user in the morning):

DeepSeek V3.2:

Sakura stretches at her desk, coffee already half-empty Oh, morning. You’re early today. …Or am I late again? checks phone Nope, you’re just early. Weirdo.

Claude Haiku 4.5:

The soft click of a keyboard pauses as Sakura glances up, coffee cup suspended halfway to her lips …You’re here already? She sets the cup down carefully, a small smile tugging at the corner of her mouth despite her best efforts I was just getting to the good part of this draft. You always have the worst timing.

Gemini 2.5 Flash:

Good morning! Sakura looks up from her laptop with a surprised expression. “Oh, you’re early! I wasn’t expecting you for another hour.” She quickly saves her document and adjusts her glasses. “Want some coffee? I just made a fresh pot.”

GLM-5:

Morning light filters through half-drawn blinds, catching dust motes in golden suspension. The office carries the bitter-sweet scent of over-steeped coffee. …Mm? Sakura’s pen halts mid-sentence, her gaze lifting from the manuscript with that particular unfocused look of someone pulled from deep concentration.

Analysis

This is why we use multiple models. Each has a superpower. (For more on how we route between them, see Navigating AI Content Filters for Adult RP.)


Cost Comparison (Real Production Data)

Monthly costs running Suzune with multiple characters:

ModelUsageMonthly Cost
DeepSeek V3.2~60% of all calls$15–25
Claude Haiku~25% (rewrites)$5–10
Gemini Flash~10% (NPCs)$2–3
GLM-5~5% (scenes)$1–2
Total$23–40

If we ran everything on Claude Haiku: $150–250/month. The multi-model approach saves 80%+ while maintaining quality where it matters. (Full cost breakdown: Running an AI Bot on $50/month.)


Recommendations

If you’re building an RP bot:

  1. Start with DeepSeek V3.2 as your primary. Best cost/quality/freedom ratio.
  2. Add Claude Haiku as a quality layer if you can afford the extra cost.
  3. Use Gemini for creative tasks (NPC generation, plot direction).
  4. Route through OpenRouter — one API key for all models, easy switching.

If you just want to chat with AI characters:


This article is based on production data from Suzune. Model performance may vary depending on use case, prompt design, and language. We’ll update this comparison as new models are released.

See also: Prompt Engineering for Immersive Roleplay for how we design prompts that work across models.


Share this post on:

Previous Post
Dynamic Character Visuals: How One Character Can Look Like Two Different People
Next Post
Navigating AI Content Filters for Adult RP: An Architecture Guide