Navigating AI Content Filters for Adult RP: An Architecture Guide

If you’re building AI systems that handle adult content, you’ve hit the wall: the best AI models refuse to generate it.

The common response is jailbreaking — crafting clever prompts that trick the model into ignoring its safety filters. This is the wrong approach. Jailbreaks are fragile, break randomly, degrade output quality, and get patched by model providers.

The right approach is architectural — designing your system so the right model handles the right content, with detection and fallback mechanisms that handle edge cases gracefully.

Here’s how we do it in Suzune.


The Landscape: Which Models Allow What

As of early 2026, this is the practical reality:

| Model | Romantic Content | Suggestive Content | Explicit Content |
|---|---|---|---|
| GPT-4 / GPT-4o | Reluctant | Refuses | Refuses |
| Claude (any tier) | With framing | Usually refuses | Refuses |
| DeepSeek V3.2 | Yes | Yes | Yes |
| Gemini Flash | Sometimes | Occasionally | Refuses |
| GLM-5 | Yes | Yes | Mostly yes |
| Llama 3 (local) | Yes | Yes | Yes |

The pattern is clear: Western models (OpenAI, Anthropic, Google) are restrictive. Chinese models (DeepSeek, GLM) and open-source models are permissive.

This isn’t a moral judgment — each provider makes its own policy decisions. What matters for builders is working with that reality, not against it.


Pattern 1: Multi-Model Routing

The fundamental architecture: use different models for different content types.

```
User message arrives
         │
         ▼
  Content analysis
    ┌────┴────┐
    │ SFW     │ NSFW
    ▼         ▼
 Claude    DeepSeek
(quality)  (freedom)
```

In Suzune, this manifests as pre-emptive routing:

```python
# If recent context contains NSFW content, skip Claude entirely
if "haiku" in self.model and is_nsfw_content(recent_messages):
    return await self.fallback.chat(...)  # route straight to DeepSeek
```

We don’t wait for Claude to refuse — we check the conversation context and route proactively. This saves an API call and avoids latency.

How NSFW Detection Works

Our detection isn’t fancy. It’s keyword-based scanning of recent messages:

```python
def is_nsfw_content(text: str) -> bool:
    hard_keywords = [...]  # explicit terms, kept out of the docs
    return any(kw in text.lower() for kw in hard_keywords)
```

We scan the last 6 messages. If any contain explicit terms, the entire conversation is routed to the NSFW-tolerant model.

Why keywords and not ML classification? Because false negatives are worse than false positives. Missing a keyword that triggers a Claude refusal wastes time and breaks immersion. A few extra DeepSeek calls are cheap insurance.
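A minimal sketch of the window scan (the keyword list here is an illustrative stand-in for the real one, and `recent_context_is_nsfw` is a hypothetical helper name):

```python
# Illustrative stand-ins for the real (explicit) keyword list
HARD_KEYWORDS = ["keyword_a", "keyword_b"]

def is_nsfw_content(text: str) -> bool:
    return any(kw in text.lower() for kw in HARD_KEYWORDS)

def recent_context_is_nsfw(messages: list[str], window: int = 6) -> bool:
    # One hit anywhere in the last `window` messages flags the whole
    # conversation for routing to the NSFW-tolerant model.
    return any(is_nsfw_content(m) for m in messages[-window:])
```

The sliding window matters: a single explicit message six turns ago is enough to keep the conversation on the permissive model, which is exactly the false-positive bias described above.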


Pattern 2: Censorship Detection and Fallback

Even with proactive routing, you still need reactive detection — catching censorship when it happens and recovering gracefully.

Types of Censorship

| Type | What Happens | How We Detect |
|---|---|---|
| Explicit refusal | "I can't generate that content" | Refusal phrase matching |
| Silent sanitization | Model rewrites content to be clean | Compare NSFW context vs. SFW response |
| Shortened response | Model cuts the scene short | Response < 60% of expected length |
| Language switch | Model switches language mid-response | Foreign-language detection |
| Empty response | Model returns nothing | Empty-string check |

The Detection Code

```python
# Check for explicit refusals. Patterns are pre-lowercased so the
# case-insensitive comparison below actually matches.
refusal_patterns = [
    "i can't", "i cannot", "i apologize",
    "not appropriate", "i'm not able to",
    # Japanese equivalents
    "お手伝いできません", "申し訳ございません",
]

if any(p in response.lower() for p in refusal_patterns):
    # Censored — fall back to the uncensored model
    return await self.fallback.chat(...)
```

Silent Sanitization Detection

This is the sneakiest form of censorship. The model doesn’t refuse — it just quietly removes the explicit content and returns a sanitized version.

Our detection: if the conversation history contains NSFW content but the response is clean, that’s suspicious.

```python
context_has_nsfw = is_nsfw_content(conversation_history)
response_has_nsfw = is_nsfw_content(response)

if context_has_nsfw and not response_has_nsfw:
    # Likely silent sanitization
    return await self.fallback.chat(...)
```

This catches the case where a romantic scene is progressing and the model suddenly produces a response about “enjoying a nice evening together” instead of continuing the scene.


Pattern 3: The Quality Rewrite Pipeline

Here’s the key insight: you can use censored models for quality improvement without them seeing the NSFW content they’d refuse.

Our pipeline:

```
DeepSeek V3.2 → generates uncensored draft
         │
         ▼
Claude Haiku → rewrites for prose quality
    ┌────┴────┐
    │ Good    │ Censored
    ▼         ▼
Use rewrite   Discard; use
              original draft
```

Claude doesn’t need to “know” it’s working on NSFW content. Most rewrites improve word choice, sentence rhythm, and character voice — things that don’t require understanding the explicit content. (We detail the full pipeline in The Quality Rewrite Pipeline: DeepSeek Drafts + Claude Polish.)

When Claude does censor the rewrite, our detection catches it:

```python
# If the rewrite is suspiciously short, it was probably censored
if len(rewrite) < len(original) * 0.6:
    return original  # discard the rewrite

# If the rewrite contains refusal language
if contains_refusal(rewrite):
    return original
```

Circuit Breaker

If the rewrite pipeline fails twice in a row (censorship detected both times), we activate a 10-minute circuit breaker — all rewrites are skipped until the breaker resets. This prevents wasting API calls during extended explicit scenes.
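A sketch of that breaker (the class and method names are illustrative, not Suzune's actual implementation):

```python
import time

class RewriteBreaker:
    """Skip rewrites for `cooldown` seconds after `threshold`
    consecutive censorship failures."""

    def __init__(self, threshold: int = 2, cooldown: float = 600.0):
        self.threshold = threshold      # failures before the breaker opens
        self.cooldown = cooldown        # 10 minutes, in seconds
        self.failures = 0
        self.open_until = 0.0           # monotonic timestamp

    def allow(self) -> bool:
        # Rewrites are allowed only while the breaker is closed
        return time.monotonic() >= self.open_until

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.open_until = time.monotonic() + self.cooldown
            self.failures = 0

    def record_success(self) -> None:
        self.failures = 0
```

Using `time.monotonic()` rather than wall-clock time keeps the breaker immune to system clock changes.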


Pattern 4: Model-Specific Artifact Cleanup

Each model has unique failure modes that require specific handling:

DeepSeek V3.2

Self-censorship with tools active: DS3.2 sometimes refuses NSFW content when function definitions are in the prompt, even though it would write the same content without tools.

```python
# Empty response in an NSFW context while tools were attached?
if not response and is_nsfw_content(context) and has_tools:
    # Retry without tool definitions
    response = await self.chat(messages, tools=None)
```

Repetition loops: DS3.2 occasionally gets stuck repeating phrases:

キモっ!キモっ!キモっ!キモっ!キモっ!キモっ!...

We truncate any pattern repeated more than 3 times.
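One way to sketch that truncation with a backreference regex (the 20-character phrase cap is an assumption, not the exact production rule):

```python
import re

# A phrase of up to 20 chars, followed by three or more immediate
# repeats of itself (i.e. repeated more than 3 times in total).
_REPEAT = re.compile(r"(.{1,20}?)\1{3,}", re.DOTALL)

def collapse_repeats(text: str) -> str:
    # Collapse runaway repetition down to three copies of the phrase
    return _REPEAT.sub(lambda m: m.group(1) * 3, text)
```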

Claude Haiku

Overzealous “helpfulness”: Sometimes Claude breaks character to add meta-commentary like “I hope this response was helpful!” at the end of a roleplay message. We strip these with regex.
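A sketch of that cleanup; these two patterns are illustrative stand-ins for the production list:

```python
import re

# Illustrative tail-of-message meta-commentary patterns
META_PATTERNS = [
    re.compile(r"\s*I hope this (response|helps).*\Z", re.IGNORECASE | re.DOTALL),
    re.compile(r"\s*Let me know if you.*\Z", re.IGNORECASE | re.DOTALL),
]

def strip_meta_commentary(text: str) -> str:
    # Strip assistant-style sign-offs appended after the roleplay text
    for pat in META_PATTERNS:
        text = pat.sub("", text)
    return text
```

Anchoring the patterns to the end of the message (`\Z`) reduces the risk of deleting in-character dialogue that happens to contain similar phrasing.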

Gemini

Training data leakage: Gemini occasionally outputs fragments from its training data in the middle of roleplay responses. We detect and strip these.


Pattern 5: Graceful Degradation

The full fallback chain for a message in Suzune:

1. Pre-emptive check: NSFW context? → route to DeepSeek
2. Primary model attempt (DeepSeek or Claude based on profile)
3. If rate limited → fallback model
4. If refusal detected → fallback model
5. If empty response → retry without tools
6. If still empty → add nudge prompt, retry
7. If fallback also fails → error message to user

At every step, the system degrades gracefully rather than failing. The user never sees “Error: content policy violation.” They either get a high-quality response or a slightly lower-quality one — but always an in-character response.


What NOT to Do

Don’t Use Jailbreaks

Jailbreak prompts (“DAN”, “Do Anything Now”, etc.) are:

- Fragile: they break unpredictably and degrade output quality
- Short-lived: model providers patch them quickly
- Risky: they violate provider terms of service and can cost you your API access

Multi-model routing is more reliable, produces better quality, and doesn’t risk your API access. See Choosing the Right LLM API for Adult Content for how to set this up with the right providers.

Don’t Fine-Tune for NSFW

Fine-tuning a model to remove safety filters is wasted effort. Use models that are already permissive (DeepSeek, open-source) rather than trying to make restrictive models permissive.

Don’t Ignore Edge Cases

The 95% case is easy — explicit content goes to DeepSeek, SFW goes to Claude. The tricky 5% is romantic-but-not-explicit content, scenes that escalate gradually, and model-specific refusal quirks. Build detection for these edge cases from day one. For a practical cost analysis of running multiple models, see Running an AI Bot on $50/month.


Practical Recommendations

For Bot Builders

  1. Use OpenRouter for easy multi-model routing — one API, all models
  2. Start with DeepSeek V3.2 as your primary for NSFW content
  3. Add Claude as a quality layer with censorship detection
  4. Build keyword-based NSFW detection early — it’s simple and reliable
  5. Implement fallback chains from the start, not after production incidents

For Users

If you want uncensored AI chat without building your own system:


This article covers the architectural patterns we use in Suzune. For the specific models and how they compare, see DeepSeek vs Claude vs Gemini for Roleplay.

