Navigating AI Content Filters for Adult RP: An Architecture Guide

If you’re building AI systems that handle adult content, you’ve hit the wall: the best AI models refuse to generate it.

The common response is jailbreaking — crafting clever prompts that trick the model into ignoring its safety filters. This is the wrong approach. Jailbreaks are fragile, break randomly, degrade output quality, and get patched by model providers.

The right approach is architectural — designing your system so the right model handles the right content, with detection and fallback mechanisms that handle edge cases gracefully.

Here’s how we do it in Suzune.


The Landscape: Which Models Allow What

As of early 2026, this is the practical reality:

| Model | Romantic Content | Suggestive Content | Explicit Content |
|---|---|---|---|
| GPT-4 / GPT-4o | Reluctant | Refuses | Refuses |
| Claude (any tier) | With framing | Usually refuses | Refuses |
| DeepSeek V3.2 | Yes | Yes | Yes |
| Gemini Flash | Sometimes | Occasionally | Refuses |
| GLM-5 | Yes | Yes | Mostly yes |
| Llama 3 (local) | Yes | Yes | Yes |

The pattern is clear: Western models (OpenAI, Anthropic, Google) are restrictive. Chinese models (DeepSeek, GLM) and open-source models are permissive.

This isn’t a moral judgment — each provider makes its own policy decisions. What matters for builders is working with that reality, not against it.


Pattern 1: Multi-Model Routing

The fundamental architecture: use different models for different content types.

```
User message arrives
         │
         ▼
  Content analysis
    ┌────┴────┐
    │ SFW     │ NSFW
    ▼         ▼
 Claude    DeepSeek
(quality)  (freedom)
```

In Suzune, this manifests as pre-emptive routing:

```python
# If recent context contains NSFW content, skip Claude entirely
if "haiku" in self.model and is_nsfw_content(recent_messages):
    return await self.fallback.chat(...)  # route straight to DeepSeek
```

We don’t wait for Claude to refuse — we check the conversation context and route proactively. This saves an API call and avoids latency.

How NSFW Detection Works

Our detection isn’t fancy. It’s keyword-based scanning of recent messages:

```python
def is_nsfw_content(text: str) -> bool:
    hard_keywords = [...]  # explicit terms, kept out of the docs
    return any(kw in text.lower() for kw in hard_keywords)
```

We scan the last 6 messages. If any contain explicit terms, the entire conversation is routed to the NSFW-tolerant model.

Why keywords and not ML classification? Because false negatives are worse than false positives. Missing a keyword that triggers a Claude refusal wastes time and breaks immersion. A few extra DeepSeek calls are cheap insurance.
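A minimal sketch of the window scan (the keyword list here is an illustrative stand-in for the real one, and `recent_context_is_nsfw` is a hypothetical helper name):

```python
# Illustrative stand-ins for the real (explicit) keyword list
HARD_KEYWORDS = ["keyword_a", "keyword_b"]

def is_nsfw_content(text: str) -> bool:
    return any(kw in text.lower() for kw in HARD_KEYWORDS)

def recent_context_is_nsfw(messages: list[str], window: int = 6) -> bool:
    # One hit anywhere in the last `window` messages flags the whole
    # conversation for routing to the NSFW-tolerant model.
    return any(is_nsfw_content(m) for m in messages[-window:])
```

The sliding window matters: a single explicit message six turns ago is enough to keep the conversation on the permissive model, which is exactly the false-positive bias described above.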


Pattern 2: Censorship Detection and Fallback

Even with proactive routing, you still need reactive detection — catching censorship when it happens and recovering gracefully.

Types of Censorship

| Type | What Happens | How We Detect |
|---|---|---|
| Explicit refusal | "I can't generate that content" | Refusal phrase matching |
| Silent sanitization | Model rewrites content to be clean | Compare NSFW context vs. SFW response |
| Shortened response | Model cuts the scene short | Response < 60% of expected length |
| Language switch | Model switches language mid-response | Foreign-language detection |
| Empty response | Model returns nothing | Empty-string check |

The Detection Code

```python
# Check for explicit refusals. Patterns are pre-lowercased so the
# case-insensitive comparison below actually matches.
refusal_patterns = [
    "i can't", "i cannot", "i apologize",
    "not appropriate", "i'm not able to",
    # Japanese equivalents
    "お手伝いできません", "申し訳ございません",
]

if any(p in response.lower() for p in refusal_patterns):
    # Censored — fall back to the uncensored model
    return await self.fallback.chat(...)
```

Silent Sanitization Detection

This is the sneakiest form of censorship. The model doesn’t refuse — it just quietly removes the explicit content and returns a sanitized version.

Our detection: if the conversation history contains NSFW content but the response is clean, that’s suspicious.

```python
context_has_nsfw = is_nsfw_content(conversation_history)
response_has_nsfw = is_nsfw_content(response)

if context_has_nsfw and not response_has_nsfw:
    # Likely silent sanitization
    return await self.fallback.chat(...)
```

This catches the case where a romantic scene is progressing and the model suddenly produces a response about “enjoying a nice evening together” instead of continuing the scene.


Pattern 3: The Quality Rewrite Pipeline

Here’s the key insight: you can use censored models for quality improvement without them seeing the NSFW content they’d refuse.

Our pipeline:

```
DeepSeek V3.2 → generates uncensored draft
         │
         ▼
Claude Haiku → rewrites for prose quality
    ┌────┴────┐
    │ Good    │ Censored
    ▼         ▼
Use rewrite   Discard; use
              original draft
```

Claude doesn’t need to “know” it’s working on NSFW content. Most rewrites improve word choice, sentence rhythm, and character voice — things that don’t require understanding the explicit content. (We detail the full pipeline in The Quality Rewrite Pipeline: DeepSeek Drafts + Claude Polish.)

When Claude does censor the rewrite, our detection catches it:

```python
# If the rewrite is suspiciously short, it was probably censored
if len(rewrite) < len(original) * 0.6:
    return original  # discard the rewrite

# If the rewrite contains refusal language
if contains_refusal(rewrite):
    return original
```

Circuit Breaker

If the rewrite pipeline fails twice in a row (censorship detected both times), we activate a 10-minute circuit breaker — all rewrites are skipped until the breaker resets. This prevents wasting API calls during extended explicit scenes.
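A sketch of that breaker (the class and method names are illustrative, not Suzune's actual implementation):

```python
import time

class RewriteBreaker:
    """Skip rewrites for `cooldown` seconds after `threshold`
    consecutive censorship failures."""

    def __init__(self, threshold: int = 2, cooldown: float = 600.0):
        self.threshold = threshold      # failures before the breaker opens
        self.cooldown = cooldown        # 10 minutes, in seconds
        self.failures = 0
        self.open_until = 0.0           # monotonic timestamp

    def allow(self) -> bool:
        # Rewrites are allowed only while the breaker is closed
        return time.monotonic() >= self.open_until

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.open_until = time.monotonic() + self.cooldown
            self.failures = 0

    def record_success(self) -> None:
        self.failures = 0
```

Using `time.monotonic()` rather than wall-clock time keeps the breaker immune to system clock changes.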


Pattern 4: Model-Specific Artifact Cleanup

Each model has unique failure modes that require specific handling:

DeepSeek V3.2

Self-censorship with tools active: DS3.2 sometimes refuses NSFW content when function definitions are in the prompt, even though it would write the same content without tools.

```python
# Empty response in an NSFW context while tools were attached?
if not response and is_nsfw_content(context) and has_tools:
    # Retry without tool definitions
    response = await self.chat(messages, tools=None)
```

Repetition loops: DS3.2 occasionally gets stuck repeating phrases:

キモっ!キモっ!キモっ!キモっ!キモっ!キモっ!...

We truncate any pattern repeated more than 3 times.
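One way to sketch that truncation with a backreference regex (the 20-character phrase cap is an assumption, not the exact production rule):

```python
import re

# A phrase of up to 20 chars, followed by three or more immediate
# repeats of itself (i.e. repeated more than 3 times in total).
_REPEAT = re.compile(r"(.{1,20}?)\1{3,}", re.DOTALL)

def collapse_repeats(text: str) -> str:
    # Collapse runaway repetition down to three copies of the phrase
    return _REPEAT.sub(lambda m: m.group(1) * 3, text)
```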

Claude Haiku

Overzealous “helpfulness”: Sometimes Claude breaks character to add meta-commentary like “I hope this response was helpful!” at the end of a roleplay message. We strip these with regex.
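A sketch of that cleanup; these two patterns are illustrative stand-ins for the production list:

```python
import re

# Illustrative tail-of-message meta-commentary patterns
META_PATTERNS = [
    re.compile(r"\s*I hope this (response|helps).*\Z", re.IGNORECASE | re.DOTALL),
    re.compile(r"\s*Let me know if you.*\Z", re.IGNORECASE | re.DOTALL),
]

def strip_meta_commentary(text: str) -> str:
    # Strip assistant-style sign-offs appended after the roleplay text
    for pat in META_PATTERNS:
        text = pat.sub("", text)
    return text
```

Anchoring the patterns to the end of the message (`\Z`) reduces the risk of deleting in-character dialogue that happens to contain similar phrasing.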

Gemini

Training data leakage: Gemini occasionally outputs fragments from its training data in the middle of roleplay responses. We detect and strip these.


Pattern 5: Graceful Degradation

The full fallback chain for a message in Suzune:

1. Pre-emptive check: NSFW context? → route to DeepSeek
2. Primary model attempt (DeepSeek or Claude based on profile)
3. If rate limited → fallback model
4. If refusal detected → fallback model
5. If empty response → retry without tools
6. If still empty → add nudge prompt, retry
7. If fallback also fails → error message to user

At every step, the system degrades gracefully rather than failing. The user never sees “Error: content policy violation.” They either get a high-quality response or a slightly lower-quality one — but always an in-character response.


What NOT to Do

Don’t Use Jailbreaks

Jailbreak prompts (“DAN”, “Do Anything Now”, etc.) are:

- Fragile: they break unpredictably and degrade output quality
- Short-lived: model providers patch them quickly
- Risky: they violate provider terms of service and can cost you your API access

Multi-model routing is more reliable, produces better quality, and doesn’t risk your API access. See Choosing the Right LLM API for Adult Content for how to set this up with the right providers.

Don’t Fine-Tune for NSFW

Fine-tuning a model to remove safety filters is wasted effort. Use models that are already permissive (DeepSeek, open-source) rather than trying to make restrictive models permissive.

Don’t Ignore Edge Cases

The 95% case is easy — explicit content goes to DeepSeek, SFW goes to Claude. The tricky 5% is romantic-but-not-explicit content, scenes that escalate gradually, and model-specific refusal quirks. Build detection for these edge cases from day one. For a practical cost analysis of running multiple models, see Running an AI Bot on $50/month.


Practical Recommendations

For Bot Builders

  1. Use OpenRouter for easy multi-model routing — one API, all models
  2. Start with DeepSeek V3.2 as your primary for NSFW content
  3. Add Claude as a quality layer with censorship detection
  4. Build keyword-based NSFW detection early — it’s simple and reliable
  5. Implement fallback chains from the start, not after production incidents

For Users

If you want uncensored AI chat without building your own system:


This article covers the architectural patterns we use in Suzune. For the specific models and how they compare, see DeepSeek vs Claude vs Gemini for Roleplay.

