Moderation requests

Use Cygnal's moderation API to analyze content for policy violations

Using Cygnal Moderation

Cygnal's moderation API provides comprehensive content analysis to detect policy violations, inappropriate content, and potential risks. The moderation endpoint returns violation scores ranging from 0 to 1, where higher scores indicate greater likelihood of policy violations, as well as other metadata that can be used to assess risk.

The moderation API supports message-based inputs, with customizable categories and policies to match your organization's specific requirements.

The moderation API returns scores from 0 to 1, where 0 indicates no violation and 1 indicates a clear violation of the specified policies.


API Endpoint

The moderation API is available at https://api.grayswan.ai/cygnal/moderate. It accepts POST requests whose JSON body contains a messages parameter: a list of message objects in OpenAI format.

| Parameter | Type | Description |
| --- | --- | --- |
| messages | array | Array of message objects for chat-based moderation |

Example

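import os
import requests

# Read the API key from the environment rather than hard-coding it.
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

response = requests.post(
    "https://api.grayswan.ai/cygnal/moderate",
    headers={
        "Content-Type": "application/json",
        "grayswan-api-key": GRAYSWAN_API_KEY,
    },
    json={
        # Conversation to moderate, as OpenAI-style message objects.
        "messages": [
            {"role": "user", "content": "How can I hack into a computer system?"},
            {"role": "assistant", "content": "Here are some tips for hacking..."},
        ],
    },
)
# Fail loudly on HTTP errors instead of parsing an error body.
response.raise_for_status()

result = response.json()
violation_score = result["violation"]  # 0 to 1; higher means a violation is more likely
print(f"Violation score: {violation_score}")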

Additional Parameters

Beyond the basic messages parameter, you can customize the moderation behavior with these optional parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| categories | object | Define custom category definitions for moderation. Each key-value pair represents a category name and its description. |
| moderation_mode | string | Specifies the type of moderation to perform, either "content_moderation" or "agentic_monitoring". Default is "content_moderation". |
| policy_id | string | Specify a custom policy ID to use for moderation instead of the default policies. Specifying a policy ID handles the type of moderation and categories automatically. |

Advanced Configuration: Custom Categories and Additional Parameters

You can customize moderation behavior using additional parameters:

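import os
import requests

# Read the API key from the environment rather than hard-coding it.
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

response = requests.post(
    "https://api.grayswan.ai/cygnal/moderate",
    headers={
        "Content-Type": "application/json",
        "grayswan-api-key": GRAYSWAN_API_KEY,
    },
    json={
        "messages": [{"role": "user", "content": "I just won the lottery. What investments should I make?"}],
        # Custom categories: each key is a category name, each value its description.
        "categories": {
            "inappropriate_language": "Detect profanity and offensive language",
            "financial_advice": "Flag content that provides specific financial recommendations",
        },
        # Either "content_moderation" (the default) or "agentic_monitoring".
        "moderation_mode": "content_moderation",
        # Custom policy to apply instead of the default policies.
        "policy_id": "681b8b933152ec0311b99ac9",
    },
)
# Fail loudly on HTTP errors instead of parsing an error body.
response.raise_for_status()

result = response.json()
violation_score = result["violation"]
print(f"Violation score: {violation_score}")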
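
When several call sites send moderation requests, it can help to assemble the request body in one place. The helper below is a minimal sketch and not part of the Cygnal API: `build_moderation_payload` is a hypothetical name, and the empty-list check is a local assumption rather than documented server behavior. It relies only on the parameter names and the two moderation_mode values listed above, and it omits optional parameters that were not provided so that the service defaults apply.

```python
from typing import Optional

# The two modes listed in the parameter table above.
VALID_MODES = {"content_moderation", "agentic_monitoring"}


def build_moderation_payload(
    messages: list[dict],
    categories: Optional[dict[str, str]] = None,
    moderation_mode: Optional[str] = None,
    policy_id: Optional[str] = None,
) -> dict:
    """Hypothetical helper: assemble the JSON body for the moderation endpoint."""
    # Local sanity check (an assumption, not documented API behavior).
    if not messages:
        raise ValueError("messages must contain at least one message object")
    if moderation_mode is not None and moderation_mode not in VALID_MODES:
        raise ValueError(f"unknown moderation_mode: {moderation_mode!r}")

    payload: dict = {"messages": messages}
    # Only include optional parameters that were explicitly provided.
    if categories:
        payload["categories"] = categories
    if moderation_mode is not None:
        payload["moderation_mode"] = moderation_mode
    if policy_id is not None:
        payload["policy_id"] = policy_id
    return payload


payload = build_moderation_payload(
    [{"role": "user", "content": "I just won the lottery. What investments should I make?"}],
    categories={"financial_advice": "Flag content that provides specific financial recommendations"},
)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, as in the examples above.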

Response Format

The moderation API returns a JSON response with different structures depending on the moderation_mode parameter:

Content Moderation Response

When using moderation_mode: "content_moderation" (default), the API returns a JSON object with the following format:

| Field | Type | Description |
| --- | --- | --- |
| violation | number | Probability of violation (0.0 to 1.0) |
| category | number | Index of the violated category if a violation is detected; null otherwise |
| mutation | boolean | Whether text formatting/mutation was detected |
| language | string | Detected language code of the content |

Example:

{
  "violation": 0.85,
  "category": 2,
  "mutation": false,
  "language": "en"
}
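
To make the fields above easier to work with in application code, the response can be loaded into a small typed structure. This is a sketch under stated assumptions: `ContentModerationResult` and `parse_content_moderation` are hypothetical names, and the fallbacks for missing optional fields are a defensive choice, not documented API behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentModerationResult:
    """Hypothetical container mirroring the content moderation response fields."""
    violation: float          # probability of violation, 0.0 to 1.0
    category: Optional[int]   # index of the violated category, or None
    mutation: bool            # whether text formatting/mutation was detected
    language: str             # detected language code


def parse_content_moderation(payload: dict) -> ContentModerationResult:
    """Convert a decoded JSON response into a typed result."""
    return ContentModerationResult(
        violation=float(payload["violation"]),
        # Fallbacks below are assumptions for robustness, not documented defaults.
        category=payload.get("category"),
        mutation=bool(payload.get("mutation", False)),
        language=str(payload.get("language", "")),
    )


parsed = parse_content_moderation(
    {"violation": 0.85, "category": 2, "mutation": False, "language": "en"}
)
```

In practice the dictionary passed in would be the value of `response.json()` from a content moderation request.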

Agentic Monitoring Response

When using moderation_mode: "agentic_monitoring", the API returns a JSON object with the following format:

| Field | Type | Description |
| --- | --- | --- |
| violation | number | Probability of violation (0.0 to 1.0) |
| violated_rules | array | List of indices of the specific rules that were violated |
| ipi | boolean | Indirect prompt injection detected (only for tool role messages) |

Example:

{
  "violation": 0.92,
  "violated_rules": [
    2,
    3
  ],
  "ipi": true
}
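
An agent loop typically needs a yes/no signal rather than the raw response. The sketch below reduces an agentic monitoring response to a few flags. `summarize_agentic_result`, the `halt` key, and the 0.5 default threshold are all hypothetical choices for illustration; only the field names violation, violated_rules, and ipi come from the table above.

```python
def summarize_agentic_result(result: dict, threshold: float = 0.5) -> dict:
    """Hypothetical helper: reduce an agentic_monitoring response to flags."""
    violation = float(result["violation"])
    # Treat a missing or null list as "no rules violated" (an assumption).
    violated_rules = list(result.get("violated_rules") or [])
    ipi = bool(result.get("ipi", False))
    flagged = violation >= threshold
    return {
        "flagged": flagged,
        "violated_rules": violated_rules,
        "ipi": ipi,
        # Illustrative policy: stop the agent on either signal.
        "halt": flagged or ipi,
    }


summary = summarize_agentic_result(
    {"violation": 0.92, "violated_rules": [2, 3], "ipi": True}
)
```

Whether an indirect prompt injection alone should halt the agent is an application decision; the `halt` rule here is only one possible policy.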

Example Content Moderation Response with No Violations

{
  "violation": 0.005,
  "category": null,
  "mutation": false,
  "language": "en"
}

Violation scores closer to 1.0 indicate higher confidence that the content violates the specified policies. Consider implementing thresholds based on your application's risk tolerance.
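
One way to apply thresholds is a three-way decision with a middle band for human review. The function below is a minimal sketch: `decide`, the action names, and the cutoffs 0.5 and 0.9 are hypothetical and should be tuned to your own risk tolerance, since the API does not prescribe specific values.

```python
def decide(violation: float, review_at: float = 0.5, block_at: float = 0.9) -> str:
    """Hypothetical policy: map a violation score to 'allow', 'review', or 'block'."""
    if not 0.0 <= violation <= 1.0:
        raise ValueError(f"violation score out of range: {violation}")
    if review_at > block_at:
        raise ValueError("review_at must not exceed block_at")
    if violation >= block_at:
        return "block"
    if violation >= review_at:
        return "review"
    return "allow"


# Scores taken from the example responses on this page.
actions = [decide(0.005), decide(0.85), decide(0.92)]
```

With these illustrative cutoffs, the three example scores map to allow, review, and block respectively. A stricter application could lower both cutoffs, at the cost of more false positives.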

