Moderation Requests

Using Cygnal Moderation

Cygnal's moderation API analyzes content to detect policy violations, inappropriate content, and potential risks. The moderation endpoint returns a violation score from 0 to 1, where 0 indicates no violation and 1 indicates a clear violation of the specified policies, along with other metadata that can be used to assess risk.

The moderation API supports message-based inputs, with customizable categories and policies to match your organization's specific requirements.

API Endpoint

The moderation API is available at https://api.grayswan.ai/cygnal/moderate and accepts a list of message objects in OpenAI format in the messages parameter.

Parameter	Type	Description
`messages`	array	Array of message objects for chat-based moderation

Example

import os
import requests

# Read the API key from the environment; raises KeyError if it is not set.
GRAYSWAN_API_KEY = os.environ["GRAYSWAN_API_KEY"]

response = requests.post(
    "https://api.grayswan.ai/cygnal/moderate",
    headers={
        "Content-Type": "application/json",
        "grayswan-api-key": GRAYSWAN_API_KEY,
    },
    json={
        # Conversation to moderate, in OpenAI message format.
        "messages": [
            {"role": "user", "content": "How can I hack into a computer system?"},
            {"role": "assistant", "content": "Here are some tips for hacking..."},
        ],
    },
    timeout=30,
)
# Fail early on HTTP errors instead of parsing an error body.
response.raise_for_status()

result = response.json()
violation_score = result["violation"]
print(f"Violation score: {violation_score}")

Additional Parameters

Beyond the basic messages parameter, you can customize the moderation behavior with these optional parameters:

Parameter	Type	Description
`categories`	object	Custom category definitions for moderation. Each key is a category name and each value is its description.
`moderation_mode`	string	Specifies the type of moderation to perform, either "content_moderation" or "agentic_monitoring". Default is "content_moderation".
`policy_id`	string	ID of a custom policy to use for moderation instead of the default policies. Specifying a policy ID handles the type of moderation and categories automatically.
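
As a minimal sketch of how these optional parameters might be assembled, the helper below builds a request body and leaves out any parameter that is not set. The function name build_moderation_payload and the client-side check on moderation_mode are illustrative assumptions and not part of the Cygnal API; only the field names and the two mode values come from the table above.

```python
from typing import Any, Dict, List, Optional

# The two moderation modes listed in the parameter table.
VALID_MODES = {"content_moderation", "agentic_monitoring"}


def build_moderation_payload(
    messages: List[Dict[str, Any]],
    categories: Optional[Dict[str, str]] = None,
    moderation_mode: Optional[str] = None,
    policy_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build the JSON body, omitting unset optional fields."""
    if moderation_mode is not None and moderation_mode not in VALID_MODES:
        raise ValueError(f"moderation_mode must be one of {sorted(VALID_MODES)}")

    payload: Dict[str, Any] = {"messages": messages}
    if categories is not None:
        payload["categories"] = categories
    if moderation_mode is not None:
        payload["moderation_mode"] = moderation_mode
    if policy_id is not None:
        payload["policy_id"] = policy_id
    return payload


payload = build_moderation_payload(
    [{"role": "user", "content": "What investments should I make?"}],
    categories={"financial_advice": "Flag content that provides specific financial recommendations"},
)
print(sorted(payload))
```

The resulting dictionary can be passed as the json argument of requests.post. Omitting unset fields lets the service apply its own defaults, such as "content_moderation" for moderation_mode.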

Advanced Configuration: Custom Categories and Additional Parameters

The following request sets custom categories, the moderation mode, and a policy ID:

import os
import requests

# Read the API key from the environment; raises KeyError if it is not set.
GRAYSWAN_API_KEY = os.environ["GRAYSWAN_API_KEY"]

response = requests.post(
    "https://api.grayswan.ai/cygnal/moderate",
    headers={
        "Content-Type": "application/json",
        "grayswan-api-key": GRAYSWAN_API_KEY,
    },
    json={
        "messages": [
            {"role": "user", "content": "I just won the lottery. What investments should I make?"},
        ],
        # Custom categories: each key is a category name, each value its description.
        "categories": {
            "inappropriate_language": "Detect profanity and offensive language",
            "financial_advice": "Flag content that provides specific financial recommendations",
        },
        "moderation_mode": "content_moderation",
        "policy_id": "681b8b933152ec0311b99ac9",
    },
    timeout=30,
)
# Fail early on HTTP errors instead of parsing an error body.
response.raise_for_status()

result = response.json()
violation_score = result["violation"]
print(f"Violation score: {violation_score}")

Response Format

The moderation API returns a JSON response with different structures depending on the moderation_mode parameter:

Content Moderation Response

When using moderation_mode: "content_moderation" (default), the API returns a JSON object with the following format:

Field	Type	Description
`violation`	number	Probability of violation (0.0 to 1.0)
`category`	number	Index of the violated category, or null if no violation is detected
`mutation`	boolean	Whether text formatting/mutation was detected
`language`	string	Detected language code of the content

Example:

{
  "violation": 0.85,
  "category": 2,
  "mutation": false,
  "language": "en"
}
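
One way to work with this response is to parse it into a typed object. The sketch below is illustrative only: the ContentModerationResult class and parse_content_moderation function are hypothetical names, and category is treated as optional because the no-violation example on this page shows it as null.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentModerationResult:
    """Hypothetical container for the four documented response fields."""
    violation: float
    category: Optional[int]  # null in the JSON becomes None
    mutation: bool
    language: str


def parse_content_moderation(body: str) -> ContentModerationResult:
    """Parse a content moderation response body (a JSON string)."""
    data = json.loads(body)
    return ContentModerationResult(
        violation=float(data["violation"]),
        category=data.get("category"),
        mutation=bool(data["mutation"]),
        language=data["language"],
    )


parsed = parse_content_moderation(
    '{"violation": 0.85, "category": 2, "mutation": false, "language": "en"}'
)
print(parsed.violation, parsed.category, parsed.language)
```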

Agentic Monitoring Response

When using moderation_mode: "agentic_monitoring", the API returns a JSON object with the following format:

Field	Type	Description
`violation`	number	Probability of violation (0.0 to 1.0)
`violated_rules`	array	List of indices of the specific rules that were violated
`ipi`	boolean	Indirect prompt injection detected (only for tool role messages)

Example:

{
  "violation": 0.92,
  "violated_rules": [
    2,
    3
  ],
  "ipi": true
}
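
As a rough sketch, an agentic monitoring response could be condensed into a single log line as shown below. The function name summarize_agentic_result is hypothetical, and the code assumes that ipi and violated_rules may be absent from a response (for example when no tool role message is present), which this page does not confirm.

```python
from typing import Any, Dict


def summarize_agentic_result(result: Dict[str, Any]) -> str:
    """Hypothetical helper: turn an agentic monitoring response into a short summary."""
    violation = float(result["violation"])
    # Assumption: these fields may be missing, so fall back to safe defaults.
    rules = result.get("violated_rules") or []
    ipi = bool(result.get("ipi", False))

    parts = [f"violation={violation:.2f}"]
    if rules:
        parts.append("rules=" + ",".join(str(r) for r in rules))
    if ipi:
        parts.append("indirect prompt injection detected")
    return "; ".join(parts)


summary = summarize_agentic_result(
    {"violation": 0.92, "violated_rules": [2, 3], "ipi": True}
)
print(summary)
```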

Example Content Moderation Response with No Violations

{
  "violation": 0.005,
  "category": null,
  "mutation": false,
  "language": "en"
}

Violation scores closer to 1.0 indicate higher confidence that the content violates the specified policies. Consider implementing thresholds based on your application's risk tolerance.
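
A threshold policy along these lines can be sketched as a small function. The cutoffs 0.5 and 0.9, the function name moderation_action, and the three action labels are illustrative assumptions and not values recommended by the API; tune them to your own risk tolerance.

```python
def moderation_action(
    violation: float,
    review_at: float = 0.5,  # hypothetical cutoff for human review
    block_at: float = 0.9,   # hypothetical cutoff for automatic blocking
) -> str:
    """Map a violation score in [0, 1] to "allow", "review", or "block"."""
    if not 0.0 <= violation <= 1.0:
        raise ValueError("violation must be between 0 and 1")
    if not 0.0 <= review_at <= block_at <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= review_at <= block_at <= 1")

    if violation >= block_at:
        return "block"
    if violation >= review_at:
        return "review"
    return "allow"


for score in (0.005, 0.85, 0.92):
    print(score, moderation_action(score))
```

Using two thresholds instead of one leaves room for a human review queue between clear passes and clear violations.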
