Moderation Requests
Use Cygnal's moderation API to analyze content for policy violations
Using Cygnal Moderation
Cygnal's moderation API provides comprehensive content analysis to detect policy violations, inappropriate content, and potential risks. The moderation endpoint returns violation scores ranging from 0 to 1, where higher scores indicate greater likelihood of policy violations, as well as other metadata that can be used to assess risk.
The moderation API supports message-based inputs, with customizable categories and policies to match your organization's specific requirements.
The moderation API returns scores from 0 to 1, where 0 indicates no violation and 1 indicates a clear violation of the specified policies.
API Endpoint
The moderation API is available at https://api.grayswan.ai/cygnal/moderate
and accepts a list of message objects in OpenAI format as messages
Parameter | Type | Description |
---|---|---|
messages | array | Array of message objects for chat-based moderation |
Example
Additional Parameters
Beyond the basic messages
parameter, you can customize the moderation behavior with these optional parameters:
Parameter | Type | Description |
---|---|---|
categories | object | Define custom category definitions for moderation. Each key-value pair represents a category name and its description. |
moderation_mode | string | Specifies the type of moderation to perform, either "content_moderation" or "agentic_monitoring". Default is "content_moderation". |
policy_id | string | Specify a custom policy ID to use for moderation instead of the default policies. Specifying a policy ID handles the type of moderation and categories automatically. |
Advanced Configuration: Custom Categories and Additional Parameters
You can customize moderation behavior using additional parameters:
Response Format
The moderation API returns a JSON response with different structures depending on the moderation_mode
parameter:
Content Moderation Response
When using moderation_mode: "content_moderation"
(default), the API returns a JSON object with the following format:
Field | Type | Description |
---|---|---|
violation | number | Probability of violation (0.0 to 1.0) |
category | number | Index of the category of violation if detected |
mutation | boolean | Whether text formatting/mutation was detected |
language | string | Detected language code of the content |
Example:
Agentic Monitoring Response
When using moderation_mode: "agentic_monitoring"
, the API returns a JSON object with the following format:
Field | Type | Description |
---|---|---|
violation | number | Probability of violation (0.0 to 1.0) |
violated_rules | array | List of indices of the specific rules that were violated |
ipi | boolean | Indirect prompt injection detected (only for tool role messages) |
Example:
Example Response with No Violations
Violation scores closer to 1.0 indicate higher confidence that the content violates the specified policies. Consider implementing thresholds based on your application's risk tolerance.