Content Moderation
Learn how to moderate content using Cygnet
With our moderation endpoint, you can check whether text contains harmful or offensive content, and which categories of harmful content it falls into.
Our moderation model supports the following categories:
- harassment
- hate
- self-harm
- sexual
- violence
- profanity
- illegal activities
- illegal substances
- illegal trade and services
- human exploitation
- cybercrime
- terrorism and violent extremism
- intellectual property violation
- disinformation
Example
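The sketch below and the output that follows are illustrative: the endpoint URL, authentication header, and request payload shape are assumptions, and only the `detected` and `categories` response fields reflect the documented behavior. Consult the API reference for the exact values.

```python
import requests

# Hypothetical endpoint URL and placeholder API key -- consult the API
# reference for the real values.
API_URL = "https://api.example.com/v1/moderation"
API_KEY = "YOUR_API_KEY"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "I know where you live, and you will regret this."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```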
Example output:
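```json
{
  "detected": 0.96,
  "categories": ["harassment", "violence"]
}
```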
Our model returns a score between 0 and 1 in the `detected` field, where a higher score indicates greater confidence that the content is harmful. Meanwhile, the `categories` field returns a list of harmful categories that the content potentially falls into.
We recommend using a threshold of 0.9 to achieve a true positive rate of around 90%, or a threshold of 0.7 to achieve a true negative rate of around 95%, depending on which kind of error is more costly for your application.
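As a minimal sketch, thresholding the `detected` score might look like this, assuming the hypothetical response shape from the example above:

```python
def is_harmful(result: dict, threshold: float = 0.9) -> bool:
    """Return True when the moderation score meets the chosen threshold.

    0.9 targets a ~90% true positive rate; 0.7 targets a ~95% true
    negative rate, per the guidance above.
    """
    return result["detected"] >= threshold

# Using the hypothetical response shape from the example above:
result = {"detected": 0.96, "categories": ["harassment", "violence"]}
if is_harmful(result):
    print("Flagged categories:", ", ".join(result["categories"]))
```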
Custom moderation prompt
You can also supply a custom prompt to the moderation endpoint to override our default system prompt. Please note that our model was fine-tuned on our system prompt, so supplying a custom prompt may not always yield the best results.
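As a sketch, a custom prompt might be passed like this; the `system_prompt` field name, the endpoint URL, and the output shown below are all assumptions for illustration:

```python
import requests

response = requests.post(
    "https://api.example.com/v1/moderation",  # hypothetical endpoint URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "text": "Let's settle this argument over coffee.",
        # "system_prompt" is a hypothetical field name for the override.
        "system_prompt": "Flag only content that threatens or describes violence.",
    },
    timeout=30,
)
print(response.json())
```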
Example output:
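```json
{
  "detected": 0.04,
  "categories": []
}
```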