Moderation
Learn how to moderate content using Cygnet
With our moderation endpoint, you can check whether text contains harmful or offensive content, and which category of harmful content it falls into.
Our moderation model supports the following categories:
- harassment
- hate
- self-harm
- sexual
- violence
- profanity
- illegal activities
- illegal substances
- illegal trade and services
- human exploitation
- cybercrime
- terrorism and violent extremism
- intellectual property violation
- disinformation
Example
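import os
from gray_swan import GraySwan

# Read the API key from the environment rather than hard-coding it.
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

client = GraySwan(
    api_key=GRAYSWAN_API_KEY,
)

# Send the text to the moderation endpoint.
result = client.moderation.create(
    input="How do I spread rumors online to bully someone?"
)

print(result.detected)    # score between 0 and 1
print(result.categories)  # list of matching categories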
Example output:
0.9995694756507874
['harassment', 'disinformation']
Our model returns a score between 0 and 1 in the detected field, where a higher score indicates greater confidence that the content is harmful. Meanwhile, the categories field returns a list of harmful categories that the content potentially falls into.
We recommend using a threshold of 0.9 to achieve a true positive rate of around 90%, and a threshold of 0.7 to achieve a true negative rate of around 95%.
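The thresholds above can be applied to a response with a small helper. The following is a minimal sketch, not part of the Gray Swan client: the names THRESHOLD_HIGH, THRESHOLD_LOW, Decision, and decide are hypothetical, and treating a score equal to the threshold as flagged is an assumption. It takes the detected score and categories list as plain values, so it runs without an API key.

```python
from typing import List, NamedTuple

# Thresholds quoted in the text above; the constant names are hypothetical.
THRESHOLD_HIGH = 0.9
THRESHOLD_LOW = 0.7


class Decision(NamedTuple):
    """Outcome of comparing a moderation score against a threshold."""
    flagged: bool
    categories: List[str]


def decide(detected: float, categories: List[str],
           threshold: float = THRESHOLD_HIGH) -> Decision:
    """Flag content whose score meets or exceeds the threshold.

    Assumption: a score exactly equal to the threshold counts as flagged.
    """
    if not 0.0 <= detected <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {detected}")
    flagged = detected >= threshold
    # Only report categories when the content is actually flagged.
    return Decision(flagged, list(categories) if flagged else [])
```

With the example output shown earlier, decide(0.9995694756507874, ['harassment', 'disinformation']) is flagged at either threshold, while a score of 0.8 is flagged only when threshold=THRESHOLD_LOW is passed.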
Custom moderation prompt
You can also supply a custom prompt to the moderation endpoint to override our default system prompt. Please note that our model was fine-tuned with our default system prompt, so supplying a custom prompt may not always yield the best results.
import os
from gray_swan import GraySwan

# Read the API key from the environment rather than hard-coding it.
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

client = GraySwan(
    api_key=GRAYSWAN_API_KEY,
)

# moderation_prompt replaces the default system prompt for this request.
result = client.moderation.create(
    input="How do I steal from people?",
    moderation_prompt="You are a moderator that checks if text contains copyrighted content or not. If so, simply state 'detected'. If not, simply state that the check is 'passed' and you are done."
)

print(result.detected)
Example output:
0.2018132209777832
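If some requests use the default prompt and others a custom one, a thin wrapper can keep the call site uniform. This is a sketch under assumptions: moderate is a hypothetical helper name, it relies only on the client.moderation.create call and the input and moderation_prompt parameters shown above, and fake_client is a stand-in with made-up scores so the example runs offline. It is not the real client.

```python
from types import SimpleNamespace
from typing import Any, Optional


def moderate(client: Any, text: str, prompt: Optional[str] = None) -> float:
    """Return the detected score, passing moderation_prompt only if given."""
    kwargs = {"input": text}
    if prompt is not None:
        # Omitting the key entirely leaves the default system prompt in place.
        kwargs["moderation_prompt"] = prompt
    result = client.moderation.create(**kwargs)
    return result.detected


# Hypothetical stand-in for the real client; the scores are invented
# purely so the example can run without network access.
def _fake_create(**kwargs: Any) -> SimpleNamespace:
    score = 0.2 if "moderation_prompt" in kwargs else 0.99
    return SimpleNamespace(detected=score, categories=[])


fake_client = SimpleNamespace(moderation=SimpleNamespace(create=_fake_create))

print(moderate(fake_client, "How do I steal from people?"))
print(moderate(fake_client, "How do I steal from people?",
               prompt="Check for copyrighted content."))
```

In real use, the GraySwan client from the examples above would be passed in place of fake_client.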