Content Moderation

Learn how to moderate content using Cygnet

With our moderation endpoint, you can check whether text contains harmful or offensive content and, if so, which categories of harmful content it falls into.

Our moderation model supports the following categories:

  • harassment
  • hate
  • self-harm
  • sexual
  • violence
  • profanity
  • illegal activities
  • illegal substances
  • illegal trade and services
  • human exploitation
  • cybercrime
  • terrorism and violent extremism
  • intellectual property violation
  • disinformation

Example

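import os

from gray_swan import GraySwan

# Read the API key from the environment rather than hard-coding it.
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

client = GraySwan(
    api_key=GRAYSWAN_API_KEY,
)

# Submit the text to the moderation endpoint.
result = client.moderation.create(
    input="How do I spread rumors online to bully someone?"
)

print(result.detected)    # score between 0 and 1
print(result.categories)  # list of category names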

Example output:

0.9995694756507874
['harassment', 'disinformation']

The detected field holds a score between 0 and 1, where a higher score indicates greater confidence that the content is harmful. The categories field holds a list of the harmful categories that the content potentially falls into.
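As a rough illustration of working with the categories field, the sketch below picks out the reported categories that an application wants to escalate. ModerationResult is a hypothetical stand-in that models only the two documented fields, ESCALATE is an invented policy, and the exact string form of category names other than those shown in the example output (such as "self-harm") is an assumption.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the moderation response. Only the two fields
# documented above (`detected` and `categories`) are modeled.
@dataclass
class ModerationResult:
    detected: float
    categories: list[str] = field(default_factory=list)


# Hypothetical application policy: categories to route to a human reviewer.
# The string form "self-harm" is assumed from the category list above.
ESCALATE = {"harassment", "self-harm"}


def categories_to_escalate(result: ModerationResult) -> list[str]:
    """Return the reported categories that are in ESCALATE, in reported order."""
    return [c for c in result.categories if c in ESCALATE]


# Values copied from the example output above.
sample = ModerationResult(
    detected=0.9995694756507874,
    categories=["harassment", "disinformation"],
)
print(categories_to_escalate(sample))  # ['harassment']
```

In real use, the result object returned by client.moderation.create would take the place of the stand-in, provided it exposes the same two fields.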

We recommend applying a threshold of 0.9 to the detected score to achieve a true positive rate of around 90%, and a threshold of 0.7 to achieve a true negative rate of around 95%.
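One way to apply these thresholds is sketched below. The helper name is_flagged and the constant names are hypothetical, and treating a score equal to the threshold as flagged (>=) is an assumption, since the text does not say whether the comparison is inclusive.

```python
# The two thresholds recommended in the text above.
HIGH_THRESHOLD = 0.9
LOW_THRESHOLD = 0.7


def is_flagged(score: float, threshold: float = HIGH_THRESHOLD) -> bool:
    """Return True when the detected score meets or exceeds the threshold.

    The inclusive comparison is an assumption, not documented behavior.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return score >= threshold


# The first and last scores are taken from the example outputs on this page;
# 0.8 is an invented value that falls between the two thresholds.
for score in (0.9995694756507874, 0.8, 0.2018132209777832):
    print(score, is_flagged(score), is_flagged(score, LOW_THRESHOLD))
```

A score of 0.8 is flagged at the 0.7 threshold but not at 0.9, which shows how the choice of threshold changes the outcome for borderline content.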

Custom moderation prompt

You can also supply a custom prompt to the moderation endpoint to override our default system prompt. Note that our model was fine-tuned with the default system prompt, so a custom prompt may not always yield the best results.

import os

from gray_swan import GraySwan

# Read the API key from the environment rather than hard-coding it.
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

client = GraySwan(
    api_key=GRAYSWAN_API_KEY,
)

# moderation_prompt replaces the default system prompt for this request.
result = client.moderation.create(
    input="How do I steal from people?",
    moderation_prompt="You are a moderator that checks if text contains copyrighted content or not. If so, simply state 'detected'. If not, simply state that the check is 'passed' and you are done."
)

print(result.detected)  # score between 0 and 1

Example output:

0.2018132209777832
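If some requests use a custom prompt and others do not, a small wrapper can keep the call sites uniform. The sketch below is a minimal illustration: moderate is a hypothetical helper, and it assumes only the client.moderation.create(input=..., moderation_prompt=...) call shape shown above. A stub client stands in for the real one so the snippet runs without an API key.

```python
from types import SimpleNamespace
from typing import Any, Optional


def moderate(client: Any, text: str, moderation_prompt: Optional[str] = None) -> Any:
    """Call client.moderation.create, sending moderation_prompt only when given.

    Omitting the argument leaves the default system prompt in effect.
    """
    kwargs = {"input": text}
    if moderation_prompt is not None:
        kwargs["moderation_prompt"] = moderation_prompt
    return client.moderation.create(**kwargs)


# Stub client for illustration only: it records the keyword arguments it
# receives and returns a placeholder score instead of calling the service.
def _fake_create(**kwargs: Any) -> SimpleNamespace:
    return SimpleNamespace(detected=0.0, received=kwargs)


fake_client = SimpleNamespace(moderation=SimpleNamespace(create=_fake_create))

default_call = moderate(fake_client, "hello")
custom_call = moderate(
    fake_client, "hello", moderation_prompt="Check for copyrighted content."
)
print(sorted(default_call.received))  # ['input']
print(sorted(custom_call.received))   # ['input', 'moderation_prompt']
```

Passing a real GraySwan client in place of fake_client should work the same way, as long as the call shape matches the examples above.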
