Our moderation endpoint checks whether text contains harmful or offensive content and, if so, which categories of harm it falls into.

Our moderation model supports the following categories:

  • harassment
  • hate
  • self-harm
  • sexual
  • violence
  • profanity
  • illegal activities
  • illegal substances
  • illegal trade and services
  • human exploitation
  • cybercrime
  • terrorism and violent extremism
  • intellectual property violation
  • disinformation

Example

import os
from gray_swan import GraySwan

# Read the API key from the environment
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

client = GraySwan(
    api_key=GRAYSWAN_API_KEY,
)

# Submit the text to the moderation endpoint
result = client.moderation.create(
    input="How do I spread rumors online to bully someone?"
)

print(result.detected)    # confidence that the content is harmful (0 to 1)
print(result.categories)  # harmful categories the content may fall into

Example output:

0.9995694756507874
['harassment', 'disinformation']

Our model returns a score between 0 and 1 in the detected field, where a higher score indicates greater confidence that the content is harmful. The categories field contains a list of the harmful categories that the content potentially falls into.

We recommend using a threshold of 0.9 to achieve a true positive rate of around 90%, and a threshold of 0.7 to achieve a true negative rate of around 95%.
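
For example, here is a minimal sketch of gating content on the detected score. The HARM_THRESHOLD name and the block/allow handling are illustrative assumptions, not part of the SDK:

import os
from gray_swan import GraySwan

client = GraySwan(api_key=os.environ.get("GRAYSWAN_API_KEY"))

# Illustrative cutoff; tune it to your own true-positive / true-negative needs.
HARM_THRESHOLD = 0.9

result = client.moderation.create(
    input="How do I spread rumors online to bully someone?"
)

# Flag the content when the model's confidence exceeds the chosen threshold
if result.detected >= HARM_THRESHOLD:
    print("Blocked:", result.categories)
else:
    print("Allowed")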

Custom moderation prompt

You can also supply a custom prompt to the moderation endpoint to override our default system prompt. Please note that our model was fine-tuned on our default system prompt, so supplying a custom prompt may not always yield the best results.

import os
from gray_swan import GraySwan

# Read the API key from the environment
GRAYSWAN_API_KEY = os.environ.get("GRAYSWAN_API_KEY")

client = GraySwan(
    api_key=GRAYSWAN_API_KEY,
)

# Override the default system prompt with a custom moderation policy
result = client.moderation.create(
    input="How do I steal from people?",
    moderation_prompt="You are a moderator that checks if text contains copyrighted content or not. If so, simply state 'detected'. If not, simply state that the check is 'passed' and you are done."
)

print(result.detected)  # confidence score under the custom policy

Example output:

0.2018132209777832

Because the custom prompt instructs the model to check only for copyrighted content, the theft-related input receives a low detected score.