Content Moderation
Learn how to moderate content using Cygnet
With our moderation endpoint, you can check whether text contains harmful or offensive content, and which categories of harmful content it falls into.
Our moderation model supports the following categories:
- harassment
- hate
- self-harm
- sexual
- violence
- profanity
- illegal activities
- illegal substances
- illegal trade and services
- human exploitation
- cybercrime
- terrorism and violent extremism
- intellectual property violation
- disinformation
Example
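The sketch below and the output that follows are illustrative: the endpoint URL, authentication header, and request payload shape are assumptions, and only the `detected` and `categories` response fields reflect the documented behavior. Consult the API reference for the exact values.

```python
import requests

# Hypothetical endpoint URL and placeholder API key -- consult the API
# reference for the real values.
API_URL = "https://api.example.com/v1/moderation"
API_KEY = "YOUR_API_KEY"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "I know where you live, and you will regret this."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```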
Example output:
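```json
{
  "detected": 0.96,
  "categories": ["harassment", "violence"]
}
```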
Our model returns a score between 0 and 1 in the `detected` field, where a higher score indicates greater confidence that the content is harmful. Meanwhile, the `categories` field returns a list of harmful categories that the content potentially falls into.
We recommend using a threshold of 0.9 to achieve a true positive rate of around 90%, or a threshold of 0.7 to achieve a true negative rate of around 95%, depending on which kind of error is more costly for your application.
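As a minimal sketch, thresholding the `detected` score might look like this, assuming the hypothetical response shape from the example above:

```python
def is_harmful(result: dict, threshold: float = 0.9) -> bool:
    """Return True when the moderation score meets the chosen threshold.

    0.9 targets a ~90% true positive rate; 0.7 targets a ~95% true
    negative rate, per the guidance above.
    """
    return result["detected"] >= threshold

# Using the hypothetical response shape from the example above:
result = {"detected": 0.96, "categories": ["harassment", "violence"]}
if is_harmful(result):
    print("Flagged categories:", ", ".join(result["categories"]))
```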
Custom moderation prompt
You can also supply a custom prompt to the moderation endpoint to override our default system prompt. Please note that our model was fine-tuned on our system prompt, so supplying a custom prompt may not always yield the best results.
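As a sketch, a custom prompt might be passed like this; the `system_prompt` field name, the endpoint URL, and the output shown below are all assumptions for illustration:

```python
import requests

response = requests.post(
    "https://api.example.com/v1/moderation",  # hypothetical endpoint URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "text": "Let's settle this argument over coffee.",
        # "system_prompt" is a hypothetical field name for the override.
        "system_prompt": "Flag only content that threatens or describes violence.",
    },
    timeout=30,
)
print(response.json())
```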
Example output:
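```json
{
  "detected": 0.04,
  "categories": []
}
```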