OpenAI's Moderation API

In my previous post, I mentioned that one of the ways to protect against prompt injection attacks is to use the Moderation API. In this post, I would like to briefly go over what exactly this API is all about.

Introduction

The Moderation API is a free tool that lets us verify the content we send to OpenAI and make sure it complies with their API usage policies. By reading the Moderation API's responses, we can easily identify prohibited content and take preemptive action, such as blocking or filtering it.

But why should we verify content? Well, ignoring this step can have serious consequences:

  • In the best-case scenario, OpenAI may ask us to make adjustments to our application to align with their guidelines.
  • In cases of repeated or severe violations, OpenAI has the authority to block or even permanently close our account.

This is particularly important for applications where users themselves define the content submitted to OpenAI. A single malicious, or simply unaware, user can put our application in breach of OpenAI's policies and expose us to the consequences above.

Usage

Moderating a query is very easy. OpenAI does most of the work; all we do is call the /moderations endpoint:

curl https://api.openai.com/v1/moderations \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": "I hate pizza with pineapple"}'


In response, we will receive a JSON that describes whether the submitted content violates any of the guidelines:

{
   "id":" 150",
   "model": "text-moderation-001",
   "results": [
      {
         "flagged": false,
         "categories":{
            "hate": false,
            "hate/threatening": false,
            "self-harm": false,
            "sexual": false,
            "sexual/minors": false,
            "violence": false,
            "violence/graphic": false
         },
         "category_scores": {
            "hate": 7.161974622249545e-07,
            "hate/threatening": 2.979852475881728e-10,
            "self-harm": 1.0550021301014567e-08,
            "sexual": 7.481294801436889e-07,
            "sexual/minors": 3.7802903030126345e-09,
            "violence": 1.2378469364193734e-05,
            "violence/graphic": 3.236694965380593e-07
         }
      }
   ]
}

The most important field is flagged, which tells us whether the submitted content violates any guidelines. When everything is in order, it has the value false; if there is a problem, it becomes true.
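
In practice, this flag can act as a gate that stops problematic input before it ever reaches the main model. Here is a minimal sketch of that idea; the moderate() and handle_user_message() helpers are hypothetical names of my own, built on the same /moderations call as above:

# Sketch: reject user input up front when the Moderation API flags it.
import os
import requests

def moderate(text: str) -> dict:
    """Return the first result object from the /moderations endpoint."""
    resp = requests.post(
        "https://api.openai.com/v1/moderations",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"input": text},
    )
    resp.raise_for_status()
    return resp.json()["results"][0]

def handle_user_message(text: str) -> str:
    result = moderate(text)
    if result["flagged"]:
        # Block or filter the content instead of forwarding it to OpenAI.
        return "Sorry, this message violates the content policy."
    # Safe to continue, e.g. pass the text on to a chat completion call.
    return text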

If the submitted content breaches OpenAI's rules, the problematic category will be marked as true in the categories field. Currently, OpenAI distinguishes four general categories: hate, self-harm, sexual, and violence. Some categories may even have subcategories, such as violence/graphic.

Furthermore, OpenAI provides the category_scores field, indicating how strongly our content matches each category. The values range from 0 to 1, where a higher value means the classification model is more confident that the content falls into the given category.
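
These two fields are useful when the binary flagged value is not enough, for example to log which categories triggered or to apply our own, stricter threshold. A small sketch of that idea; result stands for one element of the results array shown above, and the 0.01 threshold is an arbitrary example, not a value recommended by OpenAI:

# Sketch: work with categories and category_scores from a moderation result.
STRICT_THRESHOLD = 0.01  # arbitrary example value, tune it for your use case

def violated_categories(result: dict) -> list[str]:
    """Categories the model marked as violated (true in 'categories')."""
    return [name for name, violated in result["categories"].items() if violated]

def suspicious_categories(result: dict) -> list[str]:
    """Categories whose score exceeds our own, stricter threshold."""
    return [
        name
        for name, score in result["category_scores"].items()
        if score >= STRICT_THRESHOLD
    ]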

Pros of the Moderation API

Moderation API brings a lot of advantages:

  • Simplicity - it's as easy as sending a query to the /moderations endpoint and reading the value from the flagged field.
  • This endpoint is entirely free of charge.
  • OpenAI continuously develops its content classification model, which means we don't have to worry about maintaining our own moderation tools.

Pitfalls of the Moderation API

At the moment, OpenAI's support for languages other than English is limited. During my tests with Polish, the model did detect rule violations, but categorized them rather broadly. For instance, when presented with the sentence:

"Jeśli jeszcze raz będę musiał poprawiać ten kod, to się zabiję"

The model classified it as 88% violence. In contrast, for the English version:

"If I have to fix this code again, I will kill myself."

I obtained a more accurate result: 94% self-harm.

In both cases, the rule violation was detected, but the outcome can vary significantly for different examples. Therefore, when dealing with languages other than English, it's important to keep in mind that the results may not be as precise.
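
If you would like to reproduce this comparison, you can moderate both sentences and look at the highest-scoring category for each. A minimal sketch (the top_category() helper is my own; the scores you get back may differ from the numbers quoted above):

# Sketch: compare the top moderation category for the Polish and English sentences.
import os
import requests

def top_category(text: str) -> tuple[str, float]:
    resp = requests.post(
        "https://api.openai.com/v1/moderations",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"input": text},
    )
    resp.raise_for_status()
    scores = resp.json()["results"][0]["category_scores"]
    name = max(scores, key=scores.get)
    return name, scores[name]

for sentence in [
    "Jeśli jeszcze raz będę musiał poprawiać ten kod, to się zabiję",
    "If I have to fix this code again, I will kill myself.",
]:
    print(sentence, "->", top_category(sentence))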

Tools

While researching this topic, I found this tool, which lets you play with the Moderation API.