Text Moderation Explanations

Text Moderation Explanations is a new feature that explains why our text moderation tools assigned a particular score to a text string. The API takes three inputs: the text string, its class label, and the score it was assigned for that class (the "severity" field in the example response). It returns a text string explaining why the input received that score for that class.

An example JSON response is shown below.

"response": {
    "Input": {
        "args": {
            "class": "bullying",
            "severity": "2"
        },
        "text": "You are the worst person I have ever met"
    },
    "output": [
        {
            "text": "This is a bullying 2 because it is a disparaging statement against an individual without profanity"
        }
    ]
}
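
As a rough illustration of consuming a response shaped like the one above, the following Python sketch pulls out the echoed inputs and the explanation strings. It assumes the "response" object arrives wrapped in an outer JSON object, that "severity" is always a numeric string, and that "output" may hold more than one entry. The helper name extract_explanations is hypothetical and not part of the API.

```python
import json

# Sample payload mirroring the shape of the example response. The outer
# braces are an assumption so that the fragment parses as standalone JSON.
raw = """
{
  "response": {
    "Input": {
      "args": {"class": "bullying", "severity": "2"},
      "text": "You are the worst person I have ever met"
    },
    "output": [
      {"text": "This is a bullying 2 because it is a disparaging statement against an individual without profanity"}
    ]
  }
}
"""


def extract_explanations(payload: dict) -> dict:
    """Hypothetical helper: flatten the echoed inputs and the explanations."""
    response = payload["response"]
    echoed = response["Input"]
    args = echoed["args"]
    return {
        "text": echoed["text"],
        "class": args["class"],
        # Assumption: severity is a numeric string, as in the example.
        "severity": int(args["severity"]),
        # Collect every explanation string; tolerate a missing "output" list.
        "explanations": [item["text"] for item in response.get("output", [])],
    }


result = extract_explanations(json.loads(raw))
for explanation in result["explanations"]:
    print(f'{result["class"]} (severity {result["severity"]}): {explanation}')
```

Converting the severity to an integer is a convenience for downstream comparisons. If the real API can return non-numeric severities, keep the value as a string instead.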