Text Moderation
Text Moderation Overview
Hive currently offers a suite of text moderation tools that help platforms detect different kinds of undesirable content including but not limited to: sexual, hate, violence, bullying, promotions, and links to external sites. Today, we support text moderation in more than 15 languages, and our text models are also trained to understand the semantic meaning of emojis.
Hive's text content moderation API takes a two-pronged approach to moderate text:
- A deep learning-based text classification model to moderate text based on semantic meaning.
- Rule-based character pattern-matching algorithms to flag specific words and phrases.
The text classification model is trained on a large proprietary corpus of labeled data spanning multiple domains (including but not limited to social media, chat, and livestreaming apps), and is able to interpret full sentences with linguistic subtleties. The pattern-matching algorithms search sentences for a set of predefined patterns that are commonly associated with harmful content. We also offer users the option to add their own rules.
Both our models and pattern-matching approaches are robust to and explicitly tested on character replacements, character duplication, leetspeak, misspellings, and other common adversarial user behaviors.
By default, the API accepts inputs of up to 1024 characters; this limit can be extended in special situations.
NOTE:
A walkthrough of how to send text to our API and how you might use our model responses can be found in this guide.
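For reference, here is a minimal request sketch in Python. The endpoint URL, header format, and field name are assumptions based on Hive's synchronous task API and may differ for your project; the walkthrough guide covers the exact request format.

```python
import requests

# Minimal sketch only: the endpoint, auth header, and "text_data" field are
# assumptions; confirm the exact request format in the walkthrough guide.
API_KEY = "<YOUR_API_KEY>"

response = requests.post(
    "https://api.thehive.ai/api/v2/task/sync",          # assumed synchronous endpoint
    headers={"Authorization": f"Token {API_KEY}"},      # assumed token-based auth header
    data={"text_data": "Text to moderate goes here"},   # input capped at 1024 characters by default
    timeout=10,
)
response.raise_for_status()
result = response.json()  # contains classification scores and pattern-matching results
```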
Text Classification Models
Multilevel Model
Our multilevel models output different levels of moderation to empower our customers to make more refined moderation decisions. We offer four heads (sexual, hate, violence, bullying) that contain four classes each and one head (spam) that contains two classes. Classes are ordered by severity, ranging from level 3 (most severe) to level 0 (benign). If a certain head is not supported for a given language, you will receive a score of -1 for that head.
Sexual
- 3: Intercourse, masturbation, porn, sex toys and genitalia
- 2: Sexual intent, nudity and lingerie
- 1: Informational statements that are sexual in nature, affectionate activities (kissing, hugging, etc.), flirting, pet names, relationship status, sexual insults and rejecting sexual advances
- 0: The text does not contain any of the above
Hate
- 3: Slurs, hate speech, promotion of hateful ideology
- 2: Negative stereotypes or jokes, degrading comments, denouncing slurs, challenging a protected group's morality or identity, violence against religion
- 1: Positive stereotypes, informational statements, reclaimed slurs, references to hateful ideology, immorality of protected group's rights
- 0: The text does not contain any of the above
Violence
- 3: Serious and realistic threats, mentions of past violence
- 2: Calls for violence, destruction of property, calls for military action, calls for the death penalty outside a legal setting, mentions of self-harm/suicide
- 1: Denouncing acts of violence, soft threats (kicking, punching, etc.), violence against non-human subjects, descriptions of violence, gun usage, abortion, self-defense, calls for capital punishment in a legal setting, destruction of small personal belongings, violent jokes
- 0: The text does not contain any of the above
Bullying
- 3: Slurs or profane descriptors toward specific individuals, encouraging suicide or severe self-harm, severe violent threats toward specific individuals
- 2: Non-profane insults toward specific individuals, encouraging non-severe self-harm, non-severe violent threats toward specific individuals, silencing or exclusion
- 1: Profanity in a non-bullying context, playful teasing, self-deprecation, reclaimed slurs, degrading a person's belongings, bullying toward organizations, denouncing bullying
- 0: The text does not contain any of the above
Spam
- 3: The text is intended to redirect a user to a different platform.
- 0: The text does not include the above.
Promotions - Beta
- 3: Asking for likes/follows/shares, advertising monthly newsletters/special promotions, asking for donations/payments, advertising products, selling pornography, giveaways
- 0: The text does not include the above.
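As an illustration of how these severity levels might be consumed, the sketch below maps each head's returned level to a simple policy decision. The response layout and the thresholds used here are assumptions for illustration; the actual JSON structure is documented in the API reference.

```python
def moderation_action(head_scores: dict) -> str:
    """Map multilevel severity scores (3 = most severe, 0 = benign, -1 = head
    not supported for the detected language) to a simple policy decision.
    This is an illustrative policy, not a recommendation."""
    scored = [level for level in head_scores.values() if level >= 0]  # ignore unsupported heads
    worst = max(scored, default=0)
    if worst >= 3:
        return "remove"   # e.g. slurs, explicit sexual content, serious threats
    if worst == 2:
        return "review"   # borderline content routed to human moderators
    return "allow"        # level 0-1 content passes through

# Hypothetical scores for a single request
print(moderation_action({"sexual": 0, "hate": 2, "violence": 0, "bullying": 0, "spam": -1}))  # "review"
```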
Supported Languages
The API returns the classified language for each request, using ISO 639-1 language codes. The response will indicate which classes were moderated based on the supported languages below. Classes for supported languages return standard confidence scores, while classes for non-supported languages return 0 under those class scores. Unsupported languages and non-language inputs (such as code or gibberish) are classified as an 'unsupported' language.
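One way to fold the detected language into downstream logic is sketched below; the field name used for the detected language is an assumption for illustration.

```python
def model_scores_usable(detected_language: str) -> bool:
    """Model-based scores are only meaningful for supported languages; inputs
    classified as 'unsupported' (e.g. code or gibberish) can fall back to the
    pattern-matching results alone."""
    return detected_language != "unsupported"
```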
Pattern-Matching Algorithms
Hive's string pattern-matching algorithms return the filter type that was matched, the substring that matched the filter, and the position of that substring within the text input from the API request.
The string pattern-matching algorithms currently support the following categories:
Filter Type | Description |
---|---|
Profanity | |
Personal Identifiable Information (PII) | |
Custom (User-Provided) | |
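A sketch of consuming these pattern-match results follows. The key names (`type`, `value`, `start_index`, `end_index`) are assumptions for illustration only; the exact response fields are listed in the API reference.

```python
def redact(text: str, matches: list) -> str:
    """Replace each matched substring with asterisks, using the reported positions.
    Each match is assumed to look like:
    {"type": "profanity", "value": "badword", "start_index": 12, "end_index": 19}"""
    chars = list(text)
    for m in matches:
        for i in range(m["start_index"], m["end_index"]):
            chars[i] = "*"
    return "".join(chars)
```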
Adding Custom Classes to Pattern-Matching
Custom (user-provided) text filters can be configured and maintained directly by the API user from the API Dashboard. A user can have one or more custom classes. Each custom class can have one or more custom text matches, which are the words that should be matched under that class.
For example:
Custom Class | Custom Text Matches |
---|---|
Nationality | American, Canadian, Chinese, ... |
Race | African American, Asian, Hispanic, ... |
Users can select the Detect Subwords box if they would like any of their custom text matches to be detected whenever they are part of a larger string. For example, if the word class was part of the custom text matches, then the word classification would be flagged as well.
Character substitution is another feature Hive offers that is especially helpful for flagging leetspeak. A user defines the characters they would like substituted in the left column of the Custom Character Substitution table, then provides a comma-separated list of replacement strings in the right column.
For example:
Characters to be substituted | Substitution Strings (Comma separated) |
---|---|
o | 0,(),@ |
s | $ |
Based on the example above, we would flag any input that matches a string in Custom Text Matches after substituting all occurrences of 0, (), and @ with o and all occurrences of $ with s.
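Conceptually, the substitution works in the direction described above: each replacement string is folded back into its canonical character before the custom text matches are applied. The sketch below illustrates that idea; it is not Hive's implementation.

```python
# From the example table above: canonical character -> replacement strings
SUBSTITUTIONS = {"o": ["0", "()", "@"], "s": ["$"]}
CUSTOM_TEXT_MATCHES = {"spoons"}  # hypothetical entry in a custom class

def normalize(text: str) -> str:
    """Fold replacement strings back into their canonical characters."""
    for canonical, variants in SUBSTITUTIONS.items():
        for variant in variants:
            text = text.replace(variant, canonical)
    return text.lower()

print(normalize("sp()0n$") in CUSTOM_TEXT_MATCHES)  # True: "sp()0n$" normalizes to "spoons"
```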
Adding Custom "Allowlists"
Custom allowlists permit users to whitelist text and prevent our models from flagging that content. Each custom allowlist can have one or more allowlist strings that our model will not analyze.
For example:
Allowlist Name | Allowlist Strings |
---|---|
URLs | facebook.com, google.com, youtube.com |
Names | Andrew, Jack, Ben |
Users can select the Detect Subwords box if they would like any of their allowlist strings to be ignored when part of a larger string. For example, if the word cup was part of allowlist strings, then cup would be ignored in cupcakes. If Detect Subwords is selected, another option called Allow Entire Subword will appear. If Allow Entire Subword is selected then in the scenario above the entire word cupcakes would be ignored.
Additional functionality is available for allowlisting URLs. If Detect Inside URLs is selected, any string in the allowlist will be ignored as long as it is part of a URL. For example, if google.com is in the allowlist, it will be ignored when it is part of https://www.google.com/search?q=something. If Detect Inside URLs is selected, another box titled Allow Entire URL appears. When Allow Entire URL is selected, if a URL contains any words in the allowlist then the entire URL will be ignored. Adding domains to an allowlist and selecting the Detect Inside URLs and Allow Entire URL options will prevent URLs with those domains from being flagged as spam.
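The sketch below illustrates the Detect Inside URLs and Allow Entire URL behavior described above; it is a conceptual illustration only, not Hive's implementation.

```python
ALLOWLIST_STRINGS = {"facebook.com", "google.com", "youtube.com"}

def allow_entire_url(url: str) -> bool:
    """With Detect Inside URLs and Allow Entire URL both selected, a URL is
    ignored entirely as soon as any allowlist string appears inside it."""
    return any(entry in url for entry in ALLOWLIST_STRINGS)

print(allow_entire_url("https://www.google.com/search?q=something"))  # True: contains google.com
print(allow_entire_url("https://example.net/some/path"))              # False: no allowlist match
```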