Text Moderation

Text Moderation Overview

Hive currently offers a suite of text moderation tools that help platforms detect different kinds of undesirable content, including but not limited to: sexual content, hate speech, violence, bullying, promotions, and links to external sites. Today, we support text moderation in over 15 languages, and our text models are also trained to understand the semantic meaning of emojis.

Hive's text content moderation API takes a two-pronged approach to moderate text:

  1. A deep learning-based text classification model to moderate text based on semantic meaning.
  2. Rule-based character pattern-matching algorithms to flag specific words and phrases.

The text classification model is trained on a large proprietary corpus of labeled data across multiple domains (including but not limited to social media, chat, and livestreaming apps), and is able to interpret full sentences with their linguistic subtleties. The pattern-matching algorithms search text for a set of predefined patterns that are commonly associated with harmful content. Users can also define their own custom rules to extend these patterns.

Both our models and pattern-matching approaches are robust to and explicitly tested on character replacements, character duplication, leetspeak, misspellings, and other common adversarial user behaviors.

By default, the API has a maximum input size of 1024 characters; this limit can be extended in special situations.
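Because over-length input is rejected by default, it can be useful to enforce the character limit client-side before submitting. Below is a minimal sketch; the `build_payload` helper and the `text_data` field are hypothetical illustrations, not Hive's actual request format, which is documented in the API reference.

```python
MAX_INPUT_CHARS = 1024  # default API limit; extendable in special situations

def prepare_text(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Trim input to the default character limit before submission."""
    return text[:max_chars]

def build_payload(text: str) -> dict:
    # Hypothetical payload shape for illustration only; consult the API
    # reference for the actual request format and authentication headers.
    return {"text_data": prepare_text(text)}

payload = build_payload("x" * 2000)
print(len(payload["text_data"]))  # 1024
```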

📘

NOTE:

A walkthrough of how to send text to our API and how you might use our model responses can be found in this guide.

Text Classification Model

Multilevel Classes

Our multilevel classes output different levels of moderation to empower our customers to make more refined moderation decisions. We offer five heads (sexual, hate, violence, bullying, drugs) that contain four classes each, and six heads (child exploitation, child safety, gibberish, spam, promotions, phone numbers) that contain two classes each. Classes are ordered by severity, ranging from level 3 (most severe) to level 0 (benign). If a head is not supported for a given language, it will return a score of -1.

Sexual

  • 3: Intercourse, masturbation, porn, sex toys and genitalia
  • 2: Sexual intent, nudity and lingerie
  • 1: Informational statements that are sexual in nature, affectionate activities (kissing, hugging, etc.), flirting, pet names, relationship status, sexual insults and rejecting sexual advances
  • 0: The text does not contain any of the above

Hate

  • 3: Slurs, hate speech, promotion of hateful ideology
  • 2: Negative stereotypes or jokes, degrading comments, denouncing slurs, challenging a protected group's morality or identity, violence against religion
  • 1: Positive stereotypes, informational statements, reclaimed slurs, references to hateful ideology, immorality of protected group's rights
  • 0: The text does not contain any of the above

Violence

  • 3: Serious and realistic threats, mentions of past violence
  • 2: Calls for violence, destruction of property, calls for military action, calls for the death penalty outside a legal setting, mentions of self-harm/suicide
  • 1: Denouncing acts of violence, soft threats (kicking, punching, etc.), violence against non-human subjects, descriptions of violence, gun usage, abortion, self-defense, calls for capital punishment in a legal setting, destruction of small personal belongings, violent jokes
  • 0: The text does not contain any of the above

Bullying

  • 3: Slurs or profane descriptors toward specific individuals, encouraging suicide or severe self-harm, severe violent threats toward specific individuals
  • 2: Non-profane insults toward specific individuals, encouraging non-severe self-harm, non-severe violent threats toward specific individuals, silencing or exclusion
  • 1: Profanity in a non-bullying context, playful teasing, self-deprecation, reclaimed slurs, degrading a person's belongings, bullying toward organizations, denouncing bullying
  • 0: The text does not contain any of the above

Drugs

  • 3: Descriptions of the acquisition of drugs and text that explicitly promotes, advertises, or encourages drug use
  • 2: References to past drug acquisition or use, as well as descriptions of recreational use that do not promote drugs to others
  • 1: Language around drugs that is neutral or informational, discouraging, or ambiguous in meaning
  • 0: The text does not contain any of the above
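One common way to act on multilevel scores is to take, for each head, the most severe level whose confidence clears a threshold. The sketch below uses a made-up flat response shape (`head_level` keys) purely for illustration; the real response format is described in the API reference.

```python
# Hypothetical per-level confidence scores for one head ("sexual"),
# keyed as "<head>_<level>". Real API responses may be shaped differently.
scores = {"sexual_0": 0.05, "sexual_1": 0.10, "sexual_2": 0.80, "sexual_3": 0.05}

def severity(head: str, scores: dict, threshold: float = 0.5) -> int:
    """Return the most severe level (3 down to 0) whose score clears the
    threshold, falling back to 0 (benign) if none does."""
    for level in (3, 2, 1, 0):
        if scores.get(f"{head}_{level}", 0.0) >= threshold:
            return level
    return 0

print(severity("sexual", scores))  # with these made-up scores: 2
```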

Binary Classes

Child Exploitation

  • 3: Asking for or trading child pornography (CP) or related links, mentioning proclivity for CP, identifiably underage users soliciting sex or pornography, roleplay involving children, mentions of sexual activity or sexual fetishes involving children
  • 0: The text does not include any of the above

Child Safety

  • 3: Content that contains a direct or indirect threat of physical violence to children in a school or school-related setting
  • 0: The text does not include any of the above

Gibberish

  • 3: Keyboard spam and phrases or words that are completely incomprehensible (e.g., "kgvjbwklrgjb", "ef2$gt rgbu")
  • 0: The text does not include the above

Spam

  • 3: The text is intended to redirect a user to a different platform, including email addresses, phone numbers, and certain links
  • 0: The text does not include the above, OR it includes a link to an allowlisted domain (i.e., popular, reputable sites)

Promotions

  • 3: Asking for likes/follows/shares, advertising monthly newsletters/special promotions, asking for donations/payments, advertising products, selling pornography, giveaways
  • 0: The text does not include the above

Phone Numbers

  • 3: The text includes a phone number
  • 0: The text does not include a phone number

Supported Languages

The API returns the classified language for each request, using ISO 639-1 language codes. The response indicates which classes were moderated based on the supported languages below. Classes for supported languages return standard confidence scores, while classes for unsupported languages return -1. Unsupported languages and non-language inputs (such as code or gibberish) will be classified as an 'unsupported' language.
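Since unsupported heads return -1 rather than a confidence score, downstream code should filter them out before thresholding. A small sketch, again assuming a hypothetical flat score dictionary rather than Hive's actual response schema:

```python
def supported_scores(scores: dict) -> dict:
    """Drop heads the classified language does not support (scored -1)."""
    return {head: score for head, score in scores.items() if score != -1}

# Made-up scores for a language where only some heads are supported.
raw = {"sexual": 0.92, "hate": -1, "violence": 0.03}
print(supported_scores(raw))  # {'sexual': 0.92, 'violence': 0.03}
```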

| Language | Sexual | Violence | Hate | Bullying | Promotions | Spam | Phone Numbers | Child Exploitation | Child Safety | Drugs | Gibberish |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| English | Model | Model | Model | Model | Model | Pattern Match | Model | Model | Model | Model | Model |
| Spanish | Model | Model | Model | Model | - | - | - | - | - | - | - |
| Hindi | Model | Model | Model | Model | - | - | - | - | - | - | - |
| French | Model | Model | Model | Model | - | - | - | - | - | - | - |
| Portuguese | Model | Model | Model | Model | - | - | - | - | - | - | - |
| Arabic | Model | Model | '23 Q1 | Model | - | - | - | - | - | - | - |
| German | Model | - | '23 Q1 | Model | - | - | - | - | - | - | - |
| Italian | Model | - | - | - | - | - | - | - | - | - | - |
| Turkish | Pattern Match | - | - | - | - | - | - | - | - | - | - |
| Chinese | Model | Pattern Match | Pattern Match | Pattern Match | - | - | - | - | - | - | - |
| Russian | Pattern Match | - | - | - | - | - | - | - | - | - | - |
| Dutch | Pattern Match | - | - | - | - | - | - | - | - | - | - |
| Korean | Pattern Match | - | - | - | - | - | - | - | - | - | - |
| Japanese | Model | - | - | - | - | - | - | - | - | - | - |
| Vietnamese | Pattern Match | - | - | - | - | - | - | - | - | - | - |
| Romanian | Pattern Match | - | - | - | - | - | - | - | - | - | - |
| Polish | Pattern Match | - | - | - | - | - | - | - | - | - | - |

Pattern-Matching Algorithms

Hive's string pattern-matching algorithms return the filter type that was matched, the substring that matched the filter, and the position of that substring in the text input from the API request.
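The matched substring and its position make it straightforward to highlight or redact flagged spans. The sketch below assumes a simplified match shape (filter type, matched value, start index), not the exact wire format returned by the API:

```python
def redact(text: str, matches: list) -> str:
    """Replace each matched substring with asterisks, working right-to-left
    so earlier start indices remain valid as the text is rewritten."""
    for m in sorted(matches, key=lambda m: m["start"], reverse=True):
        start, end = m["start"], m["start"] + len(m["value"])
        text = text[:start] + "*" * (end - start) + text[end:]
    return text

# Hypothetical pattern-match result for a PII (US phone number) filter.
matches = [{"type": "pii", "value": "555-123-4567", "start": 11}]
print(redact("Call me at 555-123-4567 today", matches))
# Call me at ************ today
```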

The string pattern-matching algorithms currently support the following categories:

| Filter Type | Description |
| --- | --- |
| Profanity | Profane words and some phrases (language support: English, Spanish, French, German, Hindi, Portuguese, Vietnamese, Italian) |
| Personally Identifiable Information (PII) | Email addresses, US phone numbers, international phone numbers (currently India only), Social Security numbers, mailing addresses, IP addresses, age (beta) |
| Custom (User-Provided) | Custom user-provided text filters that are applied on each request, managed directly by users in the API dashboard |

Adding Custom Classes to Pattern-Matching

Custom (user-provided) text filters can be configured and maintained directly by the API user from the API Dashboard. A user can have one or more custom classes, and each custom class can have one or more custom text matches containing a list of words that should be matched under that class.

For example:

| Custom Class | Custom Text Matches |
| --- | --- |
| Nationality | American, Canadian, Chinese, ... |
| Race | African American, Asian, Hispanic, ... |

Users can select the Detect Subwords box if they would like any of their custom text matches to be detected whenever they are part of a larger string. For example, if the word class was part of the custom text matches, then the word classification would be flagged as well.
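For intuition, the Detect Subwords toggle behaves like the difference between a substring test and a whole-word test. The helper below is a hypothetical approximation for local experimentation, not Hive's implementation:

```python
import re

def text_match(text: str, term: str, detect_subwords: bool) -> bool:
    """Approximate a custom text match: a plain substring test when
    Detect Subwords is enabled, a whole-word test when it is not."""
    if detect_subwords:
        return term.lower() in text.lower()
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None

print(text_match("classification", "class", detect_subwords=True))   # True
print(text_match("classification", "class", detect_subwords=False))  # False
```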

Character substitution is another feature that Hive offers, which can be especially helpful for flagging leetspeak. A user defines the characters they would like substituted in the left column of the Custom Character Substitution table, then provides a comma-separated list of replacement strings in the right column.

For example:

| Characters to be Substituted | Substitution Strings (comma-separated) |
| --- | --- |
| o | 0, (), @ |
| s | $ |

Based on the previous example, we would flag any input that matches a string in Custom Text Matches after substituting all occurrences of 0, (), and @ with o, and all occurrences of $ with s.
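In effect, the filter normalizes the input with the substitution table before matching. A rough sketch of that normalization step, using the example table above (illustrative only, not Hive's implementation):

```python
# Substitution table from the example: each string on the right-hand side
# is rewritten back to the base character on the left before matching.
SUBSTITUTIONS = {"o": ["0", "()", "@"], "s": ["$"]}

def normalize(text: str) -> str:
    """Rewrite leetspeak variants back to their base characters."""
    for base, variants in SUBSTITUTIONS.items():
        # Replace longer variants first so "()" is handled as a unit.
        for variant in sorted(variants, key=len, reverse=True):
            text = text.replace(variant, base)
    return text

print(normalize("h0t $tuff"))  # hot stuff
print(normalize("l()()k"))     # look
```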

Adding Custom Allowlists

Custom allowlists permit users to prevent our models from flagging certain text content. Each custom allowlist can have one or more allowlist strings that our model will not analyze.

For example:

| Allowlist Name | Allowlist Strings |
| --- | --- |
| URLs | facebook.com, google.com, youtube.com |
| Names | Andrew, Jack, Ben |

Users can select the Detect Subwords box if they would like any of their allowlist strings to be ignored when part of a larger string. For example, if the word cup was part of allowlist strings, then cup would be ignored in cupcakes. If Detect Subwords is selected, another option called Allow Entire Subword will appear. If Allow Entire Subword is selected then in the scenario above the entire word cupcakes would be ignored.
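Combining Detect Subwords with Allow Entire Subword effectively suppresses any word that contains an allowlist string. The helper below is a hypothetical approximation of that combined behavior for local testing:

```python
def is_allowlisted(word: str, allowlist: set, detect_subwords: bool = False) -> bool:
    """Approximate whether a word is excluded from analysis: exact match by
    default; with Detect Subwords + Allow Entire Subword, any word containing
    an allowlist string is ignored wholesale. Illustrative only."""
    if word in allowlist:
        return True
    if detect_subwords:
        return any(term in word for term in allowlist)
    return False

print(is_allowlisted("cupcakes", {"cup"}, detect_subwords=True))   # True
print(is_allowlisted("cupcakes", {"cup"}, detect_subwords=False))  # False
```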

📘

NOTE:

Many widely-used, reputable websites are automatically allowlisted by default. However, you can follow the steps below to allowlist additional domains as needed.

Selecting Detect Inside URLs enables the text filter to ignore any string that is part of the allowlist even when it is part of a URL. For example, if google.com is in the allowlist, it will be ignored when it is part of https://www.google.com/search?q=something. If Detect Inside URLs is selected, another box titled Allow Entire URL appears. When Allow Entire URL is selected, if a URL contains any words in the allowlist, then the entire URL is ignored. Adding domains to an allowlist and selecting the Detect Inside URLs and Allow Entire URL options will prevent URLs with those domains from being flagged as spam.
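The Detect Inside URLs + Allow Entire URL combination roughly amounts to checking whether an allowlisted domain appears in the URL's host. A hedged sketch of that check, not Hive's actual matching logic:

```python
from urllib.parse import urlparse

def url_allowlisted(url: str, allowlist: set) -> bool:
    """Approximate Detect Inside URLs + Allow Entire URL: skip the whole
    URL when an allowlisted domain appears in its host. Illustrative only."""
    host = urlparse(url).netloc.lower()
    return any(domain in host for domain in allowlist)

print(url_allowlisted("https://www.google.com/search?q=something", {"google.com"}))
# True
```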


What’s Next

See the API reference for more details on the API interface and response format.