Text Moderation

Text Moderation Overview

Hive currently offers a suite of text moderation tools that help platforms detect undesirable content, including but not limited to: sexual content, hate, violence, bullying, promotions, and links to external sites. Today, we support text moderation in more than 15 languages, and our text models are also trained to understand the semantic meaning of emojis.

Hive's text content moderation API takes a two-pronged approach to moderate text:

  1. A deep learning-based text classification model to moderate text based on semantic meaning.
  2. Rule-based character pattern-matching algorithms to flag specific words and phrases.

The text classification model is trained on a large proprietary corpus of labeled data spanning multiple domains (including but not limited to social media, chat, and livestreaming apps), and can interpret full sentences with their linguistic subtleties. The pattern-matching algorithms search sentences for a set of predefined patterns commonly associated with harmful content, and users can add their own rules as well.

Both our models and pattern-matching approaches are robust to and explicitly tested on character replacements, character duplication, leetspeak, misspellings, and other common adversarial user behaviors.

By default, the API has a maximum input size of 1024 characters, which can be extended in special situations.
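For inputs longer than the limit, a client can split the text before submission. A minimal sketch in Python (the `chunk_text` helper and its word-boundary strategy are illustrative, not part of Hive's SDK):

```python
MAX_CHARS = 1024  # default limit; extensions are negotiated separately

def chunk_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters,
    breaking on whitespace where possible so words stay intact."""
    chunks = []
    while len(text) > limit:
        split = text.rfind(" ", 0, limit)
        if split <= 0:          # no space found: hard split at the limit
            split = limit
        chunks.append(text[:split])
        text = text[split:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be submitted as a separate request; note that splitting can change classification results when context spans a chunk boundary.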

📘 NOTE: A walkthrough of how to send text to our API and how you might use our model responses can be found in this guide.

Text Classification Models

Multilevel Model

Our multilevel models output different levels of moderation to empower our customers to make more refined moderation decisions. We offer four heads (sexual, hate, violence, bullying) that contain four classes each, plus two heads (spam, and promotions in beta) that contain two classes each. Classes are ordered by severity, ranging from level 3 (most severe) to level 0 (benign). If a head is not supported for a given language, it returns a score of -1.

Sexual

  • 3: Intercourse, masturbation, porn, sex toys and genitalia
  • 2: Sexual intent, nudity and lingerie
  • 1: Informational statements that are sexual in nature, affectionate activities (kissing, hugging, etc.), flirting, pet names, relationship status, sexual insults and rejecting sexual advances
  • 0: The text does not contain any of the above

Hate

  • 3: Slurs, hate speech, promotion of hateful ideology
  • 2: Negative stereotypes or jokes, degrading comments, denouncing slurs, challenging a protected group's morality or identity, violence against religion
  • 1: Positive stereotypes, informational statements, reclaimed slurs, references to hateful ideology, immorality of protected group's rights
  • 0: The text does not contain any of the above

Violence

  • 3: Serious and realistic threats, mentions of past violence
  • 2: Calls for violence, destruction of property, calls for military action, calls for the death penalty outside a legal setting, mentions of self-harm/suicide
  • 1: Denouncing acts of violence, soft threats (kicking, punching, etc.), violence against non-human subjects, descriptions of violence, gun usage, abortion, self-defense, calls for capital punishment in a legal setting, destruction of small personal belongings, violent jokes
  • 0: The text does not contain any of the above

Bullying

  • 3: Slurs or profane descriptors toward specific individuals, encouraging suicide or severe self-harm, severe violent threats toward specific individuals
  • 2: Non-profane insults toward specific individuals, encouraging non-severe self-harm, non-severe violent threats toward specific individuals, silencing or exclusion
  • 1: Profanity in a non-bullying context, playful teasing, self-deprecation, reclaimed slurs, degrading a person's belongings, bullying toward organizations, denouncing bullying
  • 0: The text does not contain any of the above

Spam

  • 3: The text is intended to redirect a user to a different platform.
  • 0: The text does not include the above.

Promotions - Beta

  • 3: Asking for likes/follows/shares, advertising monthly newsletters/special promotions, asking for donations/payments, advertising products, selling pornography, giveaways
  • 0: The text does not include the above.
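The severity levels above can drive tiered moderation logic on the client side. A hedged sketch, assuming a simplified per-head score shape (the real response format is documented in the API reference): it scans each head from most to least severe and treats -1 as an unsupported head.

```python
# Hypothetical response fragment: per-head scores keyed by severity level.
# The actual response schema is defined in the API reference.
scores = {
    "sexual":   {0: 0.92, 1: 0.05, 2: 0.02, 3: 0.01},
    "hate":     {0: 0.10, 1: 0.15, 2: 0.70, 3: 0.05},
    "violence": -1,  # head not supported for the detected language
}

def head_decision(head_scores, threshold=0.5):
    """Return the highest severity level whose score clears `threshold`,
    or None when the head is unsupported (-1) for the language."""
    if head_scores == -1:
        return None
    for level in (3, 2, 1):          # check most severe first
        if head_scores.get(level, 0) >= threshold:
            return level
    return 0

decisions = {head: head_decision(s) for head, s in scores.items()}
```

A platform might, for example, auto-remove content at level 3, queue level 2 for human review, and allow levels 1 and 0.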

Supported Languages

The API returns the classified language for each request using ISO 639-1 language codes. The response indicates which classes were moderated based on the supported languages below: classes for supported languages return standard confidence scores, while classes for unsupported languages return scores of 0. Unsupported languages and non-language inputs, such as code or gibberish, are classified as an 'unsupported' language.

| Language | Sexual | Violence | Hate | Bullying | Promotions | Spam |
| --- | --- | --- | --- | --- | --- | --- |
| English | Model | Model | Model | Model | Model | Pattern Match |
| Spanish | Model | Model | Model | '22 Q2 | | |
| Hindi | Model | '22 Q2 | Model | Model | | |
| French | Model | '22 Q2 | | | | |
| German | Model | | | | | |
| Italian | Pattern Match | | | | | |
| Turkish | Pattern Match | | | | | |
| Chinese | Pattern Match | Pattern Match | Pattern Match | Pattern Match | | |
| Russian | Pattern Match | | | | | |
| Dutch | Pattern Match | | | | | |
| Portuguese | Pattern Match | Model: '22 Q2 | '22 Q2 | Pattern Match | Model: '22 Q2 | '22 Q2 |
| Arabic | Model | '22 Q2 | | | | |
| Korean | Pattern Match | | | | | |
| Japanese | Pattern Match | | | | | |
| Vietnamese | Pattern Match | | | | | |
| Romanian | Pattern Match | | | | | |
| Polish | Pattern Match | | | | | |
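Because unsupported classes return 0 rather than a meaningful score, it can be useful to drop them before applying thresholds so that "no support" is never mistaken for "clean." A sketch assuming a hand-maintained support map (the `SUPPORT` dict below is illustrative and incomplete; the table above is authoritative):

```python
# Assumed support map distilled from the supported-languages table,
# keyed by ISO 639-1 code. Only two entries shown for illustration.
SUPPORT = {
    "en": {"sexual", "violence", "hate", "bullying", "promotions", "spam"},
    "de": {"sexual"},
}

def trusted_scores(language, scores):
    """Keep only scores for heads actually moderated in `language`;
    heads outside the support set return 0 and carry no signal."""
    supported = SUPPORT.get(language, set())
    return {head: s for head, s in scores.items() if head in supported}
```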

Pattern-Matching Algorithms

Hive's string pattern-matching algorithms return the filter type that was matched, the matching substring, and the position of that substring in the text input of the API request.
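Those three pieces of information are enough to redact matches client-side. A sketch assuming hypothetical field names (`type`, `value`, `start`, `end`); the actual response schema is documented in the API reference:

```python
# Hypothetical match records; real field names are in the API reference.
matches = [
    {"type": "pii", "value": "jane@example.com", "start": 11, "end": 27},
]

def redact(text, matches):
    """Replace each matched substring with asterisks, working right to
    left so earlier offsets stay valid after each replacement."""
    for m in sorted(matches, key=lambda m: m["start"], reverse=True):
        text = text[:m["start"]] + "*" * (m["end"] - m["start"]) + text[m["end"]:]
    return text

print(redact("Contact me jane@example.com today", matches))
# -> Contact me **************** today
```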

The string pattern-matching algorithms currently support the following categories:

| Filter Type | Description |
| --- | --- |
| Profanity | Profane words and some phrases |
| Personally Identifiable Information (PII) | Email addresses, US phone numbers, Social Security Numbers, mailing addresses, IP addresses, age (beta) |
| Custom (User-Provided) | Custom user-provided text filters applied at each request, managed directly by users in the API dashboard |

Adding Custom Classes to Pattern-Matching

Custom (user-provided) text filters can be configured and maintained directly by the API user from the API Dashboard. A user can have one or more custom classes, and each custom class contains one or more custom text matches: the list of words and phrases that should be flagged under that class.

For example:

| Custom Class | Custom Text Matches |
| --- | --- |
| Nationality | American, Canadian, Chinese, ... |
| Race | African American, Asian, Hispanic, ... |

Users can select the Detect Subwords box if they would like any of their custom text matches to be detected whenever they are part of a larger string. For example, if the word class was part of the custom text matches, then the word classification would be flagged as well.

Character substitution is another Hive feature that is especially helpful for flagging leetspeak. In the Custom Character Substitution table, a user defines the characters to be substituted in the left column and a comma-separated list of replacement strings in the right column.

For example:

| Characters to be substituted | Substitution Strings (comma-separated) |
| --- | --- |
| o | 0,(),@ |
| s | $ |

Based on the previous example, we would flag any input that matches a string in Custom Text Matches after substituting all occurrences of 0, (), and @ with o, and all occurrences of $ with s.

Adding Custom "Allowlists"

Custom allowlists let users exempt text from moderation so that our models do not flag it. Each custom allowlist can have one or more allowlist strings that our models will not analyze.

For example:

| Allowlist Name | Allowlist Strings |
| --- | --- |
| URLs | facebook.com, google.com, youtube.com |
| Names | Andrew, Jack, Ben |

Users can select the Detect Subwords box if they would like any of their allowlist strings to be ignored when they appear as part of a larger string. For example, if the word cup was among the allowlist strings, then cup would be ignored in cupcakes. When Detect Subwords is selected, another option called Allow Entire Subword appears; if it is selected, then in the scenario above the entire word cupcakes would be ignored.

Additional functionality is available for allowlisting URLs. If Detect Inside URLs is selected, any string in the allowlist will be ignored whenever it appears as part of a URL. For example, if google.com is in the allowlist, it will be ignored when it is part of https://www.google.com/search?q=something. When Detect Inside URLs is selected, another box titled Allow Entire URL appears; when this is also selected, any URL that contains an allowlisted string will be ignored in its entirety. Adding domains to an allowlist and selecting both Detect Inside URLs and Allow Entire URL prevents URLs with those domains from being flagged as spam.
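The combined effect of Detect Inside URLs and Allow Entire URL can be approximated client-side as follows (a rough sketch: the URL regex and `strip_allowed_urls` helper are illustrative and much simpler than Hive's actual matching):

```python
import re

ALLOWLIST = {"google.com", "facebook.com", "youtube.com"}
# Naive URL matcher: explicit scheme, or a bare domain-like token.
URL_PATTERN = re.compile(r"https?://\S+|\b[\w.-]+\.\w{2,}(?:/\S*)?")

def strip_allowed_urls(text, allow_entire_url=True):
    """Remove URLs containing an allowlisted domain before analysis,
    approximating 'Detect Inside URLs' plus 'Allow Entire URL'."""
    def replace(match):
        url = match.group(0)
        if any(domain in url for domain in ALLOWLIST):
            return "" if allow_entire_url else url
        return url
    return URL_PATTERN.sub(replace, text)
```

With this preprocessing, an allowlisted link such as https://www.google.com/search?q=something never reaches the spam classifier, while other URLs pass through untouched.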


What’s Next

See the API reference for more details on the API interface and response format.