Text Moderation

Text Moderation Overview

Hive's text content moderation API takes a two-pronged approach to moderating text:

  1. A deep learning-based text classification model to moderate text based on semantic meaning.
  2. Rule-based character pattern-matching algorithms to flag specific words and phrases.

The text classification model is trained on a large proprietary corpus of labeled data spanning multiple domains (including but not limited to social media, chat, and livestreaming apps), and is able to interpret full sentences along with their linguistic subtleties. The pattern-matching algorithms search the input for a set of predefined patterns that are commonly associated with harmful content. Users can also add their own rules on top of these defaults.

Both our models and pattern-matching approaches are robust to and explicitly tested on character replacements, character duplication, leetspeak, misspellings, and other common adversarial user behaviors.
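
As a rough illustration, a Python request to the API might look like the sketch below. The endpoint URL, Token authorization scheme, and text_data field shown here are assumptions based on the synchronous task API; confirm the exact interface against the API reference.

```python
# Minimal sketch of submitting text for moderation. The endpoint, the
# "Token" auth scheme, and the "text_data" field are assumptions here;
# confirm them against the API reference.
import requests

API_KEY = "your_api_key_here"                          # project key from the dashboard
ENDPOINT = "https://api.thehive.ai/api/v2/task/sync"   # assumed synchronous endpoint

def moderate_text(text: str) -> dict:
    """Send one piece of text to the moderation API and return the parsed JSON."""
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Token {API_KEY}"},
        data={"text_data": text},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(moderate_text("example input text"))
```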

Text Classification Models

Multilevel Model

Our multilevel models output a severity level for each moderation head, empowering customers to make more refined moderation decisions. We offer four heads (sexual, hate, violence, bullying) that contain four classes each and one head (spam) that contains two classes. Classes are ordered by severity, ranging from level 3 (most severe) to level 0 (benign). If a head is not supported for a given language, it will return a score of -1. A short sketch of acting on these levels follows the class lists below.

Sexual

  • 3: Intercourse, masturbation, porn, sex toys and genitalia
  • 2: Sexual intent, nudity and lingerie
  • 1: Informational statements that are sexual in nature, affectionate activities (kissing, hugging, etc.), flirting, pet names, relationship status, sexual insults and rejecting sexual advances
  • 0: The text does not contain any of the above

Hate

  • 3: Slurs, hate speech, promotion of hateful ideology
  • 2: Negative stereotypes or jokes, degrading comments, denouncing slurs, challenging a protected group's morality or identity, violence against religion
  • 1: Positive stereotypes, informational statements, reclaimed slurs, references to hateful ideology, immorality of protected group's rights
  • 0: The text does not contain any of the above

Violence

  • 3: Serious and realistic threats, mentions of past violence
  • 2: Calls for violence, destruction of property, calls for military action, calls for the death penalty outside a legal setting, mentions of self-harm/suicide
  • 1: Denouncing acts of violence, soft threats (kicking, punching, etc.), violence against non-human subjects, descriptions of violence, gun usage, abortion, self-defense, calls for capital punishment in a legal setting, destruction of small personal belongings, violent jokes
  • 0: The text does not contain any of the above

Bullying

  • 3: Slurs or dehumanizing insults, encouraging suicide
  • 2: Derogatory or profane descriptors, silencing or exclusion
  • 1: Soft insults, self-deprecation, reclaimed slurs, degrading a person's belongings, denouncing bullying
  • 0: The text does not contain any of the above

Spam

  • 3: The text is intended to redirect a user to a different platform.
  • 0: The text does not include the above.

Promotion - Beta

  • 3: Asking for likes/follows/shares, advertising monthly newsletters/special promotions, asking for donations/payments, advertising products, selling pornography, giveaways
  • 0: The text does not include the above.
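
Because each head returns an ordered severity level, downstream handling usually reduces to comparing that level against a per-head cutoff. The sketch below assumes the per-head levels have already been extracted from the API response into a plain dictionary (see the API reference for the exact response shape); the cutoffs themselves are placeholders, not recommendations.

```python
# Sketch of acting on multilevel severity scores, assuming the per-head levels
# have been extracted into a dict such as {"sexual": 2, "hate": 0, "violence": -1, ...}.
# A level of -1 means the head is not supported for the detected language.
SEVERITY_CUTOFFS = {   # placeholder policy: flag at these levels or above
    "sexual": 2,
    "hate": 2,
    "violence": 2,
    "bullying": 2,
    "spam": 3,
}

def heads_to_flag(levels: dict[str, int]) -> list[str]:
    """Return the heads whose severity level meets or exceeds its cutoff."""
    flagged = []
    for head, cutoff in SEVERITY_CUTOFFS.items():
        level = levels.get(head, -1)
        if level == -1:          # head unsupported for this language; skip it
            continue
        if level >= cutoff:
            flagged.append(head)
    return flagged

print(heads_to_flag({"sexual": 3, "hate": 1, "violence": -1, "bullying": 0, "spam": 0}))
# -> ['sexual']
```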

Binary Model (Deprecated)

Classification models can be multi-headed, where each group of mutually exclusive model classes belongs to a single model head. For example, when a text is run through Hive's text moderation model, one head might classify violent vs. non-violent text, while another classifies sexually explicit vs. non-sexual text.

This concept is illustrated below. This imaginary model has two heads:

  • Sexual text classification: sexual, suggestive, no_sexual
  • Violent text classification: violence, no_violence

The confidence scores for each model head would sum to 1.
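
For this imaginary model, a single prediction could be represented as the sketch below: each head carries its own score distribution, the scores within a head sum to 1, and the highest-scoring class is that head's prediction.

```python
# Hypothetical output for the imaginary two-head model above. Scores within
# each head sum to 1, and the highest-scoring class is that head's prediction.
prediction = {
    "sexual": {"sexual": 0.05, "suggestive": 0.15, "no_sexual": 0.80},
    "violence": {"violence": 0.90, "no_violence": 0.10},
}

for head, scores in prediction.items():
    assert abs(sum(scores.values()) - 1.0) < 1e-6   # per-head scores sum to 1
    top_class = max(scores, key=scores.get)
    print(f"{head}: {top_class} ({scores[top_class]:.2f})")
```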

Hive’s text classification model is a multi-head classifier that determines the language of the input, as well as whether the text falls into sexual, hateful, violent, bullying, or spam categories. By default, it has a max input size of 1024 characters, which can be extended in special situations.

Sexual

Sexual Head:

  • sexual: the text is sexually explicit (discussion of genitals, sexual intercourse, sex toys, etc.)
  • suggestive: the text is sexually suggestive (pet names, compliments on attractiveness, etc.)
  • no_sexual: the text does not include the above.

Violence

Violence Head:

  • violence: the text contains direct violent threats in the second person.
  • self_harm (beta version): the text depicts harming oneself.
  • no_violence: the text does not include the above.

Incitement

Incitement Head:

  • yes_incitement: the text incites or encourages acts of violence.
  • no_incitement: the text does not incite or promote acts of violence.

Bullying

Bullying Head:

  • yes_bullying: the text is intended to frighten, intimidate, or hurt another individual (physically, mentally, or emotionally).
  • no_bullying: the text does not include the above.

Hate

Hate Head:

  • yes_hate: the text is hateful towards another person or group based on protected attributes, such as religion, nationality, race, sexual orientation, gender, sex, etc.
  • no_hate: the text does not include the above.

Spam

Spam Head:

  • yes_spam: the text is intended to redirect a user off the website to a different platform.
  • no_spam: the text does not include the above.

Supported Languages

The API returns the classified language for each request, using ISO 639-1 language codes. The response indicates which classes were moderated based on the supported languages below. Classes for supported languages return standard confidence scores, while classes for unsupported languages return a score of 0. Unsupported languages and non-language inputs (such as code or gibberish) are classified as an 'unsupported' language.

| Language   | Sexual                | Violence              | Hate                  | Bullying              | Promotion | Spam          |
| ---------- | --------------------- | --------------------- | --------------------- | --------------------- | --------- | ------------- |
| English    | Model + Pattern Match | Model + Pattern Match | Model + Pattern Match | Model + Pattern Match | Model     | Pattern Match |
| Spanish    | Model + Pattern Match | '22 Q1                | Pattern Match         | '22 Q1                |           |               |
| Hindi      | '22 Q1                | '22 Q1                | '22 Q1                | '22 Q1                |           |               |
| French     | Model + Pattern Match |                       |                       |                       |           |               |
| German     | Model + Pattern Match |                       |                       |                       |           |               |
| Italian    | Pattern Match         |                       |                       |                       |           |               |
| Turkish    | Pattern Match         |                       |                       |                       |           |               |
| Mandarin   | Pattern Match         |                       |                       |                       |           |               |
| Russian    | Pattern Match         |                       |                       |                       |           |               |
| Dutch      | Pattern Match         |                       |                       |                       |           |               |
| Portuguese | Pattern Match         | '22 Q1                | Pattern Match         | '22 Q1                |           |               |
| Arabic     | Model + Pattern Match |                       |                       |                       |           |               |
| Korean     | Pattern Match         |                       |                       |                       |           |               |
| Japanese   | Pattern Match         |                       |                       |                       |           |               |
| Vietnamese | Pattern Match         |                       |                       |                       |           |               |
| Romanian   | Pattern Match         |                       |                       |                       |           |               |
| Polish     | Pattern Match         |                       |                       |                       |           |               |
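
In practice this means responses should be interpreted per language: classes outside a language's supported set carry no signal. A minimal sketch, assuming the detected language code and a dictionary of class scores have already been extracted from the response (the support map shown is a trimmed illustration, not the full table above):

```python
# Sketch of filtering class scores by language support. Assumes the detected
# ISO 639-1 code and a {class_name: score} dict have already been extracted
# from the response; the support map is a trimmed illustration only.
MODEL_SUPPORT = {
    "en": {"sexual", "violence", "hate", "bullying", "promotion"},
    "es": {"sexual"},
    "fr": {"sexual"},
    "de": {"sexual"},
    "ar": {"sexual"},
}

def usable_scores(language: str, scores: dict[str, float]) -> dict[str, float]:
    """Keep only the classes the model actually supports for this language."""
    supported = MODEL_SUPPORT.get(language, set())
    return {name: score for name, score in scores.items() if name in supported}

print(usable_scores("es", {"sexual": 0.91, "violence": 0.0, "hate": 0.0}))
# -> {'sexual': 0.91}
```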

Choosing Thresholds

For each of the classes mentioned above, you will need to set thresholds that determine when to take action on our model results. For optimal results, we recommend running a threshold analysis on a natural distribution of your own data (contact Hive at the email below for help with this). As a general starting point, a model confidence score above 0.85 is a reasonable cutoff for flagging text in any class of interest. For questions on best practices, reach out to your point of contact at Hive or email [email protected] to contact our API team directly.
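
For example, a simple starting policy might compare each class score against a single 0.85 cutoff and escalate anything above it. The sketch below assumes the scores have already been extracted into a dictionary; the cutoff should ultimately come from a threshold analysis on your own traffic.

```python
# Sketch of a simple thresholding policy. 0.85 is only a starting point and
# should be tuned per class with a threshold analysis on your own data.
DEFAULT_THRESHOLD = 0.85

def classes_over_threshold(scores: dict[str, float],
                           threshold: float = DEFAULT_THRESHOLD) -> list[str]:
    """Return the classes whose confidence meets or exceeds the threshold."""
    return [name for name, score in scores.items() if score >= threshold]

print(classes_over_threshold({"yes_bullying": 0.92, "yes_hate": 0.40}))
# -> ['yes_bullying']
```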

Pattern-Matching Algorithms

Hive's string pattern-matching algorithms return the filter type that was matched, the substring that matched the filter, and the position of that substring within the text input from the API request.

The string pattern-matching algorithms currently support the following categories:

| Filter Type | Description |
| --- | --- |
| Profanity | Profane words and some phrases |
| Personally Identifiable Information (PII) | Email addresses, phone numbers, Social Security Numbers, mailing addresses, and IP addresses |
| Custom (User-Provided) | Custom user-provided text filters that are applied to each request, managed directly by users in the API dashboard |
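
Because every pattern-match hit carries the same three fields (filter type, matched substring, and position), hits can be handled generically. The sketch below assumes the hits have already been parsed out of the API response into simple records; the field names used here are illustrative.

```python
# Sketch of consuming pattern-match results, assuming each hit has been parsed
# into (filter_type, substring, start) records; the field names are illustrative.
from typing import NamedTuple

class PatternHit(NamedTuple):
    filter_type: str   # e.g. "profanity", "pii", or a custom class name
    substring: str     # the text that matched
    start: int         # character offset of the match in the input

def redact(text: str, hits: list[PatternHit]) -> str:
    """Replace every matched substring with asterisks of the same length."""
    redacted = text
    for hit in sorted(hits, key=lambda h: h.start, reverse=True):
        end = hit.start + len(hit.substring)
        redacted = redacted[:hit.start] + "*" * len(hit.substring) + redacted[end:]
    return redacted

hits = [PatternHit("pii", "[email protected]", 11)]
print(redact("contact me [email protected] today", hits))
# -> "contact me ************* today"
```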

Adding Custom Classes to Pattern-Matching

Custom (user-provided) text filters can be configured and maintained directly by the API user from the API Dashboard. A user can have one or more custom classes, and each custom class can have one or more custom text matches: a list of words and phrases that should be matched under that class.

For example:

| Custom Class | Custom Text Matches |
| --- | --- |
| Nationality | American, Canadian, Chinese, ... |
| Race | African American, Asian, Hispanic, ... |

Users can select the Detect Subwords box if they would like any of their custom text matches to be detected whenever they are part of a larger string. For example, if the word class was part of the custom text matches, then the word classification would be flagged as well.

Character substitution is another feature Hive offers that can be especially helpful for catching leetspeak. A user defines the characters they would like substituted in the left column of the Custom Character Substitution table, then provides a comma-separated list of replacement strings in the right column.

For example:

| Characters to be substituted | Substitution Strings (comma-separated) |
| --- | --- |
| o | 0,(),@ |
| s | $ |

Based on the example above, we would flag any input that matches a string in Custom Text Matches after substituting all occurrences of 0, (), and @ with o and all occurrences of $ with s.
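
Conceptually, the substitution step normalizes the input before it is compared against your custom text matches. The sketch below is only an illustration of that behavior under the example configuration above; it is not Hive's implementation, and the custom match "boss" is hypothetical.

```python
# Illustrative-only sketch of how character substitution interacts with custom
# text matches; it mimics the behavior described above and is not Hive's
# implementation. The custom match "boss" is hypothetical.
SUBSTITUTIONS = {"o": ["0", "()", "@"], "s": ["$"]}   # from the table above
CUSTOM_TEXT_MATCHES = {"boss"}

def normalize(text: str) -> str:
    """Map each substitution string back to its base character."""
    normalized = text.lower()
    for base_char, variants in SUBSTITUTIONS.items():
        for variant in variants:
            normalized = normalized.replace(variant, base_char)
    return normalized

def matches_custom_class(text: str) -> bool:
    return any(word in normalize(text).split() for word in CUSTOM_TEXT_MATCHES)

print(matches_custom_class("b()$$"))   # True: "b()$$" normalizes to "boss"
```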

Adding Custom "Allowlists"

Custom allowlists permit users to whitelist text and prevent our models from flagging that content. Each custom allowlist can have one or more allowlist strings that our model will not analyze.

For example:

| Allowlist Name | Allowlist Strings |
| --- | --- |
| URLs | facebook.com, google.com, youtube.com |
| Names | Andrew, Jack, Ben |

Users can select the Detect Subwords box if they would like any of their allowlist strings to be ignored when part of a larger string. For example, if the word cup were among the allowlist strings, then cup would also be ignored inside cupcakes. If Detect Subwords is selected, another option called Allow Entire Subword appears; if it is selected, then in the scenario above the entire word cupcakes would be ignored.

Additional functionality is available for allowlisting URLs. If Detect Inside URLs is selected, any string in the allowlist will be ignored as long as it appears inside a URL. For example, if google.com is in the allowlist, it will be ignored when it is part of https://www.google.com/search?q=something. If Detect Inside URLs is selected, another box titled Allow Entire URL appears. When Allow Entire URL is selected, a URL that contains any allowlisted string will be ignored in its entirety. Adding domains to an allowlist and selecting both Detect Inside URLs and Allow Entire URL will prevent URLs with those domains from being flagged as spam.
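
The sketch below illustrates how the Allow Entire URL behavior can be reasoned about: if any allowlisted string appears inside a URL, the whole URL is ignored. It is an illustration of the option described above, not Hive's implementation, and the domains shown are examples.

```python
# Illustrative-only sketch of the "Allow Entire URL" behavior described above:
# if a URL contains any allowlisted string, the entire URL is ignored before
# further checks. The domains below are examples.
import re

ALLOWLISTED_DOMAINS = {"google.com", "youtube.com"}
URL_PATTERN = re.compile(r"https?://\S+")

def strip_allowlisted_urls(text: str) -> str:
    """Remove URLs containing an allowlisted domain so they are not flagged."""
    def replace(match: re.Match) -> str:
        url = match.group(0)
        return "" if any(domain in url for domain in ALLOWLISTED_DOMAINS) else url
    return URL_PATTERN.sub(replace, text)

print(strip_allowlisted_urls("check https://www.google.com/search?q=something now"))
# -> "check  now"
```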


What’s Next

See the API reference for more details on the API interface and response format.