Detailed Class Descriptions – Text Moderation

Overview

This page gives a more comprehensive description of Hive’s Text Moderation classes. If you need more detail or have further questions about certain types of content after reading the main Text Moderation page, look here. We’ll describe each model head as clearly as possible and provide examples of what it covers, including multi-level classes where applicable.

All platforms have different moderation requirements and risk sensitivities, so we recommend that you consult these descriptions carefully as you decide which classes to moderate (and at what severity).

📘

NOTE:

To determine which classes and severity levels cover specific types of text, it may be helpful to search this page (Ctrl/Cmd + F) with relevant terms. You can also use the sidebar on the right to navigate to classes of interest.

General Notes

Before looking at subject-matter breakdowns for each class, it’s helpful to understand the following:

  1. Hive’s text classifier is multi-headed. Classifications from each model head (e.g., sexual, hate, violence, etc.) are made independently and returned together. If a message scores highly in multiple classes, it meets our definition of each class.
  2. Some model heads define four classes that capture different severity levels. In the API response, these multi-level classes are represented by severity scores spanning integer values from 0 (benign) to 3 (most severe). Other model heads – spam, child exploitation, promotions, and phone numbers – are binary. For binary model heads, content will be classified as either Level 3 or Level 0. In these cases, Level 3 simply means that the relevant content is present; it is not intended to be an indicator of severity.
  3. Just because a message is classified as Level 0 for one model head does not mean the content is clean. For this reason, we will describe the Level 0 classes with a non-exhaustive list of subject matter that is not captured by the higher severity classes but that may help distinguish between what is and is not flagged (e.g., borderline content, content captured by other classes).
  4. Usernames that use intelligible words and phrases (including character replacements) generally follow the same rules as messages and other text strings.
  5. The lists given in these subject matter breakdowns are not exhaustive, and the context around how a particular word or phrase is used can have a significant impact on the classification. We’ll provide descriptive examples and rules around contextual factors where relevant, but you can also use our web demo to get a rough impression for other cases of interest.
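Putting these notes together, a minimal sketch of consuming a multi-headed response might look like the following. The response shape, class names, and thresholds here are illustrative assumptions rather than the exact API schema; consult the API reference for actual field names.

```python
# Hypothetical example: interpreting per-head severity scores from a
# text moderation response. The dictionary keys and response structure
# below are assumptions for illustration, not the exact API schema.

# Per-platform policy: the minimum severity level (0-3) at which each
# model head should trigger moderation. Binary heads (e.g.,
# child_exploitation) only ever return 0 or 3, so any threshold works.
POLICY = {
    "sexual": 2,              # tolerate Level 1 (pet names, relationship talk)
    "hate": 1,                # flag even "controversial" content
    "violence": 2,
    "bullying": 2,
    "child_exploitation": 3,  # binary head: 3 simply means "present"
}

def flagged_heads(scores: dict, policy: dict) -> list:
    """Return the model heads whose severity meets or exceeds the
    platform's per-head threshold. Heads classify independently, so a
    single message can be flagged under several classes at once."""
    return [head for head, level in scores.items()
            if level >= policy.get(head, 4)]  # 4 = never flag unknown heads

# Example message scored highly by several independent heads at once.
scores = {"sexual": 0, "hate": 3, "violence": 2, "bullying": 3,
          "child_exploitation": 0}
print(flagged_heads(scores, POLICY))  # ['hate', 'violence', 'bullying']
```

Because each head is thresholded separately, a platform can moderate aggressively on one class (e.g., hate at Level 1) while tolerating borderline content in another.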

Sexual Model Head

This is the main model head for flagging sexual messages and text content like solicitation, descriptions of sexual activity, and mentions of pornography. Slang that has a sexual connotation and certain emojis will also be flagged if there is enough context to infer sexual meaning.

Here’s a breakdown of subject matter captured by each class/severity level.

Level 3: Sexually Explicit

Captures text that explicitly references sexual activity, genitalia, and pornography. Text will also be classified as Level 3 if it explicitly refers to illegal or non-consensual activity. Specifically, Level 3 flags the following cases:

-Phrases that explicitly reference sexual intercourse or sexual activity, or that indirectly reference it with high confidence, including:

  • Sexual intercourse or penetration of any kind
  • Oral sex
  • Anal sex
  • Orgasm or sexual bodily fluids

-Phrases that explicitly reference or describe an individual masturbating, including:

  • Fingering
  • Jerking off and similar terms
  • Orgasm and sexual bodily fluids (see above)
  • Other references to masturbation e.g., brb gonna go stroke my wood
  • Note: Mentioning masturbation in an informational context without reference to a particular individual or act is classified as Level 1

-Phrases that mention or discuss pornography, porn websites, porn actors, or cam sites

-Phrases that mention or describe the use of sex toys by a person, including:

  • Dildos, fleshlights, vibrators, buttplugs, etc.
  • Common objects used as sex toys if there is context to conclude sexual intent
  • Note: Sex toys that are simply described without referring to individual use are classified as Level 1.

-Phrases that explicitly suggest sexual desire or turn-ons, including:

  • “Horny” and similar words
  • “Hard,” “wet,” and similar words, if context is sufficient to infer sexual meaning
  • “MILF” and variants

-Phrases that include explicit mentions of genitals, sexual body parts, and sexual bodily fluids (including slang and synonyms)

  • This includes emojis commonly used to substitute for words describing body parts and sexual acts
  • “Butt” and synonyms are classified as Level 3 if used in the context of anal sex or penetration, typically Level 2 otherwise.
  • “Dick,” “pussy,” “asshole,” etc. are classified as Level 1 if used as an insult or in non-sexual contexts. Similarly, if “balls” is used as a colloquialism for courage, this would also be classified as Level 1 (see Sexual Level 1 description for examples)

-Phrases that use “whore” or “slut” to describe a person; the context does not necessarily need to be sexual

-Phrases that explicitly describe, condone, promote, or threaten non-consensual or unlawful sexual activity with enough context to infer sexual meaning, including:

  • Rape and molestation
  • Incest
  • Pedophilia
  • Prostitution
  • Bestiality

Level 2: Sexual

This class captures text that references sexual content, like nudity, and suggestive messages that are not necessarily explicit or that use tamer language. Specifically, the Level 2 class will flag:

-Phrases that explicitly mention nudity, but do not reference genitalia, breasts, etc., including:

  • References to being nude or undressing
  • Requesting or mentioning sending or receiving nude photos (note: the reference to nude photos must be explicit)
  • References to strippers or strip clubs

-Phrases that reference foreplay or kissing, licking, or sucking non-genital body parts (feet, face, neck, inner thighs, etc.)

-Phrases that implicitly suggest sexual desire or sexual intent ("turned on," "aroused," etc.)

-Phrases that comment suggestively on frequently sexualized body parts of a particular person (e.g., usually accompanied by names or pronouns like “your,” “her,” “his,” etc.).

  • This always includes “butt,” “ass,” and relevant synonyms and emojis, but can also include legs or lips if the accompanying language is suggestive enough, e.g., your legs are so sexy
  • Note: phrases that do the same for genitals and breasts are classified as Level 3

-Phrases that reference lingerie or revealing underwear (jockstraps, thongs, panties, etc.) if mentioned in connection with a wearer or human subject (usually accompanied by names or pronouns)

  • If lingerie or underwear is simply described with no clear reference to a person (e.g., a description of a product or a general comment), this would instead be classified as Level 1, e.g., that store sells really comfortable bras

-Phrases that use “ho,” “hoe,” or “thot” to refer to a person or group of people

Level 1: Potentially Sexual

This class captures references to sexuality and relationships that are not suggestive or explicit, like pet names, relationship status, and non-sexual compliments on appearance. Depending on context, Level 1 can also include non-sexual uses of potentially sexual words (e.g., profanity, insults), as well as sexual subject matter discussed in an informational, medical, or educational setting.

📘

USAGE:

Most of the content flagged by the Level 1 class would likely be considered benign by adults, but may not be appropriate for children.

Specifically, the Level 1 class will flag:

-Phrases that mention affectionate activities besides sex, masturbation, and foreplay, including:

  • Kissing, making out
  • Cuddling
  • Hooking up if there is not enough context to tell that the phrase refers to sex
  • Note: If “hook up” or similar phrases clearly refer to sex based on context, this would instead be classified as Level 3. For example: we hooked up last night but the condom broke

-Phrases that include innocent or non-sexual flirting. This includes:

  • Compliments on a person’s physical appearance or non-sexualized attributes, e.g., your dimples are really cute
  • Lovey or sweet messages between romantic partners
  • Phrases that include pet names in a flirtatious way ("cutie," "babe," "sweetie")
  • Note: “Sexy” will be classified as at least Level 1, even when used to describe objects or non-sexual actions, e.g., that was a sexy shot

-Phrases that discuss level of sexual experience or relationship status

-Phrases that mention sexuality or sexual orientation in a casual, non-suggestive way

-Non-sexual uses of words that would otherwise be sexual, including:

  • “Dick,” “cock,” “pussy,” and similar terms when used as an insult or to convey weakness or disrespect
  • “Fucker” or “motherfucker” when used as an insult or to describe a person in a non-sexual context
  • “Fuck me/him/her/them” if used without additional/sexual context
  • “Balls” when used as slang for courage
  • “Slut” or “whore” when used to convey attention-seeking behavior or fondness for non-human and non-sexual objects e.g., i hate all these self absorbed social media whores, i’m such a slut for chipotles queso
  • “Porn” when used in non-sexual contexts, like describing other types of photos/videos (e.g., Earth porn, food porn)

-Potentially sexual subject matter if discussed in an informational, educational, or medical context, including:

  • Medical procedures and conditions, STDs, certain medications (e.g., Viagra, Plan B), certain types of contraception (e.g., condoms)
  • Discussions of human anatomy, sex, or reproductive health in medical contexts
  • Discussion of strip clubs, pornography, and sex workers in an informational or news context, e.g., Porn stars made over $1.2B in pay in 2019
  • Discussion of sperm banks, sperm donations, and animal sperm in breeding contexts
  • Discussion of sex crimes or illegal sex acts in a news context if discussed purely matter-of-fact and without actual description of sexual activity, e.g., sexual assault has become a growing problem on college campuses

-Sexual terms used as a figure of speech and not directed towards a human subject, e.g., I got absolutely railed by that exam

-Instances where the speaker is:

  • Rejecting a sexual advance
  • Rejecting, expressing disapproval of, or expressing non-consent to any mentioned sexual subject matter, e.g., i’m not gonna send nudes lol

Level 0: Non-Sexual

Messages that do not contain sexual or suggestive content or language are classified as Level 0 (benign). This includes some cases where potentially sexual words are used as a figure of speech, in other contexts, or when there is not enough context to infer sexual meaning or intent. For clarity, content that is not flagged as Level 1, Level 2, or Level 3 includes:

  • Phrases that mention hugging or non-romantic or parental affection
  • “Breast” when referring to food (e.g., turkey breast, chicken breast)
  • Internal reproductive organs (e.g., uterus, ovaries)
  • Pregnancy
  • The word “sex” when used in the context of gender, e.g., my sex is female
  • “Beautiful,” “hot,” “cute” etc. used to describe or compliment objects, things, or oneself e.g., I look HOT today
  • Insults or figures of speech using (non-sexual body) parts
  • “Fuck” used as profanity or as an insult without sexual meaning (this is flagged as profanity and/or bullying instead)
  • “Naked” or “nude” when not describing a person or used metaphorically, e.g., I feel naked without my headphones
  • “Suck” used descriptively or as an insult, e.g., You suck, their pizza fucking sucks, suck it!

Hate Model Head

This is our main model for flagging hate speech directed at minorities and protected groups, including slang, slurs, and hateful use of emojis (e.g., different skin tones) depending on context.

🚧

NOTE:

Hateful or discriminatory language targeted at a specific individual or a specific group of people (e.g., a group of friends or classmates) is typically flagged as bullying instead of or in addition to hate.

Level 3: Hate Speech

This class flags language that is overtly discriminatory, threatening, directed, and/or violent. Slurs, demeaning terms, comparisons of minority groups to animals, and hateful ideology are all categorized as Level 3. The scope of content covered by Level 3 aligns roughly with the legal definition of hate speech against protected groups (including racial minorities, religious minorities, LGBT+ people, and women), although slurs or demeaning terms directed at white people are also covered. Immigration status and nationality are also considered sensitive. Specifically, the Level 3 class will flag:

  • Phrases that call for or justify violence against a particular group
  • Phrases that describe a particular group as physically or morally inferior
  • Phrases that describe a particular group as criminals
  • Phrases that refer to a particular group as animals, sub-human, or non-human. Also applies to words like “trash,” “scum,” and “savages”
  • Slurs of any kind, including variants, intentional misspellings, and character substitutions
  • Support or promotion of hateful ideology or hate groups, including:
  1. Content that speaks positively about symbols, slogans, gestures, groups, policies, or individuals that are explicitly hateful (Nazism, the Ku Klux Klan and other white power groups, hateful policies or atrocities toward protected groups)
  2. Content that denies or makes light of well-documented atrocities or violent events against minority groups or individuals (Holocaust or other genocide denial, hate crimes, lynchings, etc.)

Level 2: Hateful

This class flags language that is a bit softer than hate speech but might exclude, silence, intimidate, or dissuade individuals from participating or feeling safe in a discussion or online community based on race, gender, sexuality, or religion. Specifically, the Level 2 class flags:

  • Phrases that perpetuate negative stereotypes, negative descriptions of different cultural practices, and references to specific physical traits used to stereotype groups
  • Statements that target protected groups on the basis of religious or moral beliefs, e.g., homosexuality is a sin
  • Certain slurs (e.g., "gay," "retarded") when used as casual insults rather than to target someone in the relevant protected group
  • Denying an individual’s gender identity (references to trans, non-binary, genderfluid etc. identities must be explicit)
  • Advocating for the removal of a protected group’s civil rights or legal protections
  • Promoting, condoning, or justifying exclusion, discrimination, or inequality on the basis of protected characteristics, e.g., it’s wrong for women to be the breadwinner, that should be a man’s job
  • Advocating violence against or destruction of religious texts, religious symbols, or places of worship
  • Language that is not explicitly hate speech, but degrades or implies lesser status of a protected group
  • Language that denounces, rejects, or criticizes the slurs flagged by Level 3 (if the message includes the word itself)

Level 1: Controversial

Generally, this class flags language around protected groups that might imply prejudice, bring up negative connotations, or provoke controversy or conflict. Content flagged as Level 1 would also generally be considered inappropriate for children. Specifically, Level 1 flags:

  • Statements challenging the validity of minority status, e.g., asians might as well be white

  • Neutral or negative references to hateful ideology, hate symbols, slogans, gestures, groups, or policies that are understood as hateful, e.g., Police are the 21st century KKK. Note: Statements that express support are instead classified as Level 3 (see above)

  • Controversial topics mentioned in connection with a particular group (even in an informational manner), including crime, politics, voting, morality, intellect, educational achievement, work ethic, health, incarceration, privilege, reproductive health, colonization, cultural or religious attire, censorship & free speech, and civil rights. For example: gay men are 30 times more likely to get AIDS, Blacks make up half the prison population, white privilege doesn’t exist

  • Statements that defend against negative stereotypes or denounce slurs

  • Statements that imply preference for some groups over others, e.g., I’m not attracted to Asians tbh

  • Statements that reference discrimination or inequities faced by a particular group, e.g., he’s only being attacked because he’s black

  • Controversial or vulgar terms related to a protected group when used in a non-hateful context and/or not directed toward an individual. For example, using “gay” or “retarded” with a negative connotation in references to things, actions, places. E.g., math is gay, that movie was retarded

  • Colloquial, humorous, or race-neutral uses of the spelling “nigga.” Context is very important – if any version of the word is used as a slur, or in a racialized context, it is automatically classified as Level 3.

  • Statements that suggest or imply affiliation with or support for a hate group, even if not serious, e.g., I’ll be partying with Hitler in hell

Level 0: Not hateful or controversial

Messages that do not contain hateful, offensive, or potentially controversial content related to protected groups or identity are classified as Level 0. To be clear, the following would not be flagged by Level 1, Level 2, or Level 3:

  • Identity, race, religious affiliation, etc. used in neutral or descriptive contexts, e.g., I’m black, my friend Jason is Jewish
  • Informational statements about protected groups in non-controversial contexts, e.g., Muslims celebrate Ramadan this month
  • Profanity that has a gendered undertone (“slut,” “bitch,” “whore”) when no other hateful content is present. Note: these words may instead be flagged as bullying or sexual, depending on context

Violence Model Head

This model head identifies text content that mentions violence, including threats toward an individual or group, encouraging or calling for violence, descriptions of past violence, self-harm, and other topics.

Here’s a breakdown of subject matter captured by each class/severity level:

Level 3: Serious Threats

This class flags violent threats that are explicitly malicious, intentional, realistic, and also involve severe physical or sexual violence. Level 3 covers threats where the speaker is directly threatening the violence or issuing a command or direct call for the violence. Actions that are severe enough to be flagged as Level 3 are:

  • Stabbing
  • Beating
  • Torture
  • Shooting
  • Kidnapping
  • Rape
  • Hanging
  • Killing
  • Breaking bones

This is not meant to be an exhaustive list. Generally, threats where the action is severe enough to cause serious bodily injury or death will be flagged.

This applies to threats of future violence [“if i see you around here again, i’m gonna do X”], descriptions of past violence by the speaker [“i did Y last time he crossed me”], and calls for violence in the imperative [“Z that guy!”].

🚧

NOTE:

To be flagged as Level 3, the threat needs to be perceivable as serious, deliberate and realistic. If context clearly indicates that violent language is used as a joke, exaggeration, or to be dramatic, this will be flagged as Level 1 instead (examples below).

Other cases that are captured by Level 3 include:

  • Threatening a person’s pets or animals. This generally applies only to threats against animals that would have a direct negative impact on a person or group
  • Threatening buildings or property when the act would have a direct negative impact on a person and carry serious risk of bodily harm
  • Threats in the form of a question
  • Violent threats in which humans are referred to as animals (e.g., “cockroaches,” “pigs,” “monkeys”)

Level 2: Incitement

This class captures cases that are not direct threats from the speaker, but nonetheless encourage, provoke, or support serious violence. The main point of differentiation from content captured by Level 3 is the use of hypothetical rather than direct language. For example, “I will do X” is classified as Level 3, but “I might do X,” “I want to do X,” or “Someone should do X” would be classified under Level 2. In other words, language that indicates a desire for violence or possible violence falls under Level 2 (incitement) rather than Level 3 (direct threat).

To be clear, this captures cases where:

  • The speaker is indicating possible violence they might or want to commit in the future
  • The speaker is calling for the violent action to be committed by others

Examples of hypothetical or equivocating language that might support a classification of Level 2 instead of Level 3 are actions phrased as “should,” “would,” “could,” “maybe,” “might,” “want to,” “would like to,” and “needs to.”

Other cases that are captured by Level 2 include:

  • Content that incites or encourages destruction of or significant physical or financial damage to public property (streets, public buildings, monuments, or public spaces) or valuable personal assets (homes, businesses, cars), e.g., we need to just burn down the supreme court, someone should throw bricks through all their windows and loot their shit
  • Note: the difference between threats involving property in Level 3 and threats involving property in Level 2 is the explicit mention or very strongly implied threat of death or injury to people. For example, this can depend on the act (e.g., “bomb” or “blow up” versus “burn”).
  • Threats of self-harm, or encouraging another person to harm themselves
  • Calls for serious violence that can only be carried out by large-scale actors (i.e., a government or military) and are unlikely to be effected by the speaker

Level 1: Potentially violent, violent language, and neutral descriptions

This class generally captures violent language that is not clearly used as a direct threat or incitement, or isn’t severe enough to result in death or severe injury. This usually depends on context and other language used in the text. In many cases, content flagged by this class may still not be appropriate for children. Cases that are flagged as Level 1 include:

  • Denouncing, rejecting, discouraging, or criticizing serious violence, e.g., it’s absolutely unacceptable that the police continue to shoot unarmed people
  • Suggesting or calling for a stop to serious violence or accountability for serious violence, e.g., They burned down the entire building! Come on someone needs to be held responsible for this
  • Objective, neutral descriptions of violent acts or events without support or encouragement, such as in the context of reporting, journalism, or recounting a story. For example, The armed suspect shot the victim 10 times, since then, she’s been receiving death threats
  • Phrases involving legal use of guns or gun rights, such as target shooting, hunting as a sport, reasonable self-defense, use of guns in military or policing contexts, and mentioning guns or use of guns without additional evidence of illegal use or violent intent
  • Phrases that refer to abortion or terminating a pregnancy as “killing babies” or other violent language
  • Calls for capital punishment or the death penalty in a clearly legal context (e.g., explicit references to a judge, court, or judicial system). If capital punishment or methods of execution are called for without clear reference to a judicial system or legal bodies, this would be Level 2 instead
  • Threats or descriptions of violence that are not severe enough to cause serious injury in the context of the text (punching, kicking, slapping, fighting, etc.). Note: Other language in the text that suggests a more serious threat can bump these up to Level 3. For example, ima kick your ass would be Level 1, but ima kick your teeth in is easily Level 3 (specific, serious).
  • Threats that involve damaging or destroying personal belongings like phones, bicycles, computers, clothing, etc. For example: I swear I’ll break your laptop in half if you keep playing games all day. Remember: Threats against more substantial property assets like homes, cars, and businesses are classified as at least Level 2 because they likely involve risk of physical harm
  • Non-actionable statements that wish death on someone by non-violent means, e.g., he needs to get covid and die already
  • Violent language directed at an unspecified subject, usually indicated by “it,” “that,” “those,” “them,” etc. For example: burn it down baby!, i’m happy to let them just bleed out, kill it!. In this case, there’s not enough context for the message to be clearly understood as a threat or call for violence. But if other language clearly directs the violence towards a person or group (e.g., use of profanity to describe human subjects), the text will be classified as Level 2 or 3.
  • Jokes or seemingly ironic statements that mention violence of any kind

Level 0: Benign

This class captures all other content with no violent or threatening language, or language that could have a violent meaning but is used in a non-threatening way. For clarity, this includes:

  • Common sayings, slang, and figures of speech. As some examples: I have so much time to kill these days, They left me hanging for weeks, He finally dropped the bomb and broke up with his girlfriend, can you shoot them an email please etc.
  • Referring to violence in non-realistic, non-threatening, or sanctioned contexts like video games and sports, e.g., bro I was on fire that game, 25 kills 2 deaths, did you watch the UFC fight last night?
  • Referring to accidental or unintentional injuries like car crashes, bike accidents, etc.
  • Mentions of deaths due to non-violent causes

Bullying Model Head

This model head generally captures insults and toxicity aimed at particular individuals or groups of people. For this purpose, language intended to frighten, harm, intimidate, mock, or cause distress is considered bullying. This can include harassing someone or threatening violence, cursing at someone, or disparaging their physical appearance, intellect, capability, personality, etc.

📘

NOTE:

Depending on the phrases used, overlap between bullying and other model heads is common, typically Hate and Violence.

Here’s a breakdown of subject matter captured by each class/severity level:

Level 3: Severe bullying and toxicity

This class captures:

  • Slurs and profane name calling directed at a specific person or at a group of people, including variants, abbreviations, and misspellings. This includes race-based slurs, gendered insults, slurs related to intellectual disability, etc.
  • Encouraging suicide or severe self-harm (e.g., cutting). This includes the abbreviation “kys” = "kill yourself" if used in this context
  • Severe threats of physical or sexual violence towards a person or group. To be classified as Level 3, the threat must be targeted, serious, explicitly malicious, and realistic (see description of Violence Level 3 above). Phrases need to be explicitly targeted to specific individuals or groups to be classified as Level 3. If violent language is obviously used as a joke, exaggeration, or figure of speech, or if there’s not enough context to tell the language is likely a threat, this would be Level 0.

Level 2: Bullying

This class captures messages that aren’t as egregious as Level 3, but could still be considered harmful, insulting, or threatening. Specifically, the Level 2 class flags:

  • Aggressive or derogatory cursing towards an individual or group. The main difference between Level 3 and Level 2 content is whether or not the individual is being called a profane word. Typically, Level 2 captures instances where profanity is used as a verb or action statement or to amplify the effect of another, non-profane statement. For example, "over here, fuckface" would be Level 3, whereas "fuck you" or "you guys are fkin idiots" would be Level 2
  • Non-profane insults directed at a specific individual or group of people. This captures both negative descriptors (e.g., "ugly," "stupid," "freak," "pathetic," "loser,") and insults that refer to people as animals or objects (e.g., "you're garbage," "that guy is such a snake," "you smell like a pig").
  • Encouraging non-severe self-harm, e.g., you should slap yourself
  • Non-severe threats of violence towards specific individuals or groups, including encouraging the same. Generally this includes punching, slapping, kicking, etc. For these purposes, threats where the action would not cause bleeding, severe bodily injury, loss of consciousness, or death are considered non-severe.
  • Attempts to silence someone or exclude them from a group or conversation. For example, telling someone to shut up, telling someone to get out, fuck off, etc. are always classified as Level 2. Note: If these statements also include profane insults, slurs, etc., they would instead be classified as Level 3.
  • Disparaging statements where the bullying language cannot be pinned down to a particular word or phrase, but instead refers holistically to someone's physical appearance, intellect, competence, personality, etc. with the intent to bully or hurt. For example: yuck no one would ever want to hang out with you

Level 1: Possible bullying, non-bullying profanity

This class generally captures content that uses potentially bullying language in non-bullying contexts. Generally, either the language used or the way that language is used is not serious enough to flag the text as Level 2 or Level 3, but still might be considered inappropriate or controversial. Specifically, this class captures:

  • Referring to people with profanity in a non-bullying context or profanity that's not directed at specific people, e.g., bitches be doing anything for the gram, damn he a badass mofo
  • Using the exact spelling "nigga" in a neutral, colloquial, or positive context [Note: ALWAYS Level 3 if used negatively or spelled with an "r"]
  • Certain non-profane words (e.g., "ugly," "stupid," "freak," "pathetic," "loser,") when used alone, or when there is not enough context to tell that the word is intended to describe a person
  • Mentions that the speaker is seriously considering suicide or self-harm. Note that similar language used as an exaggeration or to be dramatic is NOT flagged.
  • Threats that are not directed at a particular person or group of people, even if severe (this would be flagged as Violence instead)
  • Neutral, observational statements that describe past bullying, e.g., he just called her a slut and left
  • Neutral phrases or descriptions that may be interpreted as harmful by the reader, but are not clearly intended to be hurtful, e.g., I wouldn't say you're hot, but you're definitely not ugly either
  • Self-deprecating statements, even if using profanity, e.g., i'm a basic bitch what can I say
  • Statements that negate bullying or defend someone from bullying, e.g., you are NOT ugly
  • Playful teasing that may include bullying language (e.g., "booo you suck," "omg stop I hate you," "lame") but is harmless in context.

Level 0: Benign

This class flags all other content with no bullying-related language, or bullying-relevant language used in a non-bullying context. To clarify, content not captured by the Level 3, Level 2, and Level 1 classes includes:

  • Profanity that is not related to, used to describe, or directed at people (this includes sexual language).
  • Associations between people and animals in neutral or positive context such as a pet name or nickname, or as a positive attribute. E.g., he was a cheetah on the track today
  • Bullying-relevant language like violence or self-harm used as a figure of speech, exaggeration, or common saying
  • Mocking, degrading, or criticizing hate groups
  • Neutral, non-bullying statements and observations (e.g., "you're being rude right now")

Child Exploitation Model Head

This class flags content that mentions or explicitly alludes to child sexual exploitation, including child pornography (both soliciting and distributing material), child sexual abuse, and solicitations or descriptions of sexual activity with or by children under the age of 18.

Level 3: Child Exploitation

Specifically, this includes:

  • Messages that solicit, advertise, or attempt to distribute child pornography. This includes the acronym “CP” if used in the context of sending or receiving links, or alongside other sexual language. References to trading links (“STR”, “S2R” = send to receive), buying links, and distribution platforms (e.g., Dropbox, MEGA) are also flagged if context indicates they are likely to refer to child pornography (e.g., also mentions kids, teens, CP, or age ranges below 18)
  • Messages that solicit or describe sexual activity with underaged individuals or child sexual abuse of any kind. This includes sexualized touching, and non-touch abuse like stripping, voyeurism, photographing pornography, etc.
  • Users who are identifiably underaged (e.g., I’m 14 looking for X) soliciting underaged pornography or sex. This also applies to identifiably underaged users offering to distribute child pornography (e.g., nude photos of themselves or friends)
  • Sexual roleplay or fantasies involving underaged children

Level 0: No Child Exploitation

All other content that does not contain the above is flagged as Level 0. This includes:

  • Cases where there is not enough context to conclude that the message is referring to child sexual exploitation or child pornography. For example, Looking to buy Mega links with no other language associating “Mega links” with child pornography is not enough to trigger a Level 3 classification. Similarly, descriptors like “young” are not sufficient without additional context to infer that “young” means under 18.
  • Sexual content or mentions of pornography involving adults. In the absence of specific language about age, children, teens, etc., we presume that sexual content and messages refer to legal, of-age conduct.

Promotions Model Head

This class captures content that advertises or promotes a product, service, account, or event when the text also includes a suggested action (e.g., reposting, donating) or an attempt to redirect to another application or platform (e.g., links, contact information).

Level 3: Promotion

Types of messages flagged by the promotions class include:

  • Messages that advertise a giveaway or free service (or spam and phishing attempts disguised as a giveaway or free service), e.g., $5 special on custom videos on my site: https://example.com, Feeling depressed? call our free hotline on (000)-000-0000
  • Requests for follows or subscriptions, e.g., Want to learn more about Bitcoin? Follow me on Twitter for more awesome content!, I'm on twitter now! Find me at https://example.xyz
  • Soliciting donations, e.g., Send $5 to my Venmo for a chance to win custom-made stuff! https://t.co/CO6RT
  • Promoting events and activities with a call to action (e.g., to repost, attend, meet, or join). For example, Please Repost! Join us at Decker Plaza for the memorial at 10:00pm

Level 0: No Promotion

Messages that do not promote or advertise a product, service, account, or event, or that do not attempt to redirect to another platform or prompt the reader to act, are classified as Level 0 (no promotion). For clarity, this includes:

  • Mentioning follower or subscriber count on a social platform or website without a request to follow/subscribe, e.g., I have 10K followers on Twitter
  • Expressing gratitude for past follows, subscriptions, or donations, e.g., Thanks so much for subscribing to my channel!
  • Mentioning products, services, accounts, or events in casual conversation, e.g., I don't know dude, if you don’t wanna suck at cooking just watch Gordon Ramsay’s youtube videos
  • Jokes, sarcasm, or ironic statements, e.g., For every 10000$ you donate to us, 1$ will go to the homeless people! if you ask where the rest of the money is going? then you are just a right-wing bigot!

Gibberish Model Head

This model head flags keyboard spam and messages that have no comprehensible meaning.

Level 3: Gibberish

This class flags messages where the entire message is incomprehensible. Examples would be strings along the lines of grljwbrg, dfoibhnlfadknbsdfg.

Level 0: No Gibberish

Messages that convey an intelligible meaning through words and phrases are not classified as gibberish, including acronyms, emojis, character replacements, and non-Latin text. For clarity, each of the following counts as a word or phrase:

  • Any string that contains an English word with four letters or more, even if part of a larger string, e.g., qpwelcome-1po (avoids nearly all user names)
  • Recognizable acronyms and slang (e.g., lmfao, stg)
  • Intentional pattern repetition of short base words (e.g., "hahahahaha" and foreign language variants, "pewpewpewpew", "lololololol"). If the base unit being repeated is not a word (e.g., "asd"), then it will be flagged as gibberish
  • Elongated words or acronyms that use extra letters (e.g., "ooooooooof", "lmaooooooo", "wtffff", "no waaaaay")
  • Emojis and stylized text
  • Non-Latin text (e.g., Arabic script, Japanese characters)
  • Brand names
  • Text that contains numbers only
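
The rules above can be loosely approximated in code. The sketch below is a hypothetical heuristic for illustration only: Hive's actual gibberish detection is a trained classifier, and the word list and base-unit list here are stand-in assumptions, not part of the product.

```python
import re

# Stand-in word/slang lists for illustration; a real system would use a
# full dictionary. These names and contents are assumptions, not Hive's.
COMMON_WORDS = {"welcome", "hello", "thanks"}
SLANG = {"lmfao", "lmao", "stg", "lol", "oof", "wtf"}
REPEAT_BASES = {"ha", "lol", "pew"}  # word-like units that repeat ("hahaha")

def looks_intelligible(message: str) -> bool:
    """Rough approximation of the Level 0 (no gibberish) criteria."""
    text = message.lower()
    # Text that contains numbers only is not gibberish
    if text.strip().isdigit():
        return True
    # An embedded English word of four or more letters, e.g., "qpwelcome-1po"
    if any(word in text for word in COMMON_WORDS):
        return True
    # Recognizable acronyms/slang, including elongated forms ("wtffff")
    collapsed = re.sub(r"(.)\1+", r"\1", text)  # collapse repeated letters
    if text in SLANG or collapsed in SLANG:
        return True
    # Intentional repetition of a short word-like base ("hahahahaha");
    # repetition of a non-word base ("asdasdasd") stays gibberish
    for base in REPEAT_BASES:
        if (text.startswith(base) and set(text) <= set(base)
                and len(text) >= 2 * len(base)):
            return True
    return False
```

A string like grljwbrg fails every branch and would be treated as gibberish, while qpwelcome-1po passes on the embedded-word rule.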

Phone Number Model Head

This model head detects phone numbers in message strings, including international formats.

Level 3: Phone Number

This class flags phone numbers, using context and syntax to distinguish them from other strings of numbers.

Level 0: No Phone Number

All other messages that do not contain a number indicated to be, or recognizable as, a phone number are classified as Level 0 (no phone number).
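
To see why context matters and syntax alone is not enough, consider a naive, purely syntactic matcher. The sketch below is a hypothetical illustration, not Hive's detector: a regex like this fires on any phone-shaped digit run, including order numbers and IDs, which is exactly the ambiguity the model's contextual signals are meant to resolve.

```python
import re

# Purely syntactic phone-number pattern -- illustration only.
PHONE_LIKE = re.compile(
    r"(?:\+?\d{1,3}[\s.-]?)?"   # optional country code, e.g. +44
    r"(?:\(\d{2,4}\)[\s.-]?)?"  # optional area code, e.g. (020)
    r"\d{3}[\s.-]?\d{3,4}"      # local digits, e.g. 555-0123
)

def contains_phone_like_string(message: str) -> bool:
    """True if the message contains any phone-shaped digit sequence,
    with no regard for the surrounding context."""
    return PHONE_LIKE.search(message) is not None
```

Note that a message like "order total came to 5551234 points" matches this pattern even though no phone number is present; a context-aware model can use the surrounding words to classify it as Level 0.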

Spam Model Head

Level 3: Spam

This class flags text that contains a link intended to redirect other users to a different site or platform. This includes links and link shorteners, email addresses, and phone numbers. Links to popular and reputable websites, such as news organizations and publications, YouTube, large social platforms, Wikipedia, etc. are not flagged as spam.

Level 0: No Spam

This class is flagged when the text does not contain such a link or includes a link to a popular and reputable website. Other types of messages typically associated with spam content may be flagged by the Promotions model (e.g., advertising a service, platform, or account) or the Gibberish model (unintelligible messages).
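
Putting the binary and multi-level heads together, a platform might act on the scores along the following lines. This is a hypothetical sketch: the response shape, field names, and thresholds below are assumptions for illustration, not Hive's actual API schema (consult the API reference for that).

```python
# Binary heads return only 0 or 3; multi-level heads return severities 0-3.
# Head names here mirror the sections above but the exact keys are assumed.
BINARY_HEADS = {"spam", "promotions", "phone_number",
                "child_exploitation", "gibberish"}

def should_flag(scores: dict[str, int], min_severity: int = 2) -> bool:
    """Flag a message if any binary head fires (Level 3), or any
    multi-level head meets the platform's chosen severity threshold."""
    for head, level in scores.items():
        if head in BINARY_HEADS:
            if level == 3:           # for binary heads, 3 means "present"
                return True
        elif level >= min_severity:  # severity threshold for 0-3 heads
            return True
    return False

# Example: spam fires as a binary head, so the message is flagged even
# though the bullying severity (Level 1) is below the threshold.
example_scores = {"spam": 3, "bullying": 1, "violence": 0}
```

Because each head is classified independently, a platform can tune `min_severity` per head in practice; a single global threshold is used here only to keep the sketch short.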