Detailed Class Descriptions – Text Moderation

Overview

This page gives a more comprehensive description of Hive’s Text Moderation classes. If you need more details or have further questions on certain types of content after reading the main Text Moderation page, look here. We’ll describe what each model head covers as clearly as possible and provide examples, including multi-level classes where applicable.

All platforms have different moderation requirements and risk sensitivities, so we recommend that you consult these descriptions carefully as you decide which classes to moderate (and at what severity).

📘

NOTE:

To determine which classes and severity levels cover specific types of text, it may be helpful to search this page (Ctrl/Cmd + F) with relevant terms. You can also use the sidebar on the right to navigate to classes of interest.

General Notes

Before looking at subject-matter breakdowns for each class, it’s helpful to understand the following:

  1. Hive’s text classifier is multi-headed. Classifications from each model head (e.g., sexual, hate, violence, etc.) are made independently and returned together. If a message scores highly in multiple classes, it meets our definition of each class.
  2. Some model heads define four classes that capture different severity levels. In the API response, these multi-level classes are represented by severity scores spanning integer values from 0 (benign) to 3 (most severe). Other model heads – spam, child exploitation, promotions, and phone numbers – are binary. For binary model heads, content will be classified as either Level 3 or Level 0. In these cases, Level 3 simply means that the relevant content is present; it is not intended to be an indicator of severity.
  3. Just because a message is classified as Level 0 for one model head does not mean the content is clean, since another model head may still flag it. For this reason, we will describe the Level 0 classes as a non-exhaustive list of subject matter that is not captured by the higher severity classes but that might help distinguish between what is and is not flagged (e.g., borderline content, content captured by other classes).
  4. Usernames that use intelligible words and phrases (including character replacements) generally follow the same rules as messages and other text strings.
  5. The lists given in these subject matter breakdowns are not exhaustive, and the context around how a particular word or phrase is used can have a significant impact on the classification. We’ll provide descriptive examples and rules around contextual factors where relevant, but you can also use our web demo to get a rough impression of other cases of interest.
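As a rough illustration of notes 1 and 2, the sketch below applies a per-head policy to a set of severity scores. The flat `scores` dictionary, the head names used as keys, and the `flagged_heads` helper are all assumptions made for illustration; they are not the actual API response format, so consult the API reference for the real response shape.

```python
# Heads that the notes above describe as binary (Level 3 = present, Level 0 = absent).
# The key names are hypothetical placeholders, not confirmed API identifiers.
BINARY_HEADS = {"spam", "child_exploitation", "promotions", "phone_number"}


def flagged_heads(scores, thresholds):
    """Return the heads whose severity meets the platform's threshold.

    scores:     mapping of head name -> integer severity (0 to 3), assumed shape
    thresholds: mapping of head name -> minimum severity the platform acts on
    Heads without a configured threshold are ignored.
    """
    flagged = {}
    for head, level in scores.items():
        if head not in thresholds:
            continue
        if head in BINARY_HEADS:
            # For binary heads, Level 3 only signals presence, not severity.
            if level == 3:
                flagged[head] = "present"
        elif level >= thresholds[head]:
            flagged[head] = level
    return flagged


# Hypothetical scores for one message; each head is scored independently.
scores = {"sexual": 1, "hate": 0, "violence": 3, "bullying": 3, "spam": 0}
thresholds = {"sexual": 2, "hate": 1, "violence": 2, "bullying": 2, "spam": 3}
print(flagged_heads(scores, thresholds))  # {'violence': 3, 'bullying': 3}
```

Because the heads are independent, the same message can be flagged by several of them at once, as with violence and bullying in this example.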

Sexual Model Head

This is the main model head for flagging sexual messages and text content like solicitation, descriptions of sexual activity, and mentions of pornography. Slang that has a sexual connotation and certain emojis will also be flagged if there is enough context to infer sexual meaning.

Here’s a breakdown of subject matter captured by each class/severity level.

Level 3: Sexually Explicit

Captures text that explicitly references sexual activity, genitalia, and pornography. Text will also be classified as Level 3 if it explicitly refers to illegal or non-consensual activity. Specifically, Level 3 flags the following cases:

-Phrases that explicitly or indirectly reference sexual intercourse or sexual activity with high confidence, including:

  • Sexual intercourse or penetration of any kind
  • Oral sex
  • Anal sex
  • Orgasm or sexual bodily fluids

-Phrases that explicitly reference or describe an individual masturbating, including:

  • Fingering
  • Jerking off and similar terms
  • Orgasm and sexual bodily fluids (see above)
  • Other references to masturbation e.g., brb gonna go stroke my wood
  • Note: Mentioning masturbation in an informational context without reference to a particular individual or act is classified as Level 1

-Phrases that mention or discuss pornography, porn websites, porn actors, or cam sites

-Phrases that mention or describe the use of sex toys by a person, including:

  • Dildos, fleshlights, vibrators, buttplugs, etc.
  • Common objects used as sex toys if there is context to conclude sexual intent
  • Note: Sex toys that are simply described without referring to individual use are classified as Level 1.

-Phrases that explicitly suggest sexual desire or turn-ons, including:

  • “Horny” and similar words
  • “Hard,” “wet,” and similar words, if context is sufficient to infer sexual meaning
  • “MILF” and variants

-Phrases that include explicit mentions of genitals, sexual body parts, and sexual bodily fluids (including slang and synonyms)

  • This includes emojis commonly used to substitute for words describing body parts and sexual acts
  • “Butt” and synonyms are classified as Level 3 if used in the context of anal sex or penetration, typically Level 2 otherwise.
  • “Dick,” “pussy,” “asshole,” etc. are classified as Level 1 if used as an insult or in non-sexual contexts. Similarly, if “balls” is used as a colloquialism for courage, this would also be classified as Level 1 (see Sexual Level 1 description for examples)

-Phrases that use “whore” or “slut” to describe a person; the context does not necessarily need to be sexual

-Phrases that explicitly describe, condone, promote, or threaten non-consensual or unlawful sexual activity with enough context to infer sexual meaning, including:

  • Rape and Molestation
  • Incest
  • Pedophilia
  • Prostitution
  • Bestiality

Level 2: Sexual

This class captures text that references sexual content like nudity and suggestive messages that are not necessarily explicit or use tamer language. Specifically, the level 2 class will flag:

-Phrases that explicitly mention nudity, but do not reference genitalia, breasts, etc., including:

  • References to being nude or undressing
  • Requesting or mentioning sending or receiving nude photos (note: the reference to nude photos must be explicit)
  • References to strippers or strip clubs

-Phrases that reference foreplay or kissing, licking, or sucking non-genital body parts (feet, face, neck, inner thighs, etc.)

-Phrases that implicitly suggest sexual desire or sexual intent ("turned on," "aroused," etc.)

-Phrases that comment suggestively on frequently sexualized body parts of a particular person (usually accompanied by names or pronouns like “your,” “her,” “his,” etc.).

  • This always includes “butt,” “ass,” and relevant synonyms and emojis, but can also include legs or lips if the accompanying language is suggestive enough, e.g., your legs are so sexy
  • Note: phrases that do the same for genitals and breasts are classified as Level 3

-Phrases that reference lingerie or revealing underwear (jockstraps, thongs, panties, etc.) if mentioned in connection with a wearer or human subject (usually accompanied by names or pronouns)

  • If lingerie or underwear is simply described with no clear reference to a person (e.g., a description of a product or general comment), this would instead be classified as Level 1, e.g., that store sells really comfortable bras

-Phrases that use “ho,” “hoe,” or “thot” to refer to a person or group of people

Level 1: Potentially Sexual

This class captures references to sexuality and relationships that are not suggestive or explicit, like pet names, relationship status, and non-sexual compliments on appearance. Depending on context, Level 1 can also include non-sexual uses of potentially sexual words (e.g., profanity, insults), as well as sexual subject matter discussed in an informational, medical, or educational setting.

📘

USAGE:

Most of the content flagged by the Level 1 class would likely be considered benign by adults, but may not be appropriate for children.
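One way to act on this usage note is to keep a separate minimum severity per audience. The sketch below is only an assumption about how a platform might encode that choice; the audience names, the `SEXUAL_MIN_LEVEL` values, and the `should_moderate_sexual` helper are hypothetical and not part of the API.

```python
# Hypothetical minimum sexual-head severity that triggers moderation, per audience.
# These values are example policy choices, not recommendations from this page.
SEXUAL_MIN_LEVEL = {
    "children": 1,     # Level 1 content may not be appropriate for children
    "general": 2,      # act on suggestive and explicit content
    "adults_only": 3,  # act only on explicit content
}


def should_moderate_sexual(level, audience):
    """Return True if a sexual-head severity (0 to 3) meets the audience's threshold."""
    return level >= SEXUAL_MIN_LEVEL[audience]


print(should_moderate_sexual(1, "children"))  # True
print(should_moderate_sexual(1, "general"))   # False
```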

Specifically, the level 1 class will flag:

-Phrases that mention affectionate activities besides sex, masturbation, and foreplay, including:

  • Kissing, making out
  • Cuddling
  • Hooking up if there is not enough context to tell that the phrase refers to sex
  • Note: If “hook up” or similar phrases clearly refer to sex based on context, this would instead be classified as Level 3. For example: we hooked up last night but the condom broke

-Phrases that include innocent or non-sexual flirting. This includes:

  • Compliments on a person’s physical appearance or non-sexualized attributes, e.g., your dimples are really cute
  • Lovey or sweet messages between romantic partners
  • Phrases that include pet names in a flirtatious way ("cutie," "babe," "sweetie")
  • Note: “Sexy” will be classified as at least Level 1, even when used to describe objects or non-sexual actions, e.g., that was a sexy shot

-Phrases that discuss level of sexual experience or relationship status

-Phrases that mention sexuality or sexual orientation in a casual, non-suggestive way

-Non-sexual uses of words that would otherwise be sexual, including:

  • “Dick,” “cock,” “pussy,” and similar terms when used as an insult or to convey weakness or disrespect
  • “Fucker” or “motherfucker” when used as an insult or to describe a person in a non-sexual context
  • “Fuck me/him/her/them” if used without additional/sexual context
  • “Balls” when used as slang for courage
  • “Slut” or “whore” when used to convey attention-seeking behavior or fondness for non-human and non-sexual objects e.g., i hate all these self absorbed social media whores, i’m such a slut for chipotles queso
  • “Porn” when used in non-sexual contexts, like describing other types of photos/videos (e.g., Earth porn, food porn)

-Potentially sexual subject matter if discussed in an informational, educational, or medical context, including:

  • Medical procedures and conditions, STDs, certain medications (e.g., Viagra, Plan B), certain types of contraception (e.g., condoms)
  • Discussions of human anatomy, sex, or reproductive health in medical contexts
  • Discussion of strip clubs, pornography, and sex workers in an informational or news context, e.g., Porn stars made over $1.2B in pay in 2019
  • Discussion of sperm banks, sperm donations, and animal sperm in breeding contexts
  • Discussion of sex crimes or illegal sex acts in a news context if discussed purely matter-of-fact and without actual description of sexual activity, e.g., sexual assault has become a growing problem on college campuses

-Sexual terms used as a figure of speech and not directed towards a human subject, e.g., I got absolutely railed by that exam

-Instances where the speaker is:

  • Rejecting a sexual advance
  • Rejecting, expressing disapproval, or non-consent on any mentioned sexual subject matter, e.g., i’m not gonna send nudes lol

Level 0: Non-Sexual

Messages that do not contain sexual or suggestive content or language are classified as Level 0 (benign). This includes some cases where potentially sexual words are used as a figure of speech, in other contexts, or when there is not enough context to infer sexual meaning or intent. For clarity, content that is not flagged as Level 1, Level 2, or Level 3 includes:

  • Phrases that mention hugging or other non-romantic or parental affection
  • “Breast” when referring to food (e.g., turkey breast, chicken breast)
  • Internal reproductive organs (e.g., uterus, ovaries)
  • Pregnancy
  • The word “sex” when used in the context of gender, e.g., my sex is female
  • “Beautiful,” “hot,” “cute” etc. used to describe or compliment objects, things, or oneself e.g., I look HOT today
  • Insults or figures of speech using non-sexual body parts
  • “Fuck” used as profanity or as an insult without sexual meaning (this is flagged as profanity and/or bullying instead)
  • “Naked” or “nude” when not describing a person or used metaphorically, e.g., I feel naked without my headphones
  • “Suck” used descriptively or as an insult, e.g., You suck, their pizza fucking sucks, suck it!

Hate Model Head

This is our main model head for flagging hate speech directed at minorities and protected groups, including slang, slurs, and hateful use of emojis (e.g., different skin tones) depending on context.

🚧

NOTE:

Hateful or discriminatory language targeted at a specific individual or a specific group of people (e.g., a group of friends or classmates) is typically flagged as bullying instead of or in addition to hate.

Level 3: Hate Speech

This class flags language that is overtly discriminatory, threatening, directed, and/or violent. Slurs, demeaning terms, comparisons of minority groups to animals, and hateful ideology are all categorized as Level 3. The scope of content covered by Level 3 aligns roughly with the legal definition of hate speech against protected groups (including racial minorities, religious minorities, LGBT+ people, women), although slurs or demeaning terms directed at white people are also covered. Immigration status and nationality are also considered sensitive. Specifically, the Level 3 class will flag:

  • Phrases that call for or justify violence against a particular group
  • Phrases that describe a particular group as physically or morally inferior
  • Phrases that describe a particular group as criminals
  • Phrases that refer to a particular group as animals, sub-human, or non-human. Also applies to words like “trash,” “scum,” and “savages”
  • Slurs of any kind, including variants, intentional misspellings, and character substitutions
  • Support or promotion of hateful ideology or hate groups, including:
  1. Content that speaks positively about symbols, slogans, gestures, groups, policies, or individuals that are explicitly hateful (Nazism, the Ku Klux Klan and other white power groups, hateful policies or atrocities toward protected groups)
  2. Content that denies or makes light of well-documented atrocities or violent events against minority groups or individuals (Holocaust or other genocide denial, hate crimes, lynchings, etc.)

Level 2: Hateful

This class flags language that is a bit softer than hate speech but might exclude, silence, intimidate, or dissuade individuals from participating or feeling safe in a discussion or online community based on race, gender, sexuality, or religion. Specifically, the Level 2 class flags:

  • Phrases that perpetuate negative stereotypes, negative descriptions of different cultural practices, and references to specific physical traits used to stereotype groups
  • Statements that target protected groups on the basis of religious or moral beliefs, e.g., homosexuality is a sin
  • Certain slurs (e.g., "gay," "retarded") when used as casual insults rather than to target someone in the relevant protected group
  • Denying an individual’s gender identity (references to trans, non-binary, genderfluid etc. identities must be explicit)
  • Advocating for the removal of a protected group’s civil rights or legal protections
  • Promoting, condoning, or justifying exclusion, discrimination, or inequality on the basis of protected characteristics, e.g., it’s wrong for women to be the breadwinner, that should be a man’s job
  • Advocating violence against or destruction of religious texts, religious symbols, or places of worship
  • Language that is not explicitly hate speech, but degrades or implies lesser status of a protected group
  • Language that denounces, rejects, or criticizes the slurs flagged by Level 3 (if the message includes the word itself)

Level 1: Controversial

Generally, this class flags language around protected groups that might imply prejudice, bring up negative connotations, or provoke controversy or conflict. Content flagged as Level 1 would also generally be considered inappropriate for children. Specifically, Level 1 flags:

  • Statements challenging the validity of minority status, e.g., asians don't face discrimination

  • Neutral or negative references to hateful ideology, hate symbols, slogans, gestures, groups, or policies that are understood as hateful, e.g., Police are the 21st century KKK. Note: Statements that express support are instead classified as Level 3 (see above)

  • Controversial topics mentioned in connection with a particular group (even in an informational manner), including crime, politics, voting, morality, intellect, educational achievement, work ethic, health, incarceration, privilege, reproductive health, colonization, cultural or religious attire, censorship & free speech, and civil rights. For example: gay men are 30 times more likely to get AIDS, Blacks make up half the prison population, white privilege doesn’t exist

  • Statements that defend against negative stereotypes or denounce slurs

  • Statements that imply preference for some groups over others, e.g., I’m not attracted to Asians tbh

  • Statements that reference discrimination or inequities faced by a particular group, e.g., he’s only being attacked because he’s black

  • Controversial or vulgar terms related to a protected group when used in a non-hateful context and/or not directed toward an individual. For example, using “gay” or “retarded” with a negative connotation in reference to things, actions, or places, e.g., math is gay, that movie was retarded

  • Colloquial, humorous, or race-neutral uses of the spelling “nigga.” Context is very important – if any version of the word is used as a slur, or in a racialized context, it is automatically classified as Level 3.

  • Statements that suggest or imply affiliation with or support for a hate group, even if not serious, e.g., I’ll be partying with Hitler in hell

Level 0: Not hateful or controversial

Messages that do not contain hateful, offensive, or potentially controversial content related to protected groups or identity. To be clear, the following would not be flagged by Level 1, Level 2, or Level 3:

  • Identity, race, religious affiliation, etc. used in neutral or descriptive contexts, e.g., I’m black, my friend Jason is Jewish
  • Informational statements about protected groups in non-controversial contexts, e.g., Muslims celebrate Ramadan this month
  • Profanity that has a gendered undertone (“slut,” “bitch,” “whore”) when no other hateful content is present. Note: these words may instead be flagged as bullying or sexual, depending on context

Violence Model Head

This model head identifies text content that mentions violence, including threats toward an individual or group, encouraging or calling for violence, descriptions of past violence, self-harm, and other topics.

Here’s a breakdown of subject matter captured by each class/severity level:

Level 3: Serious Threats

This class flags violent threats that are explicitly malicious, intentional, realistic, and also involve severe physical or sexual violence. Level 3 covers threats where the speaker is directly threatening the violence or issuing a command or direct call for the violence. Actions that are severe enough to be flagged as Level 3 are:

  • Stabbing
  • Beating
  • Torture
  • Shooting
  • Kidnapping
  • Rape
  • Hanging
  • Killing
  • Breaking bones

This is not meant to be an exhaustive list. Generally, threats where the action is severe enough to cause serious bodily injury or death will be flagged.

This applies to threats of future violence [“if i see you around here again, i’m gonna do X”], descriptions of past violence by the speaker [“i did Y last time he crossed me”], and calls for violence in the imperative [“Z that guy!”].

🚧

NOTE:

To be flagged as Level 3, the threat needs to be perceivable as serious, deliberate and realistic. If context clearly indicates that violent language is used as a joke, exaggeration, or to be dramatic, this will be flagged as Level 1 instead (examples below).

Other cases that are captured by Level 3 include:

  • Threatening a person’s pets or animals. This generally applies only to threats against animals that would have a direct negative impact on a person or group
  • Threatening buildings or property when the act would have a direct negative impact on a person and carry serious risk of bodily harm
  • Threats in the form of a question
  • Violent threats in which humans are referred to as animals (e.g., “cockroaches,” “pigs,” “monkeys”)

Level 2: Incitement

This class captures cases that are not direct threats from the speaker, but nonetheless encourage, provoke, or support serious violence. The main point of differentiation from content captured by Level 3 is the use of hypothetical rather than direct language. For example, “I will do X” is classified as Level 3, but “I might do X,” “I want to do X,” or “Someone should do X” would be classified under Level 2. In other words, language that indicates a desire for violence or possible violence falls under Level 2 (incitement) rather than Level 3 (direct threat).

To be clear, this captures cases where:

  • The speaker is indicating possible violence they might or want to commit in the future
  • The speaker is calling for the violent action to be committed by others

Examples of hypothetical or equivocating language that might support a classification of Level 2 instead of Level 3 are actions phrased as “should,” “would,” “could,” “maybe,” “might,” “want to,” “would like to,” and “needs to.”
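To make the distinction concrete, the toy sketch below checks a message for the hedging phrases listed above. It is purely illustrative of the documented rule and is not how the classifier works: the model weighs full context, whereas this hypothetical `hypothetical_markers` helper only matches keywords.

```python
import re

# Hedging phrases taken from the list above. Longer phrases come first so that
# "would like to" is matched as a whole before the shorter "would".
HYPOTHETICAL_MARKERS = [
    "would like to", "want to", "needs to",
    "should", "would", "could", "maybe", "might",
]

_MARKER_RE = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in HYPOTHETICAL_MARKERS) + r")\b",
    re.IGNORECASE,
)


def hypothetical_markers(text):
    """Return the hedging phrases found in text, in order of appearance."""
    return [match.group(1).lower() for match in _MARKER_RE.finditer(text)]


print(hypothetical_markers("I will do X"))           # []
print(hypothetical_markers("Someone should do X"))   # ['should']
print(hypothetical_markers("I would like to do X"))  # ['would like to']
```

An empty result corresponds to the direct phrasing described for Level 3, while any match corresponds to the hypothetical phrasing described for Level 2. A keyword match alone says nothing about whether the message is violent in the first place.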

Other cases that are captured by Level 2 include:

  • Content that incites or encourages destruction of or significant physical or financial damage to public property (streets, public buildings, monuments, or public spaces) or valuable personal assets (homes, businesses, cars), e.g., we need to just burn down the supreme court, someone should throw bricks through all their windows and loot their shit
  • Note: a difference between threats involving property in Level 3 and threats involving property in Level 2 involves explicit mention or very strongly implied threat of death or injury to people. For example, this can depend on the act (e.g., “bomb” or “blow up” versus “burn”).
  • Threats of self-harm, or encouraging another person to harm themselves
  • Calls for serious violence that can only be carried out by large-scale actors (i.e., a government or military) and are unlikely to be effected by the speaker

Level 1: Potentially violent, violent language, and neutral descriptions

This class generally captures violent language that is not clearly used as a direct threat or incitement, or isn’t severe enough to result in death or severe injury. This usually depends on context and other language used in the text. In many cases, content flagged by this class may still not be appropriate for children. Cases that are flagged as Level 1 include:

  • Denouncing, rejecting, discouraging, or criticizing serious violence, e.g., it’s absolutely unacceptable that the police continue to shoot unarmed people
  • Suggesting or calling for a stop to serious violence or accountability for serious violence, e.g., They burned down the entire building! Come on someone needs to be held responsible for this
  • Objective, neutral descriptions of violent acts or events without support or encouragement, such as in the context of reporting, journalism, or recounting a story. For example, The armed suspect shot the victim 10 times, since then, she’s been receiving death threats
  • Phrases involving legal use of guns or gun rights, such as target shooting, hunting as a sport, reasonable self-defense, use of guns in military or policing contexts, and mentioning guns or use of guns without additional evidence of illegal use or violent intent
  • Phrases that refer to abortion or terminating a pregnancy as “killing babies” or other violent language
  • Calls for capital punishment or the death penalty in a clearly legal context (e.g., explicit references to a judge, court, or judicial system). If capital punishment or methods of execution are called for without clear reference to a judicial system or legal bodies, this would be Level 2 instead
  • Threats or descriptions of violence that are not severe enough to cause serious injury in the context of the text (punching, kicking, slapping, fighting, etc.). Note: Other language in the text that suggests a more serious threat can bump these up to Level 3. For example, ima kick your ass would be Level 1, but ima kick your teeth in is easily Level 3 (specific, serious).
  • Threats that involve damaging or destroying personal belongings like phones, bicycles, computers, clothing, etc. For example: I swear I’ll break your laptop in half if you keep playing games all day. Remember: Threats against more substantial property assets like homes, cars, and businesses are classified as at least Level 2 because they likely involve risk of physical harm
  • Non-actionable statements that wish death on someone by non-violent means, e.g., he needs to get covid and die already
  • Violent language directed at an unspecified subject, usually indicated by “it,” “that,” “those,” “them,” etc. For example: burn it down baby!, i’m happy to let them just bleed out, kill it!. In this case, there’s not enough context for the message to be clearly understood as a threat or call for violence. But if other language clearly directs the violence towards a person or group (e.g., use of profanity to describe human subjects), the text will be classified as Level 2 or 3.
  • Jokes or seemingly ironic statements that mention violence of any kind

Level 0: Benign

This class captures all other content with no violent or threatening language, or language that could have a violent meaning but is used in a non-threatening way. For clarity, this includes:

  • Common sayings, slang, and figures of speech. As some examples: I have so much time to kill these days, They left me hanging for weeks, He finally dropped the bomb and broke up with his girlfriend, can you shoot them an email please etc.
  • Referring to violence in non-realistic, non-threatening, or proscribed contexts like video games and sports, e.g., bro I was on fire that game, 25 kills 2 deaths, did you watch the UFC fight last night?
  • Referring to accidental or unintentional injuries like car crashes, bike accidents, etc.
  • Mentions of deaths due to non-violent causes

Bullying Model Head

This model head generally captures insults and toxicity aimed at particular individuals or groups of people. For this purpose, language intended to frighten, harm, intimidate, mock, or cause distress is considered bullying. This can include harassing someone or threatening violence, cursing at someone, or disparaging their physical appearance, intellect, capability, personality, etc.

📘

NOTE:

Depending on the phrases used, overlap between bullying and other model heads is common, typically Hate and Violence.

Here’s a breakdown of subject matter captured by each class/severity level:

Level 3: Severe bullying and toxicity

This class captures:

  • Slurs and profane name calling directed at a specific person or at a group of people, including variants, abbreviations, and misspellings. This includes race-based slurs, gendered insults, slurs related to intellectual disability, etc.
  • Encouraging suicide or severe self-harm (e.g., cutting). This includes the abbreviation “kys” = "kill yourself" if used in this context
  • Severe threats of physical or sexual violence towards a person or group. To be classified as Level 3, the threat must be targeted, serious, explicitly malicious, and realistic (see description of Violence Level 3 above). Phrases need to be explicitly targeted to specific individuals or groups to be classified as Level 3. If violent language is obviously used as a joke, exaggeration, or figure of speech, or if there’s not enough context to tell the language is likely a threat, this would be Level 0.

Level 2: Bullying

This class captures messages that aren’t as egregious as Level 3, but could still be considered harmful, insulting, or threatening. Specifically, the Level 2 class flags:

  • Aggressive or derogatory cursing towards an individual or group. The main difference between Level 3 and Level 2 content is whether or not the individual is being called a profane word. Typically, Level 2 captures instances where profanity is used as a verb or action statement or to amplify the effect of another, non-profane statement. For example, over here, fuckface would be Level 3, whereas fuck you or you guys are fkin idiots would be Level 2
  • Non-profane insults directed at a specific individual or group of people. This captures both negative descriptors (e.g., "ugly," "stupid," "freak," "pathetic," "loser") and insults that refer to people as animals or objects (e.g., "you're garbage," "that guy is such a snake," "you smell like a pig").
  • Encouraging non-severe self-harm, e.g., you should slap yourself
  • Non-severe threats of violence towards specific individuals or groups, including encouraging the same. Generally this includes punching, slapping, kicking, etc. For these purposes, threats where the action would not cause bleeding, severe bodily injury, loss of consciousness, or death are considered non-severe.
  • Attempts to silence someone or exclude them from a group or conversation. For example, telling someone to shut up, telling someone to get out, fuck off, etc. are always classified as Level 2. Note: If these statements also include profane insults, slurs, etc., they would instead be classified as Level 3.
  • Disparaging statements where the bullying language cannot be pinned down to a particular word or phrase, but instead refer holistically to someone's physical appearance, intellect, competence, personality, etc with the intent to bully or hurt. For example: yuck no one would ever want to hang out with you

Level 1: Possible bullying, non-bullying profanity

This class generally captures content that uses potentially bullying language in non-bullying contexts. Generally, either the language used or the way that language is used is not serious enough to flag the text as Level 2 or Level 3, but still might be considered inappropriate or controversial. Specifically, this class captures:

  • Referring to people with profanity in a non-bullying context or profanity that's not directed at specific people, e.g., bitches be doing anything for the gram, damn he a badass mofo
  • Using the exact spelling "nigga" in a neutral, colloquial, or positive context [Note: ALWAYS Level 3 if used negatively or spelled with an "r"]
  • Certain non-profane words (e.g., "ugly," "stupid," "freak," "pathetic," "loser") when used alone, or when there is not enough context to tell that the word is intended to describe a person
  • Mentions that the speaker is seriously considering suicide or self-harm. Note that similar language used as an exaggeration or to be dramatic is NOT flagged.
  • Threats that are not directed at a particular person or group of people, even if severe (this would be flagged as Violence instead)
  • Neutral, observational statements that describe past bullying, e.g., he just called her a slut and left
  • Neutral phrases or descriptions that may be interpreted as harmful by the reader, but are not clearly intended to be hurtful, e.g., I wouldn't say you're hot, but you're definitely not ugly either
  • Self-deprecating statements, even if using profanity, e.g., i'm a basic bitch what can I say
  • Statements that negate bullying or defend someone from bullying, e.g., you are NOT ugly
  • Playful teasing that may include bullying language (e.g., "booo you suck," "omg stop I hate you," "lame") but is harmless in context.

Level 0: Benign

This class flags all other content with no bullying-related language, or bullying-relevant language used in a non-bullying context. To clarify, content not captured by the Level 3, Level 2, and Level 1 classes includes:

  • Profanity when it is not related to people, not used to describe people, and not directed at people (includes sexual language).
  • Associations between people and animals in neutral or positive context such as a pet name or nickname, or as a positive attribute. E.g., he was a cheetah on the track today
  • Bullying-relevant language like violence or self-harm used as a figure of speech, exaggeration, or common saying
  • Mocking, degrading, or criticizing hate groups
  • Neutral, non-bullying statements and observations

Drugs Model Head (Beta)

This model head identifies text content that mentions the sale, possession, and usage of drugs. Across all severity levels, this includes references to buying, selling, or trading drugs, promotions of drug usage, and descriptions of past or present drug usage (recreational or otherwise). This head also flags text that discourages or speaks in a neutral tone about drug use, such as in informational or educational contexts.

Here’s a breakdown of subject matter captured by each class/severity level:

Level 3: Trading, promotion, or encouragement of drugs

This class flags content that describes the acquisition of drugs or text that explicitly promotes, advertises, or encourages drug use. Content that is severe enough to be flagged as Level 3 includes:

  • Phrases that describe buying, selling, or trading of drugs
  • Phrases asking for information regarding the acquisition of drugs
  • Phrases advertising drugs or drug use, such as I’m selling an eighth of weed for 30 bucks or You need a plug?
  • Phrases explicitly promoting or encouraging drug use to someone else, such as You should try [drug name]

Level 2: Discussions of recreational drug use

This class flags references to past drug acquisition or use as well as descriptions of recreational use that do not promote drugs to others. Specifically, the Level 2 class flags:

  • Phrases that describe recreational drug use but do not do any of the following:
    - Promote or encourage other people's drug use
    - Describe drug use related to rehab or recovery
    - Reference the intention of buying, selling, or trading drugs
  • Phrases that discuss past drug use and events without promoting drug use
  • Admissions of past experiences buying, selling, trading drugs
  • Phrases that ask questions implying an intention of drug use, e.g., Do you wanna smoke weed? or Do you have any weed on you?
  • Phrases that are drug-specific and likely used in a drug context

Level 1: Neutral discussions or discouragement of drugs

Generally, this class flags language around drugs that is neutral or informational, discouraging, or ambiguous in meaning. Specifically, Level 1 flags:

  • Phrases that include neutral, educational, or informational descriptions of drug use, such as news articles or medical contexts
  • Third-person descriptions of a drug-related event in a neutral tone, with no expressed intention of engaging in the drug-related behavior
  • Phrases discouraging or negating drug use
  • Neutral or informational questions that do not imply the speaker wants to engage in drug use
  • Text that includes a word or phrase with a dual meaning in an unclear context, e.g., coke or smoke

Level 0: No mention of drugs or drug use

This class includes content with no mention of drugs, as well as content that uses a drug-related word or phrase in a non-drug context. To be clear, the following would not be flagged by Level 1, Level 2, or Level 3:

  • Phrase does not mention drugs based on the context
  • Phrase contains a word with dual-intention (two meanings) but is more likely to be indicating normal usage of the word based on the context, e.g., You need to blow on it first, it's really hot or Was the coke cold enough? I just put it in the fridge
  • Phrase is drug-specific but is clearly being used in non-drug context, e.g., Her shoes are dope or What's the addy to the party?
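As a rough illustration of how the four Drugs severity levels might map to moderation actions, the sketch below assumes you have already extracted the integer severity (0 to 3) for this head from the API response. The enum names, the action strings, and the default thresholds are hypothetical choices for this example, not recommendations; each platform should pick thresholds that match its own risk sensitivity.

```python
from enum import IntEnum


class DrugsSeverity(IntEnum):
    """Severity levels for the Drugs head, named after the class descriptions above."""
    NONE = 0          # no mention of drugs, or drug words in a non-drug context
    NEUTRAL = 1       # neutral, informational, discouraging, or ambiguous
    RECREATIONAL = 2  # discussions of recreational drug use
    TRADING = 3       # trading, promotion, or encouragement of drugs


def action_for_drugs(severity: int, block_at: int = 3, review_at: int = 2) -> str:
    """Map a severity score to a hypothetical moderation action.

    The thresholds are assumptions for illustration only.
    """
    level = DrugsSeverity(severity)  # raises ValueError for values outside 0..3
    if level >= block_at:
        return "block"
    if level >= review_at:
        return "review"
    return "allow"
```

A stricter platform could call action_for_drugs(score, block_at=2, review_at=1) to block recreational discussion and send neutral mentions to review.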

Weapons Model Head (Beta)

This model head identifies text content that mentions the sale, possession, construction, and usage of weapons. Across all severity levels, this includes references to buying, selling, trading, and constructing weapons as well as promotions or general descriptions of weapon usage. This head also flags text that discourages or speaks in a neutral tone about the use of weapons, such as in informational or educational contexts.

For the purposes of this head, non-powder and toy guns such as BB guns, airsoft guns, paintball guns, air guns, and blow guns are not considered to be firearms.

Here’s a breakdown of subject matter captured by each class/severity level:

Level 3: Buying, selling, trading, and constructing firearms and bombs

This class flags content that describes the acquisition, construction, possession, or promotion of explosives as well as the acquisition and construction of firearms (promotion of firearm usage falls under Level 2, possession of firearms falls under Level 1).

Content that is severe enough to be flagged as Level 3 includes:

  • Text describing, promoting, or encouraging the buying, selling, or trading of any of the following:
    • Explosives
    • Firearms and firearm accessories (ammunition, scopes, etc.)
  • Text asking for information regarding the acquisition of weapons of any kind
  • Text describing the making or modification of explosives
  • Text describing the making or modification of firearms
  • Text that clearly describes the possession of explosives
  • Text that explicitly promotes or encourages the use of explosives

Level 2: Buying, selling, and trading of non-explosive weapons

This class captures references to the acquisition, construction, and usage of non-explosive weapons. This includes:

  • Text that describes the buying, selling, or trading of non-explosive weapons, including:
    • Machetes, swords, switchblades, and knives
    • Flamethrowers and pyrotechnic devices
    • Spears and other items
    • Pepper spray and mace
    • Batons and nunchucks
    • Stun guns and other taser-like devices
  • Text that describes the making or modification of the non-explosive weapons listed above (e.g., How to make a machete step by step)
  • Text that explicitly promotes or encourages the use of the non-explosive weapons listed above
  • Text that explicitly promotes or encourages the use of firearms or firearm parts
  • Text that describes the usage of non-explosive weapons, including firearms

Level 1: Neutral mentions of all weapons

Generally, this class flags references to the possession of non-explosive weapons, gun safety, or sport-based gun use (e.g., skeet shooting or hunting). Mentions of any weapon that are neutral, informational, or discouraging in tone fall under this class as well, as do text phrases that include clear exaggeration or humorous language involving weapon use and thus do not constitute any sort of threat.

Cases that are flagged as Level 1 include:

  • Text phrase includes neutral, educational, or informational descriptions of any weapon (e.g., Bomb threat alert: local school shuts down)
  • Text phrase is a 3rd person description of a weapon-related event with no expressed desire for weapon use (e.g., I saw him with a knife)
  • Text phrase mentions weapons as part of an exaggeration or joke (e.g., Can you call in a bomb threat at my job tomorrow I don’t feel like going in?)
  • Text that discourages weapon use (e.g., Please don't shoot your gun at me, We should all work together to stop bombs from being acquired)
  • Descriptions of the possession of non-explosive weapons, including firearms
  • Text that mentions weapons in an unclear context, where words like “bomb” or “shoot” could hold different meanings (e.g., Can you still bomb with your broken finger, You should have pulled the trigger the first time)
  • Mentions of the Second Amendment
  • Descriptions of gun usage in a sport context, such as in skeet shooting or hunting
  • Text promoting and encouraging gun safety
  • Text promoting businesses that legally sell weapons, such as gun stores (e.g., Come in to Benny's Gun Shop!)

Level 0: No mentions of weapons

This class captures all other content with no mention of weapons, or text that uses language associated with weapons in a non-weapon context. For clarity, this includes:

  • Text with no mention of weapons
  • Content that uses words like shoot, bomb, and other weapon-related language in a non-weapon context (e.g., He shot that ball right in the net, I think that tastes so bomb)
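Because the Weapons levels depend on both the type of weapon and the activity described, it can help to lay the combinations out as a lookup table. The sketch below is one reading of the descriptions above, not an official mapping: the category and activity labels are hypothetical, and it does not model the cases that override the table (neutral, discouraging, or joking text falls under Level 1, and requests for information about acquiring any weapon fall under Level 3).

```python
from typing import Optional

# (weapon category, activity) -> severity level, as read from the class
# descriptions above. "other" covers non-explosive, non-firearm weapons such as
# knives, pepper spray, or stun guns. All labels here are illustrative.
WEAPON_LEVELS = {
    ("explosive", "trade"): 3,
    ("explosive", "construct"): 3,
    ("explosive", "possess"): 3,
    ("explosive", "promote_use"): 3,
    ("firearm", "trade"): 3,
    ("firearm", "construct"): 3,
    ("firearm", "promote_use"): 2,
    ("firearm", "use"): 2,
    ("firearm", "possess"): 1,
    ("other", "trade"): 2,
    ("other", "construct"): 2,
    ("other", "promote_use"): 2,
    ("other", "use"): 2,
    ("other", "possess"): 1,
}


def expected_weapon_level(category: str, activity: str) -> Optional[int]:
    """Return the level suggested by the table, or None if the text above
    does not state one for this combination."""
    return WEAPON_LEVELS.get((category, activity))
```

For example, expected_weapon_level("firearm", "possess") gives 1 while expected_weapon_level("explosive", "possess") gives 3, which reflects the distinction drawn in the Level 3 description.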

Child Exploitation Model Head

This class flags content that mentions or explicitly alludes to child sexual exploitation, including child pornography (both soliciting and distributing material), child sexual abuse, and soliciting or descriptions of sexual activity with or by children under the age of 18.

Level 3: Child Exploitation

Specifically, this includes:

  • Messages that solicit, advertise, or attempt to distribute child pornography. This includes the acronym “CP” if used in the context of sending or receiving links, or alongside other sexual language. References to trading links (“STR”, “S2R” = send to receive), buying links, and distribution platforms (e.g., Dropbox, MEGA) are also flagged if context indicates they are likely to refer to child pornography (e.g., also mentions kids, teens, CP, or age ranges below 18)
  • Messages that solicit or describe sexual activity with underaged individuals or child sexual abuse of any kind. This includes sexualized touching, and non-touch abuse like stripping, voyeurism, photographing pornography, etc.
  • Users that are identifiably underaged (e.g., I’m 14 looking for X) soliciting underaged pornography or sex. This also applies to identifiably underaged users offering to distribute child pornography (e.g., nude photos of themselves or friends)
  • Sexual roleplay or fantasies involving underaged children

Level 0: No Child Exploitation

All other content that does not contain the above is flagged as Level 0. This includes:

  • Cases where there is not enough context to conclude that the message is referring to child sexual exploitation or child pornography. For example, Looking to buy Mega links with no other language associating “Mega links” with child pornography is not enough to trigger a Level 3 classification. Similarly, descriptors like “young” are not sufficient without additional context to infer that “young” means under 18.
  • Sexual content or mentions of pornography involving adults. In the absence of specific language about age, children, teens, etc., we presume that sexual content and messages refer to legal, of-age conduct.

Child Safety Model Head (Beta)

The purpose of this head is to provide school administrators the ability to keep children in schools safe from physical violence. It flags text content that contains a direct or indirect threat to children in a school or school-related setting, including threats of self-harm, violence towards other people, violence towards property, possession of weapons, and reports or safety complaints about other people.

Level 3: Threat to Child Safety

This class flags threats of violence that are located at a school, occur during a school-related activity, or have no specified location and could take place at a school. This includes the following types of violence:

  • Content that discusses self harm, mentions the intent to commit self harm, or encourages self harm
  • Content that mentions violence towards others. This includes:
    - Severe physical violence or sexual violence (e.g., i gonna slice you with this blade in the bathroom)
    - Possible violence (e.g., we could kill her after prom)
    - Calls for violence (e.g., i will punch you, kick that fucking bitch)
    - Mentions of violence in self-defense (e.g., if you try to rob me i will slit your throat)
  • Content that mentions destruction of property (e.g., i will burn the building)
  • Claims that someone possesses a weapon
  • Expressions of intent to possess and/or make a weapon
  • Reports of violent threats or acts by other people (e.g., yesterday there was a kid threatening to shoot up the school)
  • Mentions or threats of robbery or kidnapping
  • Any text that mentions school shootings, school violence, or violence against children

Level 0: No Threat to Child Safety

All content that does not contain a school-related threat to child safety is categorized as Level 0. This includes:

  • Violent threats made about politicians, celebrities, professional athletes, or soldiers/armies in global conflicts
  • Content that is purely sexual and does not contain any mention of sexual violence
  • Sexual content where violence is mentioned in a kink context (e.g., I will suck your dick until your dick is bleeding)
  • Text that mentions violence in a purely informational context, such as news reporting, product descriptions, etc.
  • Text that contains a general discussion of violence with no intent to commit a violent act (e.g., Can you spot the sniper?)
  • Discussions of violence that has already happened
  • Commentary on violent situations that contains no intent or will to harm (e.g., He might be stabbed dead on a bar somewhere)
  • Violence mentioned as part of hypothetical situations
  • Negations of violence (e.g., I won’t kms that’s emo)
  • Figures of speech involving violence (e.g., i would kick your ass in softball, I want to die holding her hands ufff ufff, I hope you choke 💗)
  • Mentions of violence against animals
  • Mentions of violence that takes place in video games or plays

Self Harm Model Head (Beta)

This head flags content related to promoting, planning, or carrying out self harm. For the purposes of this head, we define self harm as deliberate injury to oneself, including suicide, nonsuicidal self harm (cutting, burning, etc.), and eating disorders.

Level 3: Mentions of Self Harm

This class flags mentions of committing, inciting, or encouraging self harm and instructions for or coordination of self harm. These mentions must not be ambiguous. Text that lacks enough context to determine whether or not it is referring to self harm or suicide (such as I’m giving up or goodbye) will not be flagged. The following is considered to be Level 3:

  • Someone directly stating that they will commit suicide or self harm
  • Someone telling someone else that they should commit suicide or self harm
  • Promoting suicide or self harm by discussing positive effects or outcomes of it
  • Descriptions of how to commit suicide or self harm
  • The use of synonyms and slang terms for suicide, such as toaster bath, or mentions of ritual suicide such as seppuku and hara-kiri
  • Promotion or glorification of eating disorders

Level 0: No Mentions of Self Harm

All content that does not contain references to self harm or mentions of self harm is categorized as Level 0. Content that describes self harm in a neutral context (i.e., without mentions or promotions of future action) or references self harm in a sarcastic or exaggerated manner will also fall into this category. Overall, Level 0 includes:

  • Text that doesn’t include any reference to suicide or self harm
  • Neutral descriptions of suicide or self harm that do not discuss or incite any future action, such as mentions of past self harm or news headlines involving self harm (e.g., Gasoline bombs thrown at UK immigration center, suspect commits suicide, Last year I made an attempt, thank god for my friends and family…)
  • Denouncements of suicide or self harm
  • Mentions of harm that is not self-inflicted, including violent threats to others. This will instead be flagged by the Violence head.
  • Mentions of suicide or self harm that do not discuss or incite any future action, even if they are insensitive (e.g., you made jokes about slitting wrists when yk damn well i struggle)
  • Joking or casual mentions of self harm to express frustration, embarrassment, or distaste (e.g., The garbage smells so fucking bad i want to off myself, I got a 49 what the actual fuck I though i did good kms)
  • Questions about whether someone wants to commit suicide or self harm (without promoting it)
  • Text that could be referencing suicide or self harm, but lacks the context to be certain (e.g., Goodbye everyone..., Agh. I'm done. I can't do this anymore)

Promotions Model Head

This class captures content that advertises or promotes a product, service, account, or event when the text also includes a suggested action (e.g., reposting, donating) or an attempt to redirect to another application or platform (e.g., links, contact information).

Level 3: Promotion

Types of messages flagged by the promotions class include:

  • Messages that advertise a giveaway or free service (or spam and phishing attempts disguised as a giveaway or free service), e.g., $5 special on custom videos on my site: https://example.com, Feeling depressed? call our free hotline on (000)-000-0000
  • Requests for follows or subscriptions, e.g., Want to learn more about Bitcoin? Follow me on Twitter for more awesome content!, I'm on twitter now! Find me at https://example.xyz
  • Soliciting donations, e.g., Send $5 to my Venmo for a chance to win custom-made stuff! https://t.co/CO6RT
  • Promoting events and activities with a call to action (e.g., to repost, attend, meet, or join). For example, Please Repost! Join us at Decker Plaza for the memorial at 10:00pm

Level 0: No Promotion

All other messages that either do not promote/advertise a product, service, etc. or don't attempt to redirect to another platform or prompt the reader to do something will be classified as Level 0 (no promotion). For clarity, this includes:

  • Mentioning follower or subscriber count on a social platform or website without a request to follow/subscribe, e.g., I have 10K followers on Twitter
  • Expressing gratitude for past follows, subscriptions, or donations, e.g., Thanks so much for subscribing to my channel!
  • Mentioning products, services, accounts, or events in casual conversation, e.g., I don't know dude, if you don’t wanna suck at cooking just watch Gordon Ramsay’s youtube videos
  • Jokes, sarcasm, or ironic statements, e.g., For every 10000$ you donate to us, 1$ will go to the homeless people! if you ask where the rest of the money is going? then you are just a right-wing bigot!
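Promotions is a binary head, so (as described in the General Notes) its score is either 3 or 0, and a 3 signals that promotional content is present rather than how severe it is. A minimal sketch of how a client might interpret such a score is shown below; the function name is hypothetical, and the input is assumed to be an integer you have already pulled out of the API response.

```python
def is_present(score: int) -> bool:
    """Interpret a binary head's score: 3 means the content is present, 0 means it is not.

    Any other value is treated as unexpected input for a binary head.
    """
    if score not in (0, 3):
        raise ValueError(f"binary heads are expected to return 0 or 3, got {score}")
    return score == 3
```

Treating the score as a boolean avoids accidentally applying a severity threshold such as "2 or higher" to a head that never returns intermediate levels.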

Redirection Model Head

This class captures content that includes any type of call to action or encouragement for the reader to go to a specific social media platform, website, or app.

Level 3: Redirection

Types of messages flagged by the redirections class include:

  • Messages that clearly redirect users to a specific platform, e.g., Click here to download the Facebook app, Click here for my Skype, or Message me on WhatsApp
  • Messages that encourage someone to use a specific platform, e.g., Download TikTok it’s so fun, Why don’t you have Snapchat :(, or Omg get BeReal I’ll add you
  • Messages prompting an action to be taken on a specific platform, e.g., Message me on LinkedIn, Pay me on Venmo, or Follow me on Insta
  • Links to other platforms, even when accompanied by neutral text, e.g., Good morning! https://open.spotify.com/track/4cOdK2wGLETKBW3PvgPWqT?si=218e37fdc7eb4c5e
  • Messages that contain the name of a platform and a username for that platform, e.g., Any musky bros/dads wanna chat? Kik: shwimppasta, SC: shewwon237, or IG @boris_0664

Level 0: No Redirection

All other messages that don't attempt to redirect to or encourage the use of another platform. For clarity, this includes:

  • Asking someone to send their social media information, e.g., Hey babes, when you have a chance, send me your Skype ID 👀
  • Redirections to communicate via email, text, and other methods that are not social media platforms, websites, or communication apps, e.g., Call me at 311-114-873636 or You should email him
  • Messages that aim to redirect the reader but do not mention a specific platform, e.g., Follow me cutie! or Please subscribe to my channel
  • Mentions of platforms that have no redirect or call to action, e.g., Yeah I saw that on TikTok, I added her on ig, or I think Telegram is more user-friendly than WhatsApp
  • Messages that discuss past actions on another platform, e.g., My Twitter got banned or Did you find my Instagram?
  • Messages that negatively mention another platform, e.g., I hate Twitter but I use it anyway or I’ll never join Snapchat

Gibberish Model Head

This model head flags keyboard spam and messages that have no comprehensible meaning.

Level 3: Gibberish

This class flags messages where the entire message is incomprehensible. Examples would be strings along the lines of grljwbrg, dfoibhnlfadknbsdfg.

Level 0: No Gibberish

Messages that convey an intelligible meaning through words and phrases are not classified as gibberish, including acronyms, emojis, character replacements, and non-Latin text. For clarity, the following counts as a word or phrase:

  • Any string that contains an English word with four letters or more, even if part of a larger string, e.g., qpwelcome-1po (avoids nearly all user names)
  • Recognizable acronyms and slang (e.g., lmfao, stg)
  • Intentional pattern repetition of short base words (e.g., "hahahahaha" and foreign language variants, "pewpewpewpew", "lololololol"). If the base unit being repeated is not a word (e.g., "asd"), then it will be flagged as gibberish
  • Elongated words or acronyms that use extra letters (e.g., "ooooooooof", "lmaooooooo", "wtffff", "no waaaaay")
  • Emojis and stylized text
  • Non-Latin text (e.g., Arabic script, Japanese characters)
  • Brand names
  • Text that contains numbers only
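To make the rules above more concrete, here is a toy heuristic that mirrors several of them. It is only a sketch for reasoning about the examples on this page and is not how the Gibberish head is implemented: the vocabularies are tiny and hypothetical, brand names are not handled, and any non-ASCII character is treated as a stand-in for emojis, stylized text, and non-Latin scripts.

```python
import re

# Tiny, hypothetical vocabularies; a realistic check would need far larger lists.
WORDS_4PLUS = {"welcome", "hello", "thanks"}    # English words with four or more letters
SLANG = {"lmfao", "lmao", "stg", "wtf", "oof"}  # recognizable acronyms and slang
BASE_UNITS = {"ha", "pew"}                      # short base words that are often repeated


def _collapse(s: str) -> str:
    """Collapse runs of a repeated character, e.g. 'lmaoooo' -> 'lmao'."""
    return re.sub(r"(.)\1+", r"\1", s)


# Compare collapsed forms so that 'ooooof' can match 'oof'.
_COLLAPSED_SLANG = {_collapse(w) for w in SLANG}


def looks_intelligible(message: str) -> bool:
    """Return True if the message matches one of the 'word or phrase' rules."""
    text = message.strip().lower()
    if any(ord(ch) > 127 for ch in text):          # emojis, stylized or non-Latin text
        return True
    if re.fullmatch(r"[\d\s]+", text):             # numbers only
        return True
    if any(word in text for word in WORDS_4PLUS):  # word inside a larger string
        return True
    for token in re.findall(r"[a-z]+", text):
        if token in SLANG or _collapse(token) in _COLLAPSED_SLANG:
            return True                            # slang, possibly elongated
        if any(re.fullmatch(f"(?:{unit}){{2,}}", token) for unit in BASE_UNITS):
            return True                            # repetition of a known base word
    return False
```

With these sample vocabularies, strings such as qpwelcome-1po, hahahahaha, and lmaooooooo pass the check, while grljwbrg and asdasdasd do not, which matches the examples given for this head.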

Phone Number Model Head

This model head detects phone numbers in message strings, including international formats.

Level 3: Phone Number

This class flags phone numbers, using context and syntax to differentiate them from other strings of numbers.

Level 0: No Phone Number

All other messages that do not contain a number that is indicated to be or recognizable as a phone number are classified as Level 0: no phone number.

Spam Model Head

Level 3: Spam

This class flags text that contains a link intended to redirect other users to a different site or platform. This includes links and link shorteners, email addresses, and phone numbers. Links to popular and reputable websites, such as news organizations and publications, YouTube, large social platforms, Wikipedia, etc. are not flagged as spam.

Level 0: No Spam

This class is flagged when the text does not contain such a link or includes a link to a popular and reputable website. Other types of messages typically associated with spam content may be flagged by the Promotions model (e.g., advertising a service, platform, or account) or the Gibberish model (unintelligible messages).
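The distinction between flagged links and links to reputable sites can be pictured as an allowlist check. The sketch below is an illustration under stated assumptions, not the model's actual logic: the allowlist contents, the URL pattern, and the function names are all hypothetical, and email addresses and phone numbers (which this class also covers) are left out for brevity.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist; the actual set of reputable sites is not published here.
REPUTABLE_DOMAINS = {"youtube.com", "wikipedia.org"}

# Deliberately simple pattern: only matches links that start with http:// or https://.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def _is_reputable(host: str) -> bool:
    """True if host is an allowlisted domain or one of its subdomains."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in REPUTABLE_DOMAINS)


def contains_spam_link(text: str) -> bool:
    """Return True if the text contains at least one link outside the allowlist."""
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname or ""
        if not _is_reputable(host):
            return True
    return False
```

Matching on the exact domain or a dot-prefixed suffix keeps lookalike hosts such as notyoutube.com from slipping through the allowlist.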