Customize Thresholds for Model Classifications
Different platforms have different content policies and risk sensitivities. Thresholds tell the Dashboard which categories to assign to a post, based on the conditions you set on this page. The category classifications are also used when setting up User and Post rules.
Moderation Category Thresholds
Default Conditions:
There are six visual and text moderation categories available to configure. Each category has suggested default conditions, which you can change at any time to fit your specific needs. Changed thresholds apply to all new posts sent to the Dashboard; they are not applied retroactively to previous posts.
Clicking "Reset to Default" removes all previously added conditions and reverts to the original default conditions.
Difference between Visual and Text model scores
Text and visual categories are configured separately on the Dashboard because the two model types return different kinds of responses.
Visual classification models classify an entire image into different categories by assigning each class a confidence score between 0 and 1. You should consider changing the default conditions if your platform is particularly sensitive to either false positive or false negative classifications.
Hive's text model results are ordered by severity ranging from level 3 (most severe) to level 0 (benign). You should decide which severity level suits your specific community guidelines based on the Multi-level Classification description. By default, only text that returns the highest severity (3) in each class will trigger moderation.
Customize Visual Thresholds:
You can add new conditions for each visual category by clicking "Create New". This will open up a modal where you can select the model class and associated threshold value. You can also click the "+" icon to add additional model classes. All model classes in this dialog box are AND'd together (all thresholds must match for the condition to trigger).
Once you click save, the condition will be added under the category. All condition rows below are OR'd (at least one of the conditions needs to match for the post to be categorized).
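The AND/OR behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Dashboard's implementation: the function names, the threshold values, and the use of "greater than or equal to" as the comparison are all assumptions.

```python
from typing import Dict, List

# One condition row: model class name -> threshold value.
# Every class inside a single condition must match (AND).
Condition = Dict[str, float]


def condition_matches(condition: Condition, scores: Dict[str, float]) -> bool:
    """Return True only if every class in the condition meets its threshold.

    Assumption: a class matches when its score is >= the threshold.
    An empty condition never matches.
    """
    return bool(condition) and all(
        scores.get(model_class, 0.0) >= threshold
        for model_class, threshold in condition.items()
    )


def category_matches(conditions: List[Condition], scores: Dict[str, float]) -> bool:
    """Return True if at least one condition row matches (OR)."""
    return any(condition_matches(condition, scores) for condition in conditions)


# Hypothetical condition rows for one visual category.
conditions = [
    {"general_suggestive": 0.9},
    {"yes_female_underwear": 0.8, "general_suggestive": 0.5},
]
scores = {"general_suggestive": 0.6, "yes_female_underwear": 0.85}

# The first row fails, the second row matches on both classes,
# so the post is assigned to the category.
is_categorized = category_matches(conditions, scores)
```

Adding a class to an existing condition makes that row stricter, while adding a new row makes the category as a whole more permissive.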
Customize Text Thresholds:
You can adjust text thresholds by using the slider to match what you consider safe and unsafe content. You can also use the severity descriptions below the slider to guide your decisions. Thresholds will autosave every time you move the slider.
Edge cases:
- Spam and promotion text category scores can only be 0 or 3 (spam/promotion detected). The sliders for these two categories only allow 0 or 3 as configurations.
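As a rough sketch of how a text severity threshold behaves, the snippet below treats a post as classified when the model's severity level is at or above the configured slider value. The category identifiers, function names, and the "at or above" comparison are assumptions for illustration; only the 0 to 3 scale, the default of 3, and the 0-or-3 restriction for spam and promotion come from the text above.

```python
# Categories whose scores are binary (0 or 3). The identifiers are illustrative.
BINARY_CATEGORIES = {"spam", "promotion"}


def validate_threshold(category: str, threshold: int) -> int:
    """Reject slider values that the category does not support."""
    allowed = (0, 3) if category in BINARY_CATEGORIES else (0, 1, 2, 3)
    if threshold not in allowed:
        raise ValueError(f"{category!r} only accepts thresholds {allowed}")
    return threshold


def text_triggers(category: str, severity: int, threshold: int = 3) -> bool:
    """Return True if the severity reaches the configured threshold.

    Assumption: a post is classified when severity >= threshold.
    The default of 3 mirrors the default of moderating only the highest severity.
    """
    return severity >= validate_threshold(category, threshold)


# With the default threshold, only severity 3 triggers moderation.
default_hit = text_triggers("hate", 3)
default_miss = text_triggers("hate", 2)

# Lowering the threshold to 2 also catches severity 2 text.
lowered_hit = text_triggers("hate", 2, threshold=2)
```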
Setting Thresholds for Manual Review
It may be beneficial to have your human moderation team review cases where model classifications are uncertain. The Dashboard allows you to configure score ranges in which posts are automatically routed to the review feed for human confirmation.
Here’s an example implementation for a visual class:
- Automatically delete post when yes_female_underwear > 0.9 (score above 0.9)
- Manual review when 0.7 <= yes_female_underwear <= 0.9 (score between 0.7 and 0.9)
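The example above amounts to a three-way decision on a single class score: delete above 0.9, send to review from 0.7 through 0.9, and otherwise take no action. A minimal sketch follows, assuming hypothetical function and action names; in practice the Dashboard applies this through configured thresholds and rules rather than custom code.

```python
def route_visual_post(
    score: float,
    review_low: float = 0.7,
    delete_above: float = 0.9,
) -> str:
    """Map a single class score (e.g. yes_female_underwear) to an action.

    The action names are illustrative labels, not Dashboard values.
    """
    if score > delete_above:
        return "delete"          # score above 0.9
    if score >= review_low:
        return "manual_review"   # score between 0.7 and 0.9, inclusive
    return "no_action"           # score below 0.7


actions = [route_visual_post(s) for s in (0.95, 0.9, 0.7, 0.5)]
```

Because the delete condition is strictly greater than 0.9, a score of exactly 0.9 falls into the review range, so the two ranges do not overlap.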
To include manual moderation as part of your workflow, you will first need to configure both Visual and Text Review Thresholds.
Creating Review Thresholds:
Default review thresholds are created for both visual and text. A visual post will be classified as Needs Review if any of the default classes scores between 0.7 and 0.9 (e.g., 0.7 <= general_suggestive <= 0.9). A text post will be classified as Needs Review if the text class score is either 1 or 2.
You can add new conditions under the Visual/Text tab by clicking "Create New". This will open up a pop-up where you can select the model class and associated threshold value, similar to category thresholds.
For text thresholds, you will have to use the slider to configure what scores need human review.
Sending posts to the Review Feed
Once review thresholds are configured, you can create rules that send posts scoring within the threshold range to the review feed for manual moderation (refer to the Rules page).