Hive Vision Language Model (VLM)
How to integrate with Hive's latest Vision Language Model.
About
🔑 model key: hive/vision-language-model
The Hive Vision Language Model is trained on Hive’s proprietary data, delivering leading performance with the speed and flexibility required for production vision tasks.
- Best-in-class moderation – Flags sexual content, hate, drugs, and other moderation classes—even in nuanced edge cases across text and images.
- Deep multimodal comprehension – Detects fine-grained objects, reads text, and understands spatial and semantic relationships to provide a rich understanding of input images.
- All-in-one task engine – Generate captions, answer visual questions, run OCR, or analyze image characteristics—all through a single endpoint.
How to Get Started
Authentication is required to use these models. You’ll need an API Key, which can be created in the left sidebar.
Follow these steps to generate your key:
- Click ‘API Keys’ in the sidebar.
- Click ‘+’ to create a new key scoped to your organization. The same key can be used with any "Playground available" model.
⚠️ Important: Keep your API Key secure. Do not expose it in client-side environments like browsers or mobile apps.
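One way to follow this guidance is to keep the key in a server-side environment variable and read it at runtime rather than hard-coding it. A minimal sketch (the variable name HIVE_API_KEY is just an example):

import os

# Read the API Key from the server-side environment instead of hard-coding it
# or shipping it to browsers/mobile apps. "HIVE_API_KEY" is an example name.
HIVE_API_KEY = os.environ["HIVE_API_KEY"]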

Querying Hive Vision Language Model
Hive offers an OpenAI-compatible REST API for querying LLMs and multimodal LLMs. Here are the ways to call it:
- Using the OpenAI SDK
- Directly invoking the REST API
Using this API, the model will successively generate new tokens until either the maximum number of output tokens has been reached or the model's end-of-sequence (EOS) token has been generated.
Note: Some fields, such as top_k, are supported via the REST API but are not supported by the OpenAI SDK.
For help achieving your specific use case, feel free to reach out to us!
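As an illustration, REST-only fields such as top_k can be passed directly in the request body. The sketch below uses Python's requests library against the chat completions endpoint shown later on this page; the specific parameter values are arbitrary:

import requests

# Hedged sketch: top_k is only available on the REST API, so it is set directly
# in the JSON request body rather than through the OpenAI SDK.
response = requests.post(
    "https://api.thehive.ai/api/v3/chat/completions",
    headers={
        "authorization": "Bearer <SECRET_KEY>",
        "Content-Type": "application/json",
    },
    json={
        "model": "hive/vision-language-model",
        "max_tokens": 50,
        "top_k": 1,  # REST-only sampling field
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"
                        },
                    },
                ],
            }
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])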
Using the OpenAI SDK
Python:
from enum import Enum
from pydantic import BaseModel
from openai import OpenAI

# ── Client setup ───────────────────────────────────────────────────────────
client = OpenAI(
    base_url="https://api.thehive.ai/api/v3/",  # Hive's endpoint
    api_key="<YOUR-SECRET-KEY>",                # ← replace with your key
)

# ── 1 · Define enum + response schema ──────────────────────────────────────
class SubjectLabel(str, Enum):
    person = "person"
    animal = "animal"
    vehicle = "vehicle"
    food = "food"
    scenery = "scenery"

class ClassificationOutput(BaseModel):
    subject: SubjectLabel

# ── 2 · Call the VLM and parse the JSON directly into our schema ───────────
completion = client.beta.chat.completions.parse(
    model="hive/vision-language-model",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Classify the **main subject** of this image as one of: "
                        "person, animal, vehicle, food, or scenery. "
                        "Return JSON only."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": (
                            "https://d24edro6ichpbm.thehive.ai/"
                            "example-images/vlm-example-image.jpeg"
                        )
                    },
                },
            ],
        }
    ],
    response_format=ClassificationOutput,  # 👈 schema enforced by Hive
    max_tokens=50,
)

# ── 3 · Typed result ready for use ─────────────────────────────────────────
result: ClassificationOutput = completion.choices[0].message.parsed
print(result.subject)  # e.g. → SubjectLabel.scenery
TypeScript:
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.thehive.ai/api/v3/", // Hive endpoint
  apiKey: "<YOUR-SECRET-KEY>",               // ← replace with your key
});

/* ── 1 · JSON-Schema that tells Hive what we expect back ─────────────── */
const classificationSchema = {
  type: "object",
  properties: {
    subject: {
      type: "string",
      enum: ["person", "animal", "vehicle", "food", "scenery"],
    },
  },
  required: ["subject"],
  additionalProperties: false,
};

/* ── 2 · Call the VLM & let Hive return JSON that fits the schema ─────── */
const completion = await openai.chat.completions.create({
  model: "hive/vision-language-model",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text:
            "Classify the **main subject** of this image as one of: " +
            "person, animal, vehicle, food, or scenery. " +
            "Return JSON only.",
        },
        {
          type: "image_url",
          image_url: {
            url:
              "https://d24edro6ichpbm.thehive.ai/" +
              "example-images/vlm-example-image.jpeg",
          },
        },
      ],
    },
  ],
  response_format: {
    type: "json_schema",
    json_schema: { schema: classificationSchema, strict: true },
  },
  max_tokens: 50,
});

/* ── 3 · Use the structured result ─────────────────────────────────────── */
const { subject } = JSON.parse(completion.choices[0].message.content);
// -> e.g. "scenery"
console.log("Detected subject:", subject);
Directly invoking the REST API
The Hive Vision Language Model can be called via the REST API, and media can be sent in as:
- an image URL
- a Base64-encoded data URL
cURL examples (image URL first, then Base64):
curl --location --request POST 'https://api.thehive.ai/api/v3/chat/completions' \
--header 'authorization: Bearer <SECRET_KEY>' \
--header 'Content-Type: application/json' \
--data-binary $'{
  "model": "hive/vision-language-model",
  "max_tokens": 50,
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "schema": {
        "type": "object",
        "properties": {
          "subject": {
            "type": "string",
            "enum": ["person", "animal", "vehicle", "food", "scenery"]
          }
        },
        "required": ["subject"],
        "additionalProperties": false
      },
      "strict": true
    }
  },
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Classify the main subject of this image as one of: person, animal, vehicle, food, or scenery. Return JSON only."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"
          }
        }
      ]
    }
  ]
}'
curl --location --request POST 'https://api.thehive.ai/api/v3/chat/completions' \
--header 'authorization: Bearer <SECRET_KEY>' \
--header 'Content-Type: application/json' \
--data-binary $'{
  "model": "hive/vision-language-model",
  "max_tokens": 50,
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "schema": {
        "type": "object",
        "properties": {
          "subject": {
            "type": "string",
            "enum": ["person", "animal", "vehicle", "food", "scenery"]
          }
        },
        "required": ["subject"],
        "additionalProperties": false
      },
      "strict": true
    }
  },
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Classify the main subject of this image as one of: person, animal, vehicle, food, or scenery. Return JSON only."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,<BASE64_DATA>"
          }
        }
      ]
    }
  ]
}'
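To send a local file instead of a hosted image, Base64-encode it and wrap it in a data URL yourself. A minimal Python sketch (the file path photo.jpeg is just a placeholder; adjust the media type to match your file):

import base64

# Build the data URL used in image_url.url for the Base64 example above.
with open("photo.jpeg", "rb") as f:
    b64_data = base64.b64encode(f.read()).decode("utf-8")

image_url = f"data:image/jpeg;base64,{b64_data}"
# Use this string wherever the example above shows "data:image/jpeg;base64,<BASE64_DATA>".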
After making a request, you’ll receive a JSON response with the model's output text. Here’s a sample output:
{
  "id": "1234567890-abcdefg",
  "object": "chat.completion",
  "model": "hive/vision-language-model",
  "created": 1749840139221,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{ \"subject\": \"scenery\" }"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1818,
    "completion_tokens": 11,
    "total_tokens": 1829
  }
}
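When a response_format is set, the content field is itself a JSON string, so you typically parse it before use. A small Python sketch, using a dict that mirrors the sample response above:

import json

# `response` mirrors the sample response body shown above.
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "{ \"subject\": \"scenery\" }"}}
    ],
    "usage": {"prompt_tokens": 1818, "completion_tokens": 11, "total_tokens": 1829},
}

# content is a JSON string when response_format is used, so parse it explicitly.
answer = json.loads(response["choices"][0]["message"]["content"])
print(answer["subject"])                  # -> scenery
print(response["usage"]["total_tokens"])  # -> 1829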
Parameters
Below are the definitions of the relevant input and output fields. Some fields have default values that are applied if you do not set them.
Input
Field | Type | Definition |
---|---|---|
messages | array of objects | Required. A structured array containing the conversation history. Each object includes a role and content. |
model | string | Required. The name of the model to call. |
role | string | The role of the participant in the conversation. Must be system, user, or assistant. |
content | string OR array of objects | Your content string. If array, each object must have a type and corresponding data, as shown in the examples above. |
text | string | Referenced inside content arrays, containing the text message to be sent. |
image_url | object | Contains the image URL or Base64-encoded string, inside the subfield url. |
response_format | object | Constrains the model response to follow the JSON Schema you define. Note: this setting increases latency significantly. |
max_tokens | int | Limits the number of tokens in the output. Range: [1 to 2048] Default: 512 |
temperature | float | Controls randomness in the output. Lower values make output more deterministic. A value of 0 means that VLM outputs are deterministic. Range: [0 to 1] Default: 0 |
top_p | float | Nucleus sampling parameter to limit the probability space of token selection. Range: [0 to 1] Default: 0.1 |
top_k | int | Limits token sampling to the top K most probable tokens. Default: 1 |
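For illustration, the SDK-supported sampling fields above can be set alongside the required fields. A hedged sketch using the OpenAI SDK (the parameter values are arbitrary, chosen only to show where the fields go):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.thehive.ai/api/v3/",
    api_key="<YOUR-SECRET-KEY>",
)

# Override the default sampling behavior; defaults and ranges are listed in the table above.
completion = client.chat.completions.create(
    model="hive/vision-language-model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"
                    },
                },
            ],
        }
    ],
    max_tokens=256,   # default 512, range [1, 2048]
    temperature=0.2,  # default 0 (deterministic)
    top_p=0.5,        # default 0.1
)
print(completion.choices[0].message.content)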
Output
Field | Type | Definition |
---|---|---|
id | string | The Task ID of the submitted task. |
model | string | The name of the model used. |
created | int | The timestamp (in epoch milliseconds) when the task was created. |
choices | array of objects | Contains the model’s responses. Each object includes the index, message, and finish_reason. |
usage | object | Contains input/output token usage information for the request and response. |
Response Formats
Many moderation and general vision use cases need machine-readable JSON answers rather than free-form text.
response_format lets you tell the VLM exactly what JSON shape to return by embedding a JSON Schema in your request. The model then constrains its output to that schema, so you can parse the response with confidence.
Note that adding a response_format will significantly increase latency.
"response_format": {
"type": "json_schema",
"json_schema": {
"schema": {
"type": "object",
"properties": {
"subject": {
"type": "string",
"enum": ["person", "animal", "vehicle", "food", "scenery"]
}
},
"required": ["subject"],
"additionalProperties": false
},
"strict": true
}
},
Typical response:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "{\"subject\":\"scenery\"}"
      }
    }
  ],
  …
}
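If you would rather not hand-write the schema, you can generate one from a Pydantic model when calling the REST API directly (the SDK's parse helper shown earlier does this for you). A sketch assuming Pydantic v2; verify the generated schema matches what you expect before relying on it:

from typing import Literal

from pydantic import BaseModel, ConfigDict

class ClassificationOutput(BaseModel):
    # extra="forbid" makes the generated schema include "additionalProperties": false
    model_config = ConfigDict(extra="forbid")
    subject: Literal["person", "animal", "vehicle", "food", "scenery"]

# Build the response_format block from the generated JSON Schema.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "schema": ClassificationOutput.model_json_schema(),
        "strict": True,
    },
}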
Contact us if you have any questions about writing JSON response formats.
Common Errors
The VLM has a default rate limit of 1 request per second. You may see the error below if you submit requests faster than the rate limit allows.
To request a higher rate limit, please contact us!
{
  "status_code": 429,
  "message": "Too Many Requests"
}
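A simple client-side mitigation is to retry with backoff when a 429 is returned. A minimal Python sketch using the requests library (the backoff schedule is arbitrary):

import time
import requests

def post_with_retry(payload: dict, secret_key: str, max_attempts: int = 5):
    """POST to the chat completions endpoint, backing off when rate limited (429)."""
    url = "https://api.thehive.ai/api/v3/chat/completions"
    headers = {
        "authorization": f"Bearer {secret_key}",
        "Content-Type": "application/json",
    }
    for attempt in range(max_attempts):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 429:
            return resp
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    return resp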
A positive Organization Credit balance is required to continue using Hive Models. Once you run out of credits, requests will fail with the following error.
{
  "status_code": 405,
  "message": "Your Organization is currently paused. Please check your account balance, our terms and conditions, or contact [email protected] for more information."
}
Pricing
Hive VLM is priced per input and output token. For image tokenization, the following logic is used:
- Start with the image’s aspect ratio and area.
  - Example: a 1,024 × 1,024 picture has ratio = 1.0 and area ≈ 1M px.
- Consider every logical way to slice the picture into tiles.
  - A “tile” is a square crop that the model turns into a fixed 256-token chunk.
  - We allow from 1 tile up to 6 tiles in total, so the grids considered are 1 × 1, 1 × 2, 2 × 1, 2 × 2, 1 × 3, 3 × 1, and so on.
- Rank those grids by two rules (in order):
  - Match the image's original aspect ratio. For example, a 3 × 2 grid (ratio = 1.5) fits an 800 × 600 image (ratio ≈ 1.33) better than a 2 × 2 grid (ratio = 1.0).
  - If two grids tie on ratio, pick the one that uses more tiles.
- Count how many tiles the chosen grid has, and multiply by 256 tokens per tile.
- Finally, add 2 more tokens.
A Python sketch of this selection logic follows the table below.
Image resolution | Chosen grid | Tiles | Tokens |
---|---|---|---|
1024x1024 | 2x2 | 4 | (4x256) + 2 = 1026 |
800x600 | 3x2 | 6 | (6x256) + 2 = 1538 |
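A rough Python sketch of the tile-selection logic described above. It mirrors the ranking rules as written and reproduces the two rows of the table; the exact resizing and cropping details are not covered here:

def image_tokens(width: int, height: int, max_tiles: int = 6) -> int:
    """Estimate image tokens: choose the grid (up to max_tiles tiles) whose aspect
    ratio best matches the image, prefer more tiles on ties, then charge
    256 tokens per tile plus 2 extra tokens."""
    image_ratio = width / height
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Rank by closeness of the grid ratio to the image ratio, then by tile count.
    cols, rows = min(
        grids,
        key=lambda g: (abs(g[0] / g[1] - image_ratio), -(g[0] * g[1])),
    )
    return cols * rows * 256 + 2

print(image_tokens(1024, 1024))  # 1026 (2x2 grid)
print(image_tokens(800, 600))    # 1538 (3x2 grid)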