Hive Vision Language Model (VLM)

How to integrate with Hive's latest Vision Language Model.

About

🔑 model key: hive/vision-language-model

The Hive Vision Language Model is trained on Hive’s proprietary data, delivering leading performance with the speed and flexibility required for production vision tasks.

  • Best-in-class moderation – Flags sexual content, hate, drugs, and other moderation classes—even in nuanced edge cases across text and images.
  • Deep multimodal comprehension – Detects fine-grained objects, reads text, and understands spatial and semantic relationships to provide a rich understanding of input images.
  • All-in-one task engine – Generate captions, answer visual questions, run OCR, or analyze image characteristics—all through a single endpoint.

How to Get Started

Authentication is required to use these models. You’ll need an API Key, which can be created in the left sidebar.

Follow these steps to generate your key:

  • Click ‘Playground API Keys’ in the sidebar.
  • Click ‘Create API Key’ to create a new key scoped to your organization. The same key can be used with any "Playground available" model.

⚠️ Important: Keep your API Key safe!


Once you've created an API Key, you can submit API requests using the **Secret Key**. Please keep your Secret Key safe.

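To keep the Secret Key out of your source code, one common pattern (a sketch, not a requirement of the API) is to read it from an environment variable; the variable name HIVE_API_KEY below is just an illustrative choice:

import os
from openai import OpenAI

# Assumes you have exported the key beforehand, e.g. `export HIVE_API_KEY=...`;
# the variable name is arbitrary and not something the API itself requires.
client = OpenAI(
    base_url="https://api.thehive.ai/api/v3/",
    api_key=os.environ["HIVE_API_KEY"],
)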

Querying Hive Vision Language Model

Hive offers an OpenAI-compatible REST API for querying LLMs and multimodal LLMs. You can call it in two ways:

  • Using the OpenAI SDK
  • Directly invoking the REST API

When you call this API, the model generates tokens until either the maximum number of output tokens has been reached or the model’s end-of-sequence (EOS) token has been generated.
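For example, you can check finish_reason on the response to see why generation stopped. In OpenAI-compatible APIs this is typically "stop" when the EOS token was generated and "length" when the max_tokens cap was hit; treat those exact values as an assumption based on that convention rather than something guaranteed here.

# `completion` is the response from client.chat.completions.create(...) below.
reason = completion.choices[0].finish_reason
if reason == "length":
    # Assumed OpenAI-style value: the output hit the max_tokens cap before EOS.
    print("Output was truncated; consider raising max_tokens.")
else:
    print(f"Generation finished with reason: {reason}")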

Note: Some fields, such as top_k, are supported via the REST API but are not supported by the OpenAI SDK.
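If you need a REST-only field such as top_k, you can call the endpoint directly. The sketch below uses the Python requests library and mirrors the request body shown in the cURL examples later on this page:

import requests

# Direct REST call so that REST-only fields such as top_k can be included.
response = requests.post(
    "https://api.thehive.ai/api/v3/chat/completions",
    headers={"Authorization": "Bearer <SECRET_KEY>", "Content-Type": "application/json"},
    json={
        "model": "hive/vision-language-model",
        "max_tokens": 50,
        "top_k": 1,  # REST-only sampling parameter (see the Parameter Reference below)
        "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello!"}]}],
    },
)
print(response.json()["choices"][0]["message"]["content"])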

Currently, the VLM only supports synchronous task submission. We plan to add asynchronous support in the future.

Performance Tips

🕑 Minimizing Latency

Hive VLM tokenizes each image into square "patches."

Images small enough to fit into a single patch (square images ≤ 633x633) cost only 256 input tokens.

Larger images are split into up to 6 patches (up to 1,536 patch tokens), which increases latency and price.

For the fastest response, resize or center-crop your images to ≤ 633x633 before calling the API.
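One way to do this is sketched below, assuming the Pillow library (not a dependency of the API itself): center-crop to a square, then downscale so the result fits a single patch.

from PIL import Image  # assumes Pillow is installed

def shrink_for_single_patch(path: str, out_path: str, max_side: int = 633) -> None:
    """Center-crop to a square and downscale so the image fits one 633x633 patch."""
    img = Image.open(path)
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    if side > max_side:
        img = img.resize((max_side, max_side))
    img.save(out_path)

shrink_for_single_patch("input.jpg", "resized.jpg")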

🔠 Maximizing OCR Accuracy

For OCR tasks, it is recommended to keep the number of image patches higher.

The more patches an image is split into (up to 6), the more accurately the VLM performs on text-on-image tasks.

👍 For help achieving your specific use case, feel free to reach out to us!

Prompt + Image URL (quick start!)

Python

from openai import OpenAI
client = OpenAI(base_url="https://api.thehive.ai/api/v3/", api_key="<SECRET_KEY>")

completion = client.chat.completions.create(
    model="hive/vision-language-model",
    messages=[
        {"role":"user","content":[
            {"type":"text","text":"Describe the scene in one sentence."},
            {"type":"image_url","image_url":{"url":"https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"}}
        ]}
    ],
    max_tokens=50,
)
print(completion.choices[0].message.content)
Node.js

import OpenAI from "openai";
const openai = new OpenAI({ baseURL:"https://api.thehive.ai/api/v3/", apiKey:"<SECRET_KEY>" });

const completion = await openai.chat.completions.create({
  model: "hive/vision-language-model",
  messages: [
    { role: "user", content: [
        { type: "text", text: "Describe the scene in one sentence." },
        { type: "image_url", image_url: { url: "https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg" } }
    ] }
  ],
  max_tokens: 50,
});
console.log(completion.choices[0].message.content);
cURL

curl -X POST https://api.thehive.ai/api/v3/chat/completions \
  -H 'Authorization: Bearer <SECRET_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"hive/vision-language-model",
    "max_tokens":50,
    "messages":[
      {"role":"user","content":[
        {"type":"text","text":"Describe the scene in one sentence."},
        {"type":"image_url","image_url":{"url":"https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"}}
      ]}
    ]
  }'

Prompt + Image Base64

Just swap the value inside "url" to be your encoded base64 string; everything else is identical.

{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,<BASE64_DATA>"}}

Prompt + Image URL with JSON Schema

Use this when you need the model to return structured JSON that conforms to a schema you predefine.

Python

from enum import Enum
from pydantic import BaseModel
from openai import OpenAI

class Subject(str, Enum):
    person = "person"
    animal = "animal"
    vehicle = "vehicle"
    food = "food"
    scenery = "scenery"

class ClassificationOutput(BaseModel):
    subject: Subject

client = OpenAI(base_url="https://api.thehive.ai/api/v3/", api_key="<SECRET_KEY>")
completion = client.beta.chat.completions.parse(
    model="hive/vision-language-model",
    messages=[
        {"role":"user","content":[
            {"type":"text","text":"Classify the main subject (person, animal, vehicle, food, scenery). Return JSON only."},
            {"type":"image_url","image_url":{"url":"https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"}}
        ]}
    ],
    response_format=ClassificationOutput,
    max_tokens=50,
)
print(completion.choices[0].message.parsed.subject)
Node.js

import OpenAI from "openai";
const openai = new OpenAI({ baseURL:"https://api.thehive.ai/api/v3/", apiKey:"<SECRET_KEY>" });

const schema = {
  type:"object",
  properties:{ subject:{ type:"string", enum:["person","animal","vehicle","food","scenery"] } },
  required:["subject"], additionalProperties:false,
};

const completion = await openai.chat.completions.create({
  model:"hive/vision-language-model",
  messages:[{ role:"user", content:[
    { type:"text", text:"Classify the main subject (person, animal, vehicle, food, scenery). Return JSON only." },
    { type:"image_url", image_url:{ url:"https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg" } }
  ]}],
  response_format:{ type:"json_schema", json_schema:{ schema, strict:true } },
  max_tokens:50,
});

console.log(JSON.parse(completion.choices[0].message.content).subject);
cURL

curl -X POST https://api.thehive.ai/api/v3/chat/completions \
  -H 'Authorization: Bearer <SECRET_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"hive/vision-language-model",
    "max_tokens":50,
    "response_format":{
      "type":"json_schema",
      "json_schema":{
        "schema":{
          "type":"object",
          "properties":{"subject":{"type":"string","enum":["person","animal","vehicle","food","scenery"]}},
          "required":["subject"],
          "additionalProperties":false
        },
        "strict":true
      }
    },
    "messages":[
      {"role":"user","content":[
        {"type":"text","text":"Classify the main subject (person, animal, vehicle, food, scenery). Return JSON only."},
        {"type":"image_url","image_url":{"url":"https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"}}
      ]}
    ]
  }'

Response Example

After making a request, you’ll receive a JSON response with the model's output text. Here’s a sample output:

{
  "id": "1234567890-abcdefg",
  "object": "chat.completion",
  "model": "hive/vision-language-model",
  "created": 1749840139221,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{ \"subject\": \"scenery\" }"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1818,
    "completion_tokens": 11,
    "total_tokens": 1829
  }
}

Note: If you provide a JSON schema, the API still returns a string, which can be parsed into JSON.
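Continuing from the schema examples above, parsing that string is a one-liner:

import json

# message.content is a JSON-formatted string; parse it to get a dict.
result = json.loads(completion.choices[0].message.content)
print(result["subject"])  # e.g. "scenery"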

Parameter Reference

Input

  • messages (array of objects) – Conversation history. For VLM calls, each user message typically contains a text prompt and one image_url object (url = HTTP URL or base64 data URI).
  • model (string) – Required. The name of the model to call.
  • role (string) – The role of the participant in the conversation. Must be system, user, or assistant.
  • content (string OR array of objects) – Your content string. If an array, each object must have a type and corresponding data, as shown in the examples above.
  • text (string) – Referenced inside content arrays; contains the text message to be sent.
  • image_url (object) – Contains the image URL or Base64-encoded string, inside the subfield url.
  • response_format (object) – Constrains the model response to follow the JSON Schema you define. Useful if you'd like Hive VLM to follow a specified output structure defined in JSON. Note: this setting may increase latency.
  • max_tokens (int) – Output token cap. Default 512 (range 1–2048).
  • temperature (float) – Controls randomness (0 = deterministic). Default 0 (range 0–1).
  • top_p (float) – Nucleus sampling cutoff. Default 0.1 (range 0–1).
  • top_k (int) – Limits sampling to the top K tokens. Default 1 (range 0–1). Supported via the REST API only, not the OpenAI SDK.
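As an illustrative sketch of how these fields map onto the OpenAI SDK call (the values below are arbitrary examples, not recommendations):

# `client` is the OpenAI client configured for Hive in the quick start above.
completion = client.chat.completions.create(
    model="hive/vision-language-model",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe the scene in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://d24edro6ichpbm.thehive.ai/example-images/vlm-example-image.jpeg"}},
    ]}],
    max_tokens=256,   # output token cap (default 512)
    temperature=0.2,  # small amount of randomness (default 0)
    top_p=0.1,        # nucleus sampling cutoff (default 0.1)
)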

Output

  • id (string) – The Task ID of the submitted task.
  • model (string) – The name of the model used.
  • created (int) – The timestamp (in epoch milliseconds) when the task was created.
  • choices (array of objects) – Contains the model’s responses. Each object includes the index, message, and finish_reason.
  • usage (object) – Token counts for the prompt, completion, and total.

These are the only fields you need for most integrations; the response format is identical across the URL, Base64, and JSON-schema variants.

Common Errors

The VLM has a default starting rate limit of 1 request per second. You may see the error below if you submit requests faster than the rate limit.

To request a higher rate limit please contact us!

{
  "status_code": 429,
  "message": "Too Many Requests"
}
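One common way to stay under the limit is a simple retry with exponential backoff; the helper below is a sketch using the Python requests library, not an official client feature:

import time
import requests

def post_with_retry(url, headers, payload, max_attempts=5):
    """Retry on HTTP 429 with exponential backoff; other responses are returned as-is."""
    for attempt in range(max_attempts):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    return response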

A positive Organization Credit balance is required to continue using Hive Models. Once you run out of credits, requests will fail with the following error.

{
  "status_code":405,
  "message":"Your Organization is currently paused. Please check your account balance, our terms and conditions, or contact [email protected] for more information."
}

Pricing

Hive VLM is priced per input and output token. Because latency also scales with tokens, smaller images → fewer patches → faster responses.

For image tokenization, the following logic is used:

  1. Start with the image’s aspect ratio and area.
     • Example: a 1,024 × 1,024 picture has ratio = 1.0 and area = 1M px.
  2. Consider every logical way to slice the picture into tiles.
     • A “tile” is a square crop that the model turns into a fixed 256-token chunk.
     • From 1 tile up to 6 tiles are allowed in total, so the grids considered are 1 × 1, 1 × 2, 2 × 1, 2 × 2, 1 × 3, 3 × 1, and so on.
  3. Rank those grids by two rules (in order):
     • Match the image's original aspect ratio. For example, a 3 × 2 grid (ratio = 1.5) fits an 800 × 600 image (ratio ≈ 1.33) better than a 2 × 2 grid (ratio = 1.0).
     • If two grids tie on ratio, pick the one that uses more tiles.
  4. Count how many tiles the chosen grid has, and multiply by 256 tokens per tile.
  5. Finally, add 260 more tokens to images with greater than 1 patch.
Example image resolutions with their chosen grid, tile count, and token cost:

  • 633x633, or smaller squares (fastest latency) – 1x1 grid, 1 tile – only 260 tokens
  • 1024x1024 – 2x2 grid, 4 tiles – (4x256) + 260 = 1284 tokens
  • 800x600 – 3x2 grid, 6 tiles – (6x256) + 260 = 1796 tokens
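The sketch below is a rough, unofficial Python estimator of the logic above. It assumes, based on the Performance Tips section, that images at or below 633x633 stay as a single 256-token patch, and it applies steps 2-5 to everything larger; treat it as an illustration rather than a pricing calculator, since the exact per-image overhead may differ.

def estimate_image_tokens(width: int, height: int) -> int:
    """Rough, unofficial estimate of input tokens for one image (steps 1-5 above)."""
    # Images that fit a single 633x633 patch are not split further (see Performance Tips).
    if width <= 633 and height <= 633:
        return 256

    image_ratio = width / height
    # Every cols x rows grid with at most 6 tiles is a candidate.
    candidates = [(c, r) for c in range(1, 7) for r in range(1, 7) if c * r <= 6]
    # Rank: closest aspect-ratio match first; on ties, prefer the grid with more tiles.
    cols, rows = min(candidates, key=lambda g: (abs(g[0] / g[1] - image_ratio), -(g[0] * g[1])))

    tiles = cols * rows
    return tiles * 256 + 260  # 256 tokens per tile, plus 260 for multi-patch images

print(estimate_image_tokens(1024, 1024))  # 1284, matching the 2x2 example above
print(estimate_image_tokens(800, 600))    # 1796, matching the 3x2 example above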