Multimodal Large Language Models

A guide to our multimodal large language models

Overview

Our multimodal LLMs (large language models) generate natural-language descriptions for images and videos. These models accept an optional input question, which allows the user to ask a question about the input. If a question about the image or video is included in the API call, the response contains an answer to the question instead of a caption.

For an image input, the model outputs a short text string that captions what is shown in the image.

For a video input, the model outputs a caption, or an answer to the given question, for each second of the video. The model runs behind a video split endpoint: the endpoint takes in a video, splits it into frames, and runs those frames through the model, captioning each individual frame. Likewise, if a question is provided in the initial input, an answer is returned for each individual frame.

These models have many possible applications, such as generating alt text (an HTML attribute that contains a short text description to be displayed in place of an image when that image fails to load). This provides a quick and easy solution for web accessibility.
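
For instance, a caption returned by the model can be dropped straight into an image tag's alt attribute. The snippet below is a minimal sketch in Python; the caption variable simply stands in for the text string returned by the API.

import html

def img_tag_with_alt(src: str, caption: str) -> str:
    # Escape the caption so it is safe to embed in an HTML attribute.
    return f'<img src="{html.escape(src)}" alt="{html.escape(caption)}">'

# Example: caption stands in for the text string returned by the API.
caption = "A golden retriever catching a frisbee in a park"
print(img_tag_with_alt("dog.jpg", caption))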

Models

We currently offer some of Meta’s open-source Llama Vision Instruct models from the 3.2 series, with additional models to be served in the near future.

Here are our current multimodal LLM offerings:

Model: Llama 3.2 11B Vision Instruct
Description: Llama 3.2 11B Vision Instruct is an instruction-tuned model optimized for a variety of vision-based use cases. These include but are not limited to: visual recognition, image reasoning and captioning, and answering questions about images.

Request Format

Below are the input fields for a multimodal LLM cURL request. The asterisk (*) next to an input field designates that it is required.

media*: The path to your input image or video.
question: An optional field to ask a question about the input image or video. In the example below, the question is passed inside the options form field.

Here is an example cURL request using this format:

curl --location 'https://api.thehive.ai/api/v1/task/async' \
--header 'Authorization: Token <YOUR_TOKEN>' \
--form 'media=@"<YOUR_PATH>"' \
--form 'callback_url=<SAMPLE_URL>' \
--form 'options={"question":"who is in the picture?"}'
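
For reference, here is a minimal sketch of the same request in Python, assuming the requests library is installed. <YOUR_TOKEN>, <YOUR_PATH>, and <SAMPLE_URL> are the same placeholders used in the cURL example above.

import json
import requests

url = "https://api.thehive.ai/api/v1/task/async"
headers = {"Authorization": "Token <YOUR_TOKEN>"}

# Open the input image or video and send it as the media form field,
# along with the callback URL and the optional question in options.
with open("<YOUR_PATH>", "rb") as media_file:
    response = requests.post(
        url,
        headers=headers,
        files={"media": media_file},
        data={
            "callback_url": "<SAMPLE_URL>",
            "options": json.dumps({"question": "who is in the picture?"}),
        },
    )

print(response.status_code)
print(response.text)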

Response

After making a multimodal LLM request, the response contains the caption as a text string. If a question about the image or video was included in the API call, the response contains an answer instead of a caption. To see an example API response for this model, visit our API reference page.
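
Since the example request above targets the asynchronous endpoint and supplies a callback_url, one way to receive the result is to run a small HTTP handler at that URL. Below is a minimal sketch using Flask; hosting the callback this way is an assumption for illustration, and the exact shape of the delivered payload is shown on the API reference page.

from flask import Flask, request

app = Flask(__name__)

# Hypothetical route; point callback_url at wherever this handler is hosted.
@app.route("/hive-callback", methods=["POST"])
def hive_callback():
    payload = request.get_json(force=True)
    # The payload carries the caption, or the answer if a question was
    # included in the original request.
    print(payload)
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)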