Our multimodal LLMs (large language models) generate natural-language descriptions of images and videos. These models accept an optional question about the input. If the API call includes a question about the image or video, the response contains an answer to that question instead of a caption.
For an image input, the model outputs a short text string that captions what is shown in the image.
For a video input, the model outputs a caption, or an answer to the given question, for each second of the video. The model runs on a video split endpoint: the endpoint takes in a video, splits it into frames, and runs each frame through the model, so every frame is captioned individually. Likewise, if the input includes a question, an answer is returned for each individual frame.
Image Input Response
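An example response for an image input without a user-provided question is shown below. When no question is provided, the model generates a caption. In this example the echoed input still carries a question value ("Describe this image in a few sentences"), which appears to serve as the captioning prompt.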
{
  "id": "d3cdfdd0-9bd0-11ef-b0e2-5b8fce7d994e",
  "project_id": 1101962924,
  "metadata": "forest.jpg",
  "from_cache": false,
  "status": [
    {
      "status": {
        "code": 0,
        "message": "SUCCESS"
      },
      "response": {
        "input": {
          "id": "d3cdfdd0-9bd0-11ef-b0e2-5b8fce7d994e",
          "project_id": 1101962924,
          "created_on": "2024-11-05T23:51:08.077Z",
          "user_id": 3121654,
          "charge": 0,
          "model": "llama3_11b_vision_instruct_v02",
          "model_version": 1,
          "model_type": "LARGE_LANGUAGE_MODEL",
          "media": {
            "url": null,
            "filename": "forest.jpg",
            "type": "PHOTO",
            "mime_type": "jpeg",
            "mimetype": "image/jpeg",
            "width": 1280,
            "height": 852,
            "num_frames": 1,
            "duration": 0
          },
          "question": "Describe this image in a few sentences"
        },
        "output": [
          {
            "total_output_tokens": 87,
            "total_tokens": 105,
            "task_units_per_input_token": 1,
            "task_units_per_output_token": 1,
            "response": "This serene image depicts a tranquil lake scene, with a small white rowboat floating on the calm water. The boat is situated near the shore, surrounded by a rocky shoreline and a lush forest of trees in the background. The atmosphere is peaceful, with a gentle mist rising from the water, creating a sense of stillness and serenity.",
            "input_tokens": 18,
            "time": 0
          }
        ]
      }
    }
  ]
}
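The nesting in the response above can be walked with a few dictionary lookups. The following is a minimal Python sketch, not an official client: the function name `extract_responses` is hypothetical, the sample dictionary is a heavily trimmed stand-in for the payload above (with the caption shortened), and treating a status code of 0 as the only success value is an assumption based on the examples.

```python
from typing import Any


def extract_responses(payload: dict[str, Any]) -> list[str]:
    """Collect every generated text string from a response payload."""
    texts: list[str] = []
    for entry in payload.get("status", []):
        # Assumption: code 0 accompanies "SUCCESS", as in the examples;
        # entries with any other code are skipped.
        if entry.get("status", {}).get("code") != 0:
            continue
        for item in entry.get("response", {}).get("output", []):
            if "response" in item:
                texts.append(item["response"])
    return texts


# Trimmed stand-in for the image response; most fields are omitted
# and the caption text is shortened for readability.
sample = {
    "id": "d3cdfdd0-9bd0-11ef-b0e2-5b8fce7d994e",
    "status": [
        {
            "status": {"code": 0, "message": "SUCCESS"},
            "response": {
                "input": {"media": {"type": "PHOTO"}},
                "output": [
                    {
                        "response": "This serene image depicts a tranquil lake scene.",
                        "time": 0,
                    }
                ],
            },
        }
    ],
}

captions = extract_responses(sample)
print(captions[0])
```

For an image input the returned list holds a single string, since the `output` array in the example contains one entry.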
An example response for an image input with a user-provided question is shown below. When a question is provided, the response contains an answer instead of a caption.
{
  "id": "314f4ea0-9bbd-11ef-843a-a7d06cbadea8",
  "project_id": 1101962924,
  "metadata": "forest.jpg",
  "from_cache": false,
  "status": [
    {
      "status": {
        "code": 0,
        "message": "SUCCESS"
      },
      "response": {
        "input": {
          "id": "314f4ea0-9bbd-11ef-843a-a7d06cbadea8",
          "project_id": 1101962924,
          "created_on": "2024-11-05T21:30:35.018Z",
          "user_id": 3121654,
          "charge": 0,
          "model": "llama3_11b_vision_instruct_v02",
          "model_version": 1,
          "model_type": "LARGE_LANGUAGE_MODEL",
          "media": {
            "url": null,
            "filename": "forest.jpg",
            "type": "PHOTO",
            "mime_type": "jpeg",
            "mimetype": "image/jpeg",
            "width": 1280,
            "height": 852,
            "num_frames": 1,
            "duration": 0
          },
          "question": "What feelings are evoked by this image?"
        },
        "output": [
          {
            "task_units_per_output_token": 1,
            "response": "This image evokes feelings of serenity, tranquility, and peacefulness. The stillness of the water, the mist rising from it, and the lush greenery surrounding the lake create a sense of calmness. The image also suggests a connection to nature and the beauty of the natural world. Overall, the image conveys a sense of relaxation and rejuvenation, inviting the viewer to step into its serene atmosphere.",
            "input_tokens": 20,
            "total_output_tokens": 105,
            "total_tokens": 125,
            "task_units_per_input_token": 1,
            "time": 0
          }
        ]
      }
    }
  ]
}
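In both image examples, `total_tokens` equals `input_tokens` plus `total_output_tokens` (18 + 87 = 105 and 20 + 105 = 125). The Python sketch below checks that relationship on the token fields copied from the two responses. The `estimate_task_units` helper is hypothetical: multiplying each token count by its per-token rate is an assumption about how the `task_units_per_*` fields are meant to be used, not a documented billing formula.

```python
def token_totals_consistent(item: dict) -> bool:
    """Return True when total_tokens is the sum of input and output tokens."""
    return item["total_tokens"] == item["input_tokens"] + item["total_output_tokens"]


def estimate_task_units(item: dict) -> int:
    """Hypothetical estimate: tokens weighted by the per-token rates.

    Assumption: the rates apply linearly to the token counts. The source
    does not state how task units are actually computed.
    """
    return (
        item["input_tokens"] * item["task_units_per_input_token"]
        + item["total_output_tokens"] * item["task_units_per_output_token"]
    )


# Token fields copied from the two image responses (other fields omitted).
caption_usage = {
    "input_tokens": 18,
    "total_output_tokens": 87,
    "total_tokens": 105,
    "task_units_per_input_token": 1,
    "task_units_per_output_token": 1,
}
answer_usage = {
    "input_tokens": 20,
    "total_output_tokens": 105,
    "total_tokens": 125,
    "task_units_per_input_token": 1,
    "task_units_per_output_token": 1,
}

for usage in (caption_usage, answer_usage):
    print(token_totals_consistent(usage), estimate_task_units(usage))
```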
Video Input Response
An example response for a video input with a user-provided question is shown below. When a question is provided, the output for each frame contains an answer instead of a caption. When no question is provided, a caption is generated for each frame instead.
{
  "id": "f7fecf80-9bcc-11ef-a6c1-e97be098d8aa",
  "project_id": 1101962924,
  "metadata": "nature.mp4",
  "from_cache": false,
  "status": [
    {
      "status": {
        "code": 0,
        "message": "SUCCESS"
      },
      "response": {
        "input": {
          "id": "f7fecf80-9bcc-11ef-a6c1-e97be098d8aa",
          "project_id": 1101962924,
          "created_on": "2024-11-05T23:23:30.808Z",
          "user_id": 3121654,
          "charge": 0,
          "model": "llama3_11b_vision_instruct_v02",
          "model_version": 1,
          "model_type": "LARGE_LANGUAGE_MODEL",
          "media": {
            "url": null,
            "filename": "nature.mp4",
            "type": "VIDEO",
            "mime_type": "mp4",
            "mimetype": "video/mp4",
            "width": 4096,
            "height": 2048,
            "num_frames": 13,
            "duration": 12.011667
          },
          "question": "What feelings are evoked by this image?"
        },
        "output": [
          {
            "response": "The image evokes feelings of serenity, tranquility, and awe. The lush greenery and the majestic waterfall create a sense of calmness and peacefulness, while the misty atmosphere adds to the mystical and enchanting ambiance. The image also conveys a sense of wonder and appreciation for nature's beauty, inviting the viewer to step into the serene and idyllic world depicted.",
            "input_tokens": 20,
            "total_output_tokens": 99,
            "total_tokens": 119,
            "task_units_per_input_token": 1,
            "task_units_per_output_token": 1,
            "time": 0
          },
          {
            "time": 0.9592916666666667,
            //...
          {
            "response": "This image evokes feelings of serenity, tranquility, and awe. The majestic waterfall, surrounded by lush greenery and towering trees, creates a sense of wonder and peacefulness. The misty atmosphere adds to the mystical quality of the scene, inviting the viewer to step into a world of natural beauty and tranquility.",
            "input_tokens": 20,
            "total_output_tokens": 86,
            "total_tokens": 106,
            "task_units_per_input_token": 1,
            "task_units_per_output_token": 1,
            "time": 11.970291666666666
          }
        ]
      }
    }
  ]
}
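Because a video response carries one `output` entry per frame, it can be convenient to turn the array into a timeline and look up the answer that applies at a given moment. The Python sketch below illustrates one way to do this. The function names `frame_timeline` and `answer_at` are hypothetical, the sample keeps only the first and last frames from the example with shortened text, and reading `time` as an offset in seconds from the start of the video is an assumption based on the values shown.

```python
from bisect import bisect_right
from typing import Optional


def frame_timeline(payload: dict) -> list[tuple[float, str]]:
    """Return (time, response) pairs for every frame, sorted by time."""
    rows = []
    for entry in payload.get("status", []):
        for item in entry.get("response", {}).get("output", []):
            if "response" in item:
                rows.append((float(item.get("time", 0)), item["response"]))
    return sorted(rows, key=lambda row: row[0])


def answer_at(timeline: list[tuple[float, str]], seconds: float) -> Optional[str]:
    """Return the response of the latest frame at or before `seconds`."""
    times = [t for t, _ in timeline]
    index = bisect_right(times, seconds) - 1
    return timeline[index][1] if index >= 0 else None


# Trimmed stand-in for the video response: first and last frames only,
# with the response text shortened. Assumption: `time` is in seconds.
video_sample = {
    "status": [
        {
            "status": {"code": 0, "message": "SUCCESS"},
            "response": {
                "output": [
                    {"response": "The image evokes feelings of serenity.", "time": 0},
                    {
                        "response": "This image evokes feelings of awe.",
                        "time": 11.970291666666666,
                    },
                ]
            },
        }
    ]
}

timeline = frame_timeline(video_sample)
print(answer_at(timeline, 5.0))
```

Sorting by `time` keeps the lookup correct even if the entries do not arrive in order.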