Our multimodal large language models (LLMs) generate natural-language descriptions of images and videos. Each model accepts an optional question about the input. If the API call includes a question about the image or video, the response contains an answer to that question instead of a caption.

For an image input, the model outputs a short text string that captions what is shown in the image.

For a video input, the model runs behind a video split endpoint. The endpoint takes in a video, splits it into frames (roughly one per second of video), and runs each frame through the model, so every frame receives its own caption. Likewise, if a question is provided in the initial input, every frame receives its own answer to that question.
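The per-frame behavior described above can be pictured with a small sketch. This is not the endpoint's implementation: caption_frames, the model callable, and the (timestamp, frame) pairs are hypothetical stand-ins. The fallback prompt is copied from the "question" field of the no-question example response, on the assumption that it is the default prompt.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

# ASSUMPTION: this string appears in the "question" field of the example
# response for a request without a question; treating it as the default
# prompt is a guess, not documented behavior.
FALLBACK_PROMPT = "Describe this image in a few sentences"


def caption_frames(
    frames: Iterable[Tuple[float, Any]],
    model: Callable[[Any, str], str],
    question: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Run every (timestamp, frame) pair through the model with one prompt."""
    prompt = question or FALLBACK_PROMPT
    return [{"time": t, "response": model(frame, prompt)} for t, frame in frames]


def fake_model(frame: Any, prompt: str) -> str:
    """Hypothetical stand-in for the real model, for illustration only."""
    return f"{prompt} -> {frame}"


results = caption_frames(
    [(0.0, "frame-0"), (1.0, "frame-1")],
    fake_model,
    question="What is shown?",
)
```

Each entry in results pairs a timestamp with the text produced for that frame, which mirrors the shape of the "output" array in the video response.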

Image Input Response

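An example response for an image input without a user-provided question is shown below. Because no question was provided, the response contains a caption. Note that the question field of the echoed input still holds a prompt ("Describe this image in a few sentences") in this case.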

{
    "id": "d3cdfdd0-9bd0-11ef-b0e2-5b8fce7d994e",
    "project_id": 1101962924,
    "metadata": "forest.jpg",
    "from_cache": false,
    "status": [
        {
            "status": {
                "code": 0,
                "message": "SUCCESS"
            },
            "response": {
                "input": {
                    "id": "d3cdfdd0-9bd0-11ef-b0e2-5b8fce7d994e",
                    "project_id": 1101962924,
                    "created_on": "2024-11-05T23:51:08.077Z",
                    "user_id": 3121654,
                    "charge": 0,
                    "model": "llama3_11b_vision_instruct_v02",
                    "model_version": 1,
                    "model_type": "LARGE_LANGUAGE_MODEL",
                    "media": {
                        "url": null,
                        "filename": "forest.jpg",
                        "type": "PHOTO",
                        "mime_type": "jpeg",
                        "mimetype": "image/jpeg",
                        "width": 1280,
                        "height": 852,
                        "num_frames": 1,
                        "duration": 0
                    },
                    "question": "Describe this image in a few sentences"
                },
                "output": [
                    {
                        "total_output_tokens": 87,
                        "total_tokens": 105,
                        "task_units_per_input_token": 1,
                        "task_units_per_output_token": 1,
                        "response": "This serene image depicts a tranquil lake scene, with a small white rowboat floating on the calm water. The boat is situated near the shore, surrounded by a rocky shoreline and a lush forest of trees in the background. The atmosphere is peaceful, with a gentle mist rising from the water, creating a sense of stillness and serenity.",
                        "input_tokens": 18,
                        "time": 0
                    }
                ]
            }
        }
    ]
}
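A minimal sketch of reading the generated text out of a response shaped like the example above. It relies only on field names that appear in that example. The helper name extract_outputs is hypothetical, and treating code 0 as the success marker is an assumption based on its pairing with "SUCCESS" in the examples.

```python
from typing import Any, Dict, List


def extract_outputs(api_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the output entries from every successful item in "status"."""
    outputs: List[Dict[str, Any]] = []
    for item in api_response.get("status", []):
        # ASSUMPTION: code 0 means success, as in the example responses.
        if item.get("status", {}).get("code") != 0:
            continue
        outputs.extend(item.get("response", {}).get("output", []))
    return outputs


# Trimmed copy of the example response, with the caption text shortened.
example = {
    "status": [
        {
            "status": {"code": 0, "message": "SUCCESS"},
            "response": {
                "output": [
                    {
                        "response": "This serene image depicts a tranquil lake scene",
                        "input_tokens": 18,
                        "total_output_tokens": 87,
                        "total_tokens": 105,
                        "time": 0,
                    }
                ]
            },
        }
    ]
}

caption = extract_outputs(example)[0]["response"]
print(caption)
```

For an image input the output list holds a single entry, so the caption (or answer) is the "response" value of the first element.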

An example response for an image input with a user-provided question is shown below. Because a question was provided, the response contains an answer instead of a caption.

{
    "id": "314f4ea0-9bbd-11ef-843a-a7d06cbadea8",
    "project_id": 1101962924,
    "metadata": "forest.jpg",
    "from_cache": false,
    "status": [
        {
            "status": {
                "code": 0,
                "message": "SUCCESS"
            },
            "response": {
                "input": {
                    "id": "314f4ea0-9bbd-11ef-843a-a7d06cbadea8",
                    "project_id": 1101962924,
                    "created_on": "2024-11-05T21:30:35.018Z",
                    "user_id": 3121654,
                    "charge": 0,
                    "model": "llama3_11b_vision_instruct_v02",
                    "model_version": 1,
                    "model_type": "LARGE_LANGUAGE_MODEL",
                    "media": {
                        "url": null,
                        "filename": "forest.jpg",
                        "type": "PHOTO",
                        "mime_type": "jpeg",
                        "mimetype": "image/jpeg",
                        "width": 1280,
                        "height": 852,
                        "num_frames": 1,
                        "duration": 0
                    },
                    "question": "What feelings are evoked by this image?"
                },
                "output": [
                    {
                        "task_units_per_output_token": 1,
                        "response": "This image evokes feelings of serenity, tranquility, and peacefulness. The stillness of the water, the mist rising from it, and the lush greenery surrounding the lake create a sense of calmness. The image also suggests a connection to nature and the beauty of the natural world. Overall, the image conveys a sense of relaxation and rejuvenation, inviting the viewer to step into its serene atmosphere.",
                        "input_tokens": 20,
                        "total_output_tokens": 105,
                        "total_tokens": 125,
                        "task_units_per_input_token": 1,
                        "time": 0
                    }
                ]
            }
        }
    ]
}
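In both image examples above, the token counts are consistent: total_tokens equals input_tokens plus total_output_tokens (18 + 87 = 105 and 20 + 105 = 125). The sketch below checks that relationship. It also estimates task units by weighting each count with the matching task_units_per_* field. That formula is an assumption based on the field names, not something the examples confirm, and both function names are hypothetical.

```python
from typing import Any, Dict


def tokens_are_consistent(entry: Dict[str, Any]) -> bool:
    """Check that input and output token counts add up to the total."""
    return entry["input_tokens"] + entry["total_output_tokens"] == entry["total_tokens"]


def estimated_task_units(entry: Dict[str, Any]) -> float:
    """Weight token counts by their per-token task units.

    ASSUMPTION: the formula is inferred from the field names only.
    """
    return (
        entry["input_tokens"] * entry["task_units_per_input_token"]
        + entry["total_output_tokens"] * entry["task_units_per_output_token"]
    )


# Token fields copied from the example response with a question.
entry = {
    "input_tokens": 20,
    "total_output_tokens": 105,
    "total_tokens": 125,
    "task_units_per_input_token": 1,
    "task_units_per_output_token": 1,
}

print(tokens_are_consistent(entry))  # True
print(estimated_task_units(entry))   # 125
```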

Video Input Response

An example response for a video input with a user-provided question is shown below. The output array holds one entry per frame. Each entry carries the frame's timestamp in the time field and, because a question was provided, an answer to that question. If the user doesn’t provide a question, each entry contains a caption instead.

{
    "id": "f7fecf80-9bcc-11ef-a6c1-e97be098d8aa",
    "project_id": 1101962924,
    "metadata": "nature.mp4",
    "from_cache": false,
    "status": [
        {
            "status": {
                "code": 0,
                "message": "SUCCESS"
            },
            "response": {
                "input": {
                    "id": "f7fecf80-9bcc-11ef-a6c1-e97be098d8aa",
                    "project_id": 1101962924,
                    "created_on": "2024-11-05T23:23:30.808Z",
                    "user_id": 3121654,
                    "charge": 0,
                    "model": "llama3_11b_vision_instruct_v02",
                    "model_version": 1,
                    "model_type": "LARGE_LANGUAGE_MODEL",
                    "media": {
                        "url": null,
                        "filename": "nature.mp4",
                        "type": "VIDEO",
                        "mime_type": "mp4",
                        "mimetype": "video/mp4",
                        "width": 4096,
                        "height": 2048,
                        "num_frames": 13,
                        "duration": 12.011667
                    },
                    "question": "What feelings are evoked by this image?"
                },
                "output": [
                    {
                        "response": "The image evokes feelings of serenity, tranquility, and awe. The lush greenery and the majestic waterfall create a sense of calmness and peacefulness, while the misty atmosphere adds to the mystical and enchanting ambiance. The image also conveys a sense of wonder and appreciation for nature's beauty, inviting the viewer to step into the serene and idyllic world depicted.",
                        "input_tokens": 20,
                        "total_output_tokens": 99,
                        "total_tokens": 119,
                        "task_units_per_input_token": 1,
                        "task_units_per_output_token": 1,
                        "time": 0
                    },
                    {
                        "time": 0.9592916666666667,
                        //...
                    },
                    //...
                    {
                        "response": "This image evokes feelings of serenity, tranquility, and awe. The majestic waterfall, surrounded by lush greenery and towering trees, creates a sense of wonder and peacefulness. The misty atmosphere adds to the mystical quality of the scene, inviting the viewer to step into a world of natural beauty and tranquility.",
                        "input_tokens": 20,
                        "total_output_tokens": 86,
                        "total_tokens": 106,
                        "task_units_per_input_token": 1,
                        "task_units_per_output_token": 1,
                        "time": 11.970291666666666
                    }
                ]
            }
        }
    ]
}
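Because a video response carries one output entry per frame, a common first step is to turn the entries into a timeline. The sketch below does that using only the "time" and "response" fields shown above. The helper name frame_timeline is hypothetical, and reading "time" as seconds is an assumption based on the video's reported duration.

```python
from typing import Any, Dict, List, Optional, Tuple


def frame_timeline(outputs: List[Dict[str, Any]]) -> List[Tuple[float, Optional[str]]]:
    """Return (time, response) pairs sorted by time.

    Entries without a "response" field are kept with None as the text.
    """
    pairs = [(float(o["time"]), o.get("response")) for o in outputs if "time" in o]
    return sorted(pairs, key=lambda pair: pair[0])


# First and last entries of the example, with the answer text shortened.
outputs = [
    {"response": "The image evokes feelings of serenity", "time": 0},
    {"response": "This image evokes feelings of serenity", "time": 11.970291666666666},
]

for seconds, text in frame_timeline(outputs):
    print(f"{seconds:7.3f}s  {text}")
```

Sorting by time is a defensive choice: the example lists entries in chronological order, but the source does not state that ordering is guaranteed.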