Hive’s Video Caption API generates natural-language descriptions for videos shorter than 30 seconds. For each video input, the model returns a text string describing what is shown. The model samples 16 frames from the video and uses them to generate a unified caption that covers not only the subject and scenery but also movements and actions. The resulting captions are long and descriptive, with a maximum length of 1024 tokens.
An example JSON response with no question supplied is shown below:
{
  "status": [
    {
      "status": {
        "code": "0",
        "message": "SUCCESS"
      },
      "_version": 2,
      "response": {
        "response": "a video of a kid playing in a playground. The child is throwing a ball."
      }
    }
  ]
}
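As a quick illustration, the Python sketch below submits a video and pulls the caption string out of the nested structure shown above. Only the response layout (status[0].response.response) is taken from the example; the endpoint URL, header format, and field name used for the upload are assumptions for the sketch and may differ from your account's actual configuration.

# Minimal sketch: submit a short video and extract its caption.
# ENDPOINT and the "Authorization: Token <key>" header are assumptions;
# check your Hive project settings for the real values.
import requests

API_KEY = "YOUR_API_KEY"                               # placeholder credential
ENDPOINT = "https://api.thehive.ai/api/v2/task/sync"   # assumed sync endpoint

def caption_video(path: str) -> str:
    with open(path, "rb") as f:
        resp = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Token {API_KEY}"},
            files={"media": f},       # assumed multipart field name
            timeout=120,
        )
    resp.raise_for_status()
    data = resp.json()
    # The caption lives at status[0].response.response, per the example above.
    return data["status"][0]["response"]["response"]

if __name__ == "__main__":
    print(caption_video("playground.mp4"))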