Hive’s Video Caption API generates natural-language descriptions for videos shorter than 30 seconds. For every video input, the model returns a text string describing what is shown. The model samples 16 frames from each video and uses them to generate a unified caption that covers not only the subject and scenery, but also movements and actions. The resulting captions are long and descriptive, with a maximum length of 1024 tokens.
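
The sketch below shows one way a caption request might be submitted from Python. The endpoint path, header format, and form field names are assumptions for illustration, not confirmed values from this documentation; use the endpoint and credentials issued with your Hive project.

import requests

API_KEY = "YOUR_API_KEY"  # placeholder; use your project's API key
ENDPOINT = "https://api.thehive.ai/api/v2/task/sync"  # assumed synchronous task endpoint

def caption_video(video_url: str) -> dict:
    """Submit a video URL (shorter than 30s) and return the parsed JSON response."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Token {API_KEY}"},  # header format is an assumption
        data={"url": video_url},                        # form field name is an assumption
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()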

An example JSON response for a request with no question supplied is shown below:

{
   "status": [
    {
      "status": {
        "code": "0",
        "message": "SUCCESS"
      },
      "_version": 2,
      "response": {
        "response": "a video of a kid playing in a playground. The child is throwing a ball."
      }
    }
  ]
}
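
Given this response shape, the caption text sits in the nested "response" object of the first entry in the "status" array. The helper below is a minimal sketch based on the example above; the treatment of non-SUCCESS status codes is an assumption and may need adapting.

def extract_caption(result: dict) -> str:
    """Return the caption string from a Video Caption API response."""
    entry = result["status"][0]
    status = entry["status"]
    # Assumed error handling: treat anything other than SUCCESS as a failure.
    if status["message"] != "SUCCESS":
        raise RuntimeError(f"Captioning failed: {status['code']} {status['message']}")
    return entry["response"]["response"]

For example, extract_caption applied to the response above returns "a video of a kid playing in a playground. The child is throwing a ball.".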