Video Captioning


Hive’s Video Caption API generates natural-language descriptions for videos shorter than 30 seconds. For each input video, the model returns a single text string describing what is shown. It samples 16 frames from the video and uses them to produce one unified caption that covers not only the subject and scenery but also movements and actions. The resulting captions are long and descriptive, with a maximum length of 1024 tokens.
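Because only videos shorter than 30 seconds are supported, it can help to check a clip's duration locally before submitting it. The sketch below is a hypothetical pre-flight check and not part of Hive's API: the function names are made up for illustration, and it assumes FFmpeg's ffprobe is installed to read the duration.

```shell
# Hypothetical pre-flight check, not part of Hive's API: reject clips that
# are not shorter than the 30-second limit before uploading them.

# Succeeds (exit 0) only if the argument is a plain number of seconds below 30.
is_short_enough() {
  awk -v d="$1" 'BEGIN { exit !(d ~ /^[0-9]+(\.[0-9]+)?$/ && d + 0 < 30) }'
}

# Prints the container duration in seconds (assumes ffprobe is on the PATH).
video_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Combines the two: succeeds only for a readable file shorter than 30 seconds.
check_video() {
  is_short_enough "$(video_duration "$1")"
}
```

For example, `check_video clip.mp4 && echo "ok to submit"` prints the message only when the clip is within the limit. A file whose duration cannot be read is rejected, since the duration string then fails the numeric check.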

Request Format

This API accepts videos either as binary files or as publicly accessible URLs. Here is an example of each submission method:

# Submit a task using a publicly accessible video URL
curl --request POST \
  --url \
  --header 'accept: application/json' \
  --header 'authorization: token <API_KEY>' \
  --form 'url=http://public_url.mp4'

# Submit a task using a local media file
curl --request POST \
  --url \
  --header 'Authorization: Token <API_KEY>' \
  --form 'media=@"<absolute/path/to/file>"'
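The two requests above differ only in their form field (`url` for a public link, `media` for a file upload), so they can be wrapped in one small helper. The following is a minimal sketch, not Hive tooling: the function name `submit_caption_task`, the `HIVE_API_KEY` variable, and the `DRY_RUN` switch are all assumptions made for illustration. The endpoint is passed as an argument because it is not shown here, and the sketch sends the `accept` header for both submission methods.

```shell
# Hypothetical helper, not part of Hive's tooling: wraps the two curl calls.
# Usage: submit_caption_task <endpoint> <video-url | path-to-local-file>
# Assumptions: the API key is read from HIVE_API_KEY (a name chosen for this
# sketch), and DRY_RUN=1 prints the curl arguments instead of sending them.
submit_caption_task() {
  if [ "$#" -ne 2 ]; then
    echo "usage: submit_caption_task <endpoint> <url-or-file>" >&2
    return 2
  fi
  endpoint="$1"

  # Public URLs go in the 'url' form field; local files are uploaded as 'media'.
  case "$2" in
    http://*|https://*)
      field="url=$2"
      ;;
    *)
      if [ ! -f "$2" ]; then
        echo "error: no such file: $2" >&2
        return 1
      fi
      field="media=@\"$2\""
      ;;
  esac

  # Build the argument list once so the dry run shows exactly what would run.
  set -- curl --request POST --url "$endpoint" \
    --header 'accept: application/json' \
    --header "authorization: token ${HIVE_API_KEY:-}" \
    --form "$field"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$@"   # one argument per line, for inspection
  else
    "$@"
  fi
}
```

Running it with `DRY_RUN=1` first is a cheap way to confirm which form field will be sent before any request is made.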


The Video Caption API response contains the video's caption as a text string. For an annotated example of the API response object for this model, see our API Reference.