Hive’s Short Caption API generates natural-language descriptions of images, each up to 32 tokens (around 64 characters) in length. For every input image, the model outputs a short text string describing what the image shows. The model also accepts a question about the image as an optional input (e.g., "what color is the cat?" or "how many cats are in the image?"), and the answer to that question is then included in the model response.

The output of our Short Caption API contains the image's caption as a text string in the response field. If a question about the image was included in the API call, the response field instead contains the answer to that question, and the question field contains the question itself. In other words, responses to API calls that included a question contain only the question and its answer, not a caption.

An example JSON response with no question is shown below:

{
  "status": [
    {
      "status": {
        "code": "0",
        "message": "SUCCESS"
      },
      "_version": 2,
      "response": {
        "question": "",
        "response": "a kid playing in a playground"
      }
    }
  ]
}

An example JSON response with a question:

{
  "status": [
    {
      "status": {
        "code": "0",
        "message": "SUCCESS"
      },
      "_version": 2,
      "response": {
        "question": "how old is the kid?",
        "response": "he is a young boy"
      }
    }
  ]
}
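
As a rough illustration of how a client might read these responses, the Python sketch below walks the status list and returns either the caption or the answer from each entry. The field names are taken from the two examples above. The helper name parse_short_caption, the CaptionResult type, and the choice to skip any entry whose status code is not "0" are assumptions made for this example and are not part of the API.

```python
import json
from typing import List, NamedTuple, Optional


class CaptionResult(NamedTuple):
    """Hypothetical container for one parsed entry (not part of the API)."""
    question: Optional[str]  # None when no question was sent
    text: str                # the caption, or the answer to the question


def parse_short_caption(payload: dict) -> List[CaptionResult]:
    """Extract the caption or answer from each entry in a response payload."""
    results = []
    for entry in payload.get("status", []):
        # Assumption: code "0" means success, as in the examples above;
        # entries with any other code are skipped.
        if str(entry.get("status", {}).get("code")) != "0":
            continue
        body = entry.get("response", {})
        # An empty question string means the text is a plain caption.
        question = body.get("question") or None
        results.append(CaptionResult(question, body.get("response", "")))
    return results


raw = """
{
  "status": [
    {
      "status": {"code": "0", "message": "SUCCESS"},
      "_version": 2,
      "response": {
        "question": "how old is the kid?",
        "response": "he is a young boy"
      }
    }
  ]
}
"""

for result in parse_short_caption(json.loads(raw)):
    kind = "answer" if result.question else "caption"
    print(f"{kind}: {result.text}")
```

Because the response field holds either a caption or an answer, the sketch uses the question field to tell the two cases apart rather than inspecting the text itself.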