Hive’s speech-to-text model outputs a full transcript of the clip, along with a timestamp and confidence score for each predicted word.

{
  "output": [
    {
      "transcript": " Like hey you ...",
      "words": [
        {
          "time": 0.08,
          "meta": {},
          "type": "pronunciation",
          "alternatives": [
            {
              "text": "Like",
              "score": 0.4093419909477234
            }
          ]
        },
        {
          "time": 0.72,
          "meta": {},
          "type": "pronunciation",
          "alternatives": [
            {
              "text": "hey",
              "score": 0.5025633573532104
            }
          ]
        },
        {
          "time": 0.96,
          "meta": {},
          "type": "pronunciation",
          "alternatives": [
            {
              "text": "you",
              "score": 0.6408597230911255
            }
          ]
        }
      ]
    }
  ]
}
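As a minimal sketch of how a client might walk this structure, the snippet below parses a response body shaped like the example above and yields the timestamp, text, and score of each word. The helper name `iter_words` is hypothetical, and taking the first entry in `alternatives` as the prediction to report is an assumption, not documented behavior.

```python
import json

# Response body trimmed to the fields shown in the example above.
raw = """
{"output": [{"transcript": " Like hey you ...", "words": [
  {"time": 0.08, "meta": {}, "type": "pronunciation",
   "alternatives": [{"text": "Like", "score": 0.4093419909477234}]},
  {"time": 0.72, "meta": {}, "type": "pronunciation",
   "alternatives": [{"text": "hey", "score": 0.5025633573532104}]},
  {"time": 0.96, "meta": {}, "type": "pronunciation",
   "alternatives": [{"text": "you", "score": 0.6408597230911255}]}
]}]}
"""


def iter_words(response):
    """Yield (time, text, score) for each word entry in the response.

    Assumption: the first item in "alternatives" is the one to report.
    Entries with an empty "alternatives" list are skipped.
    """
    for item in response.get("output", []):
        for word in item.get("words", []):
            alternatives = word.get("alternatives", [])
            if not alternatives:
                continue
            first = alternatives[0]
            yield word["time"], first["text"], first["score"]


response = json.loads(raw)
for t, text, score in iter_words(response):
    print(f"{t:5.2f}s  {text:<6} {score:.3f}")
```

Using `.get` with empty defaults keeps the loop from failing on a response that lacks `output` or `words`; whether such responses can occur is not stated here.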

| Name | Description |
| --- | --- |
| transcript | Full transcript of the entire video or audio clip. |
| time | Timestamp, in seconds, of each predicted word or punctuation mark in the transcript. |
| type | pronunciation: the predicted character string is a word. punctuation: the predicted character string is a punctuation mark. |
| text | Predicted character string at that timestamp. |
| score | Confidence score for the predicted character string. |
| alternatives | List of alternative word predictions at each timestamp. |
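The `type` and `score` fields described above can be combined to flag words that may need manual review. The following is a sketch only: the function name `low_confidence_words` and the 0.5 threshold are hypothetical choices for illustration, and picking the highest-scoring entry in `alternatives` is an assumption about how the list should be read.

```python
def low_confidence_words(words, threshold=0.5):
    """Return (time, text, score) for spoken words scoring below threshold.

    Entries whose type is not "pronunciation" (for example punctuation)
    are skipped. The threshold value is an illustrative assumption.
    """
    flagged = []
    for word in words:
        if word.get("type") != "pronunciation":
            continue
        alternatives = word.get("alternatives", [])
        if not alternatives:
            continue
        # Assumption: the highest-scoring alternative is the chosen prediction.
        best = max(alternatives, key=lambda alt: alt["score"])
        if best["score"] < threshold:
            flagged.append((word["time"], best["text"], best["score"]))
    return flagged


# Word entries copied from the example response.
words = [
    {"time": 0.08, "meta": {}, "type": "pronunciation",
     "alternatives": [{"text": "Like", "score": 0.4093419909477234}]},
    {"time": 0.72, "meta": {}, "type": "pronunciation",
     "alternatives": [{"text": "hey", "score": 0.5025633573532104}]},
    {"time": 0.96, "meta": {}, "type": "pronunciation",
     "alternatives": [{"text": "you", "score": 0.6408597230911255}]},
]

flagged = low_confidence_words(words)
print(flagged)
```

With the sample entries, only "Like" falls below the illustrative 0.5 cutoff; a suitable threshold for real use would need to be chosen against your own data.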