Functions

A quick guide to dataset functions

Overview

Dataset functions support common machine learning workflows outside of model training. Functions can be used independently or chained together to accomplish simple tasks such as inference requests and complex workflows like active learning. Details on currently supported functions can be found below.

Function Triggers

Trigger       Description
Run Now       Applies the function to all existing rows in the source dataset when the function is created.
Insert Row    Applies the function to new data when it is added to the source dataset.
Update Row    Applies the function when existing rows in the source dataset are updated.
Delete Row    Deletes corresponding output dataset rows when rows in the source dataset are deleted.
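The trigger behavior described above can be pictured with a small in-memory model. This is only an illustrative sketch, not Hive's implementation: the `FunctionSketch` class, its method names, and the use of plain dictionaries for datasets are all hypothetical.

```python
from typing import Callable, Dict


class FunctionSketch:
    """Toy model of a dataset function and its triggers (hypothetical names)."""

    def __init__(self, fn: Callable[[str], str], source: Dict[int, str],
                 run_now: bool = True) -> None:
        self.fn = fn
        self.output: Dict[int, str] = {}
        if run_now:
            # Run Now: apply to every row that exists when the function is created.
            for row_id, value in source.items():
                self.output[row_id] = fn(value)

    def on_insert(self, row_id: int, value: str) -> None:
        # Insert Row: apply the function to newly added data.
        self.output[row_id] = self.fn(value)

    def on_update(self, row_id: int, value: str) -> None:
        # Update Row: re-apply the function to the changed row.
        self.output[row_id] = self.fn(value)

    def on_delete(self, row_id: int) -> None:
        # Delete Row: drop the corresponding output row, if there is one.
        self.output.pop(row_id, None)


f = FunctionSketch(str.upper, {1: "a", 2: "b"})
f.on_insert(3, "c")
f.on_update(1, "z")
f.on_delete(2)
print(f.output)  # {1: 'Z', 3: 'C'}
```

Keying output rows by the source row ID is an assumption made here so that updates and deletes can find the rows they affect.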

Inference Request Function

Input

Inference Model: The name or API key of the Hive model to send inference requests to. Regular inference charges apply.

Input Column: The dataset column to send in the inference requests. One request per row will be sent.

Output

The Inference Request function outputs the model response JSON, storing it in the specified column of the destination dataset.

Extract from File Function

Input

Extraction Source: The files in the source dataset to be extracted into plain text files.

Output

The Extract from File function outputs plain text files containing the text contents of the input files, storing them in the specified column of the destination dataset.

Chunk Text Function

Input

Chunk Source: The dataset column containing plain text files or raw text to be broken into smaller chunks of text. Typically this would be plain text files created by the Extract from File function.

Chunk Size: The maximum number of tokens in each output chunk. Token sizes vary, but tokens roughly map to words and are typically 3-4 characters long.

Overlap: The number of tokens shared between consecutive chunks. Overlap helps preserve context, improve retrieval, and reduce information loss in retrieval-augmented generation (RAG), a typical use case for text chunks.

Trim Whitespace: Whether to remove unnecessary spaces, tabs, newlines, and other whitespace from chunks or to preserve it.

Output

The Chunk Text function outputs strings of text representing each chunk in the specified column of the destination dataset. Each input row of the Chunk Text function may yield multiple output rows in the destination dataset.
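To make the interaction between Chunk Size, Overlap, and Trim Whitespace concrete, here is a minimal sketch of overlapping chunking. It assumes whitespace-separated words stand in for tokens, which is a simplification; the actual tokenizer is not described here, and the function name `chunk_text` and its parameters are hypothetical.

```python
import re
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 0,
               trim_whitespace: bool = True) -> List[str]:
    """Split text into chunks of at most chunk_size tokens (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Assumption: one "token" is a run of non-whitespace characters,
    # kept together with the whitespace that follows it.
    tokens = re.findall(r"\S+\s*", text)
    step = chunk_size - overlap  # how far the window advances each time

    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        piece = "".join(tokens[start:start + chunk_size])
        if trim_whitespace:
            # Collapse runs of whitespace and strip the ends.
            piece = " ".join(piece.split())
        chunks.append(piece)
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("a b c d e f g", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g']
```

In this sketch the token `d` appears in both chunks because the overlap is 1, and a single input string yields two output rows.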

Update Custom Index Function

Input

Custom Index: The name or API key of the Hive custom index project to update based on changes in the source dataset. Regular custom index charges apply.

Custom Index Source: The dataset column to insert into and/or remove from the custom index. Typically this would be text chunks created by the Chunk Text function.

Output

The Update Custom Index function outputs the custom index response JSON, storing it in the specified column of the destination dataset.

Extract from JSON Function

Input

Extraction Source: The dataset column of JSON objects to extract data from.

JSON Path: The absolute or wildcard path at which to extract the JSON value.

Output

For each row of the extraction source, the Extract from JSON function outputs the value found at the given JSON path, storing it in the specified column of the destination dataset.
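The difference between an absolute and a wildcard path can be sketched as follows. The path syntax used here (dot-separated keys, numeric list indices, and `*` as the wildcard) is an assumption for illustration only and may differ from the syntax the function actually accepts. The `extract` helper and the sample JSON are likewise made up.

```python
import json
from typing import Any, List


def extract(value: Any, path: str) -> Any:
    """Follow a dotted path through parsed JSON; '*' fans out over all children."""
    parts = [p for p in path.split(".") if p]
    current: List[Any] = [value]
    used_wildcard = False

    for part in parts:
        matched: List[Any] = []
        for node in current:
            if part == "*":
                used_wildcard = True
                if isinstance(node, dict):
                    matched.extend(node.values())
                elif isinstance(node, list):
                    matched.extend(node)
            elif isinstance(node, dict) and part in node:
                matched.append(node[part])
            elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
                matched.append(node[int(part)])
        current = matched

    if used_wildcard:
        return current  # a wildcard path may match several values
    return current[0] if current else None  # an absolute path matches at most one


# Made-up sample row, not an actual model response.
row = json.loads('{"output": [{"class": "cat", "score": 0.9},'
                 ' {"class": "dog", "score": 0.1}]}')

print(extract(row, "output.0.class"))  # cat
print(extract(row, "output.*.class"))  # ['cat', 'dog']
```

Returning a list for wildcard paths is a design choice of this sketch; it pairs naturally with an unnesting step when one output row per match is wanted.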

Unnest Array Function

Input

Unnesting Source: The dataset column of JSON arrays to unnest into individual values.

Output

The Unnest Array function outputs the individual values from the source arrays into the specified column of the destination dataset. Each input row of the Unnest Array function may yield multiple output rows in the destination dataset.
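The one-to-many row behavior can be sketched in a few lines. This is an illustrative model using lists of dictionaries as datasets; the `unnest` helper, its parameters, and the choice to skip rows that do not contain an array are assumptions rather than documented behavior.

```python
from typing import Any, Dict, Iterable, List


def unnest(rows: Iterable[Dict[str, Any]], source_column: str,
           output_column: str) -> List[Dict[str, Any]]:
    """Emit one output row per element of each source array (hypothetical helper)."""
    out: List[Dict[str, Any]] = []
    for row in rows:
        values = row.get(source_column)
        if not isinstance(values, list):
            continue  # assumption: rows without an array yield no output rows
        for value in values:
            out.append({output_column: value})
    return out


source_rows = [{"tags": ["a", "b"]}, {"tags": []}, {"tags": ["c"]}]
print(unnest(source_rows, "tags", "tag"))
# [{'tag': 'a'}, {'tag': 'b'}, {'tag': 'c'}]
```

Here three input rows produce three output rows, but not one each: the first contributes two, the empty array contributes none, and the last contributes one.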