Functions
A quick guide to dataset functions
Overview
Dataset functions support common machine learning workflows outside of model training. Functions can be used independently or chained together to accomplish simple tasks such as inference requests and complex workflows like active learning. Details on currently supported functions can be found below.
Function Triggers
Trigger | Description |
---|---|
Run Now | Applies the function to all existing rows in the source dataset when the function is created. |
Insert Row | Applies the function to new data when it is added to the source dataset. |
Update Row | Applies the function when existing rows in the source dataset are updated. |
Delete Row | Deletes corresponding output dataset rows when rows in the source dataset are deleted. |
Inference Request Function
Input
Inference Model: The name or API key of the Hive model to send inference requests to. Regular inference charges apply.
Input Column: The dataset column to send in the inference requests. One request per row will be sent.
Output
The Inference Request function outputs the model response JSON, storing it in the specified column of the destination dataset.
Extract from File Function
Input
Extraction Source: The files in the source dataset to be extracted into plain text files.
Output
The Extract from File function outputs plain text files containing the text contents of the input files into the specified column of the destination dataset.
Chunk Text Function
Input
Chunk Source: The dataset column containing plain text files or raw text to be broken into smaller chunks of text. Typically this would be plain text files created by the Extract from File function.
Chunk Size: The maximum number of tokens in each output chunk. Token sizes vary, but tokens roughly map to words and are typically 3-4 characters long.
Overlap: The number of tokens shared between consecutive chunks. Overlap helps preserve context, improve retrieval, and reduce information loss in retrieval-augmented generation (RAG), a typical use case for text chunks.
Trim Whitespace: Whether to remove unnecessary spaces, tabs, newlines, and other whitespace from chunks or to preserve it.
Output
The Chunk Text function outputs strings of text representing each chunk in the specified column of the destination dataset. Each input row of the Chunk Text function may yield multiple output rows in the destination dataset.
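As a rough illustration of how Chunk Size, Overlap, and Trim Whitespace interact, the sketch below splits text into overlapping chunks. It is not Hive's implementation: the `chunk_text` function and its parameter names are hypothetical, and whitespace-separated words stand in for tokens, since the tokenizer actually used is not described here.

```python
import re


def chunk_text(text, chunk_size, overlap=0, trim_whitespace=True):
    """Split text into chunks of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens. Assumption: a "token" is a
    whitespace-separated word, which only approximates a real tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and less than chunk_size")

    # Keep each word together with the whitespace that follows it, so the
    # original spacing can be preserved when trim_whitespace is False.
    pieces = re.findall(r"\S+\s*", text)
    step = chunk_size - overlap

    chunks = []
    for start in range(0, len(pieces), step):
        chunk = "".join(pieces[start:start + chunk_size])
        if trim_whitespace:
            # Collapse runs of whitespace and strip both ends.
            chunk = " ".join(chunk.split())
        chunks.append(chunk)
        if start + chunk_size >= len(pieces):
            break  # the last token has been emitted
    return chunks
```

Under these assumptions, `chunk_text("a b c d e f g", chunk_size=3, overlap=1)` returns three chunks, each repeating the final word of the previous one. This mirrors how one input row can yield several output rows.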
Update Custom Index Function
Input
Custom Index: The name or API key of the Hive custom index project to update based on changes in the source dataset. Regular custom index charges apply.
Custom Index Source: The dataset column to insert into and/or remove from the custom index. Typically this would be text chunks created by the Chunk Text function.
Output
The Update Custom Index function outputs the custom index response JSON, storing it in the specified column of the destination dataset.
Extract from JSON Function
Input
Extraction Source: The dataset column of JSON objects to extract data from.
JSON Path: The absolute or wildcard path at which to extract the JSON value.
Output
For each row of the extraction source, the Extract from JSON function outputs the JSON value found at the given JSON path, storing it in the specified column of the destination dataset.
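To make the difference between absolute and wildcard paths concrete, here is a minimal sketch of path-based extraction. The path syntax is an assumption chosen for illustration (dot-separated keys, numeric list indexes, and `*` as the wildcard) and may differ from the syntax the function actually accepts. The `extract_path` helper and the sample response are hypothetical.

```python
import json


def extract_path(value, path):
    """Return every value found at `path` inside a parsed JSON value.

    Assumed syntax: dot-separated parts, where "*" matches every list
    element or object value and digits index into a list.
    """
    matches = [value]
    for part in path.split("."):
        next_matches = []
        for current in matches:
            if part == "*":
                if isinstance(current, list):
                    next_matches.extend(current)
                elif isinstance(current, dict):
                    next_matches.extend(current.values())
            elif isinstance(current, dict) and part in current:
                next_matches.append(current[part])
            elif (isinstance(current, list) and part.isdigit()
                  and int(part) < len(current)):
                next_matches.append(current[int(part)])
        matches = next_matches
    return matches


# Hypothetical model response, for illustration only.
response = json.loads(
    '{"output": [{"label": "cat", "score": 0.9},'
    ' {"label": "dog", "score": 0.1}]}'
)
labels = extract_path(response, "output.*.label")   # wildcard path
first_score = extract_path(response, "output.0.score")  # absolute path
```

This sketch always returns a list, so an absolute path yields at most one match and a path that does not exist yields an empty list. How the real function represents a missing path is not specified here.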
Unnest Array Function
Input
Unnesting Source: The dataset column of JSON arrays to unnest into individual values.
Output
The Unnest Array function outputs the individual values from the source arrays into the specified column of the destination dataset. Each input row of the Unnest Array function may yield multiple output rows in the destination dataset.
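The one-to-many behavior described above can be sketched as follows. This is an illustration rather than the actual implementation: rows are modeled as plain dictionaries, `unnest_rows` is a hypothetical helper, and dropping rows whose array is empty is an assumption.

```python
def unnest_rows(rows, column):
    """Expand each row into one output row per element of row[column].

    Assumption: a row whose array is empty produces no output rows.
    """
    output = []
    for row in rows:
        for item in row[column]:
            # Copy the other fields and replace the array with one element.
            output.append({**row, column: item})
    return output


# Hypothetical source rows, for illustration only.
source = [
    {"id": 1, "labels": ["cat", "dog"]},
    {"id": 2, "labels": []},
    {"id": 3, "labels": ["bird"]},
]
unnested = unnest_rows(source, "labels")
```

Here the first source row becomes two output rows, the second becomes none, and the third becomes one.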