Datasets

Learn how to upload and edit datasets

Overview

Datasets are the foundational building block for AutoML models. A dataset is the collection of information that will be used for training a custom machine learning model. The AutoML Datasets page offers several convenient features for creating and managing datasets.

Create a Dataset

Create a dataset by navigating to the Datasets page and clicking the Create New Dataset button.

The `Create New Dataset` button lies at the top right of the `Datasets` page.

The Create New Dataset button lies at the top right of the Datasets page.

File Upload

The first step of dataset creation is to upload the file(s) containing the relevant data. AutoML supports various structured and unstructured data sources.

Structured data is typically used for creating snapshots and training models. Unstructured data cannot be used for model training but can be used as dataset function input.

View the full list of supported file types in the Dataset Requirements section.

Column Mapping

After uploading your data, if you have provided structured data you will be prompted to confirm the column mapping. In the column mapping step, you can rename columns, ignore unneeded columns, and update the data types of each column.

The column types define how data should be interpreted to display a dataset or train a model. For example, a column containing image URLs should use the Image type, while a column containing raw text data should use the Text type.

Note: column mapping is not applicable for unstructured data—if you upload only unstructured data you will not be prompted to complete column mapping.

View the complete list of column types below.

Column TypeDescription
TextRepresents a string of UTF-8 characters. This column type can be used for text to classify, ground truth class labels, and large language model prompts and completions.
ImageRepresents a URL that points to a publicly accessible image in jpg/jpeg, png, webp, or gif format. This column type can be used as image inputs to image classification or visual moderation models.
JSONRepresents a string of structured data in JSON format.
IntegerRepresents a whole number (no decimal). This column type can be used as ground truth class labels.
FloatRepresents a decimal number. This column type can be used as ground truth class labels.
BooleanRepresents either true or false. This column type can be used as ground truth class labels.
The column type can be selected via a dropdown menu right above the column's name.

The column type can be selected via a dropdown menu to the right of the column's name.

Preview & Edit a Dataset

After creating a dataset, you can view and edit the data in the dataset’s Live View tab. In the live view, you can update or delete existing rows and add new rows or columns.

The dataset preview page for an image dataset.

The dataset preview page for a text dataset.

Dataset Requirements

Below are the supported file types for structured and unstructured data.

Structured DataUnstructured Data
CSV (.csv)Text (.txt)
TSV (.tsv)Markdown (.md)
JSONL (.jsonl)LaTeX (.tex)
Word (.docx)
PDF (.pdf)
RTF (.rtf)

All datasets must also meet the following requirements.

  1. UTF-8 encoded
  2. Composed of 1-20 files
  3. Max size of 512MB per file
  4. Max total file size of 5GB

Structured data must also meet the below requirements.

  1. 1-20 columns
  2. 1-100,000 rows
  3. Max 20,000 characters per cell value

Finally, image columns must follow additional requirements.

  1. Max size of 50MB per image
  2. Image is supported type (.jpg, .jpeg, .png, .webp, .gif)

What’s Next