Datasets

Learn how to upload and edit datasets

Overview

Datasets are the first step in building an AutoML model. A dataset represents the information that you would like to use to train a model or create an embedding. On the Datasets page, you can preview and edit this information before you use it to train or augment a model.

Create a Dataset

To create a dataset, go to the Datasets dashboard page and click the Create New Dataset button in the top right.

The `Create New Dataset` button lies at the top right of the `Datasets` page.


Upload a File

To begin, you'll be asked to choose a dataset file to upload. We support the following file types:

| Structured Data (For Model Training) | Unstructured Data (For Embeddings) |
| --- | --- |
| CSV (.csv) | TXT (.txt) |
| TSV (.tsv) | Markdown (.md) |
| JSON (.jsonl, .json) | LaTeX (.tex) |
| | DOCX (.docx) |
| | PDF (.pdf) |
| | RTF (.rtf) |

Structured data formats are used for datasets that are intended for model training, while unstructured data formats can only be used to create embeddings. To read more about embeddings, please see our Embeddings page.
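
If you prepare files in a script before uploading them, a quick check like the one below can tell you which kind of dataset a file would produce. This is a minimal sketch that only looks at the file extension; the function name `dataset_kind` is hypothetical and is not part of the platform.

```python
from pathlib import Path

# Extensions copied from the supported file types listed above.
STRUCTURED_EXTENSIONS = {".csv", ".tsv", ".json", ".jsonl"}
UNSTRUCTURED_EXTENSIONS = {".txt", ".md", ".tex", ".docx", ".pdf", ".rtf"}


def dataset_kind(filename: str) -> str:
    """Return "structured", "unstructured", or "unsupported" for a file name."""
    suffix = Path(filename).suffix.lower()  # case-insensitive comparison
    if suffix in STRUCTURED_EXTENSIONS:
        return "structured"      # usable for model training
    if suffix in UNSTRUCTURED_EXTENSIONS:
        return "unstructured"    # usable only for embeddings
    return "unsupported"
```

For example, `dataset_kind("reviews.csv")` returns `"structured"`, while a `.pdf` file is reported as `"unstructured"`.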

Dataset Column Mapping

After uploading a structured data file, you will be prompted to define the column mapping. This page lets you update the column names and column types of your uploaded file. The column type specifies how each column should be interpreted. For example, if a column contains a URL to an image, use the Image column type; if it contains raw text data, use the Text column type. This step applies only to structured data files. Unstructured data files, such as those used to create embeddings, have no column mapping.

The complete list of column types with descriptions can be found below:

| Column Type | Description |
| --- | --- |
| Text | Represents a string of UTF-8 characters. This column type can be used for text data for classification, class categories, and large language model prompts and completions. |
| Image | Represents a URL that points to a publicly accessible image in jpeg, webp, gif, or png format. This column type is used to create image datasets to train custom image classifier and vision moderation models. |
| JSON | Represents a string of structured information in JSON format. This column type is used to train custom large language models using custom roles and conversation history. |
| Ignore Column | Select this column type if you would like to remove the column from the AutoML dataset. |

The column type can be selected via a dropdown menu right above the column's name.
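
If you are unsure which type fits a column, you can inspect a few sample values before uploading. The sketch below is a rough heuristic based on the descriptions above, not the logic the platform uses; `suggest_column_type` and its helpers are hypothetical names.

```python
import json
from urllib.parse import urlparse

# Image formats named in the column type description and requirements.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def _looks_like_image_url(value: str) -> bool:
    """True for http(s) URLs whose path ends in a known image extension."""
    try:
        parsed = urlparse(value)
    except ValueError:  # raised for some malformed URLs
        return False
    return (parsed.scheme in ("http", "https")
            and parsed.path.lower().endswith(IMAGE_EXTENSIONS))


def _is_json_container(value: str) -> bool:
    """True if the value parses as a JSON object or array."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))


def suggest_column_type(values: list[str]) -> str:
    """Guess "Image", "JSON", or "Text" from a column's sample values."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return "Text"  # nothing to inspect, fall back to plain text
    if all(_looks_like_image_url(v) for v in samples):
        return "Image"
    if all(_is_json_container(v) for v in samples):
        return "JSON"
    return "Text"
```

The guess is only a starting point; you still confirm or change the type in the dropdown, and you can always choose Ignore Column for columns you do not need.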


Preview & Edit a Dataset

After creating a dataset, you can go to the dataset detail page to preview & edit the dataset. You can click the label of any piece of data to edit it. You can also delete any piece of data by selecting the trash icon. If you'd like to add new data to the dataset, click the plus icon that can be found at the top left of the data preview gallery.

The dataset preview page for an image dataset.


Dataset Requirements

All datasets must:

  1. Use one of the file types listed above.
  2. Be UTF-8 encoded.
  3. Consist of at most 20 files.
  4. Contain only files that are 512MB or smaller.
  5. Be smaller than 5GB in total.
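
The file-level limits above can be checked locally before you upload. The following is a minimal sketch under stated assumptions: it treats 1MB as 1024 * 1024 bytes and 1GB as 1024 ** 3 bytes (the platform may count differently), and the names `check_file_limits` and `is_utf8` are hypothetical.

```python
MAX_FILES = 20
MAX_FILE_BYTES = 512 * 1024 * 1024   # assumption: 1MB = 1024 * 1024 bytes
MAX_TOTAL_BYTES = 5 * 1024 ** 3      # assumption: 1GB = 1024 ** 3 bytes


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_file_limits(sizes_by_name: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means no limit was exceeded."""
    problems = []
    if len(sizes_by_name) > MAX_FILES:
        problems.append(f"{len(sizes_by_name)} files; at most {MAX_FILES} allowed")
    for name, size in sizes_by_name.items():
        if size > MAX_FILE_BYTES:        # 512MB or smaller is allowed
            problems.append(f"{name} is larger than 512MB")
    if sum(sizes_by_name.values()) >= MAX_TOTAL_BYTES:  # must be strictly smaller
        problems.append("total size must be smaller than 5GB")
    return problems
```

To build the input from files on disk, you could use something like `{p.name: p.stat().st_size for p in paths}` where `paths` is a list of `pathlib.Path` objects.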

For csv, tsv, and jsonl files, a dataset must:

  1. Have fewer than 20 columns.
  2. Have fewer than 100k rows.
  3. Have no value longer than 20k characters.
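
For csv and tsv files, the limits above can be checked with Python's standard `csv` module. This sketch assumes that the first line is a header that does not count toward the row limit, and that "100k" and "20k" mean 100,000 and 20,000; `check_table_limits` is a hypothetical name. Very large fields (beyond the `csv` module's own field size limit) are reported as a parse problem.

```python
import csv
import io

MAX_COLUMNS = 20          # a dataset must have fewer than this many columns
MAX_ROWS = 100_000        # assumption: "100k" means 100,000
MAX_VALUE_CHARS = 20_000  # assumption: "20k" means 20,000


def check_table_limits(text: str, delimiter: str = ",") -> list[str]:
    """Return a list of problems found in CSV/TSV text; empty means none."""
    problems = []
    rows = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    try:
        header = next(rows, None)
        if header is None:
            return ["file has no header row"]
        if len(header) >= MAX_COLUMNS:
            problems.append(f"{len(header)} columns; must be fewer than {MAX_COLUMNS}")
        row_count = 0
        long_values = 0
        for row in rows:  # data rows only, the header was consumed above
            row_count += 1
            long_values += sum(len(value) > MAX_VALUE_CHARS for value in row)
    except csv.Error as exc:
        return problems + [f"could not parse file: {exc}"]
    if row_count >= MAX_ROWS:
        problems.append(f"{row_count} rows; must be fewer than {MAX_ROWS}")
    if long_values:
        problems.append(f"{long_values} value(s) longer than {MAX_VALUE_CHARS} characters")
    return problems
```

Pass `delimiter="\t"` for tsv files. A jsonl file would need a different reader, for example one `json.loads` call per line.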

For image columns, each image must be smaller than 50MB, and the following image types are supported:

  • jpg
  • jpeg
  • png
  • gif
  • webp
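
The image requirements can be sketched in the same way. The example below judges the image type from the URL's file extension, which is an assumption (the platform may inspect the image content instead), and it treats 1MB as 1024 * 1024 bytes. The name `check_image` is hypothetical, and the caller is expected to supply the image size in bytes.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED_IMAGE_TYPES = {"jpg", "jpeg", "png", "gif", "webp"}
MAX_IMAGE_BYTES = 50 * 1024 * 1024  # assumption: 1MB = 1024 * 1024 bytes


def check_image(url: str, size_bytes: int) -> list[str]:
    """Return a list of problems for one image; empty means none found."""
    problems = []
    # Take the extension from the URL path so query strings are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    if suffix not in SUPPORTED_IMAGE_TYPES:
        problems.append(f"unsupported image type: {suffix or 'none'}")
    if size_bytes >= MAX_IMAGE_BYTES:  # must be strictly smaller than 50MB
        problems.append("image must be smaller than 50MB")
    return problems
```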

Example Datasets

If you want to try out our AutoML platform but don't have your own dataset yet, you can use an example dataset as a starting point. You can find example datasets for each model type by selecting Example at the top right corner of the Datasets page.

To view example datasets, select `Example` at the top right of the `Datasets` page.


