Creating a dataset is the first step in building an AutoML model. A dataset holds the data that you want to use to train a model or create an embedding. On our
Datasets page, you can preview and edit this data before you use it to train or augment a model.
To create a dataset, go to the Datasets dashboard page and click the
Create New Dataset button in the top right.
To begin, you'll be asked to choose a dataset file to upload. We support the following file types:
|Structured Data (For Model Trainings)
|Unstructured Data (For Embeddings)
|JSON (.jsonl, .json)
Structured data formats are used for datasets that are intended for model training, while unstructured data formats can only be used to create embeddings. To read more about embeddings, please see our Embeddings page.
After uploading a structured data file, you will be prompted to define the column mapping. This page lets you update the column names and column types of your uploaded files. The column type outlines how each column should be interpreted. For example, if a column contains a URL to an image, you should use the
Image column type. If it contains raw text data, you should use the
Text column type. This step is only applicable to structured data files — unstructured data files, such as those used to create embeddings, have no mapping.
The complete list of column types with descriptions can be found below:
|Represents a string of UTF-8 characters. This column type can be used for text data for classification, class categories, and large language model prompts and completions.
|Represents a URL that points to a publicly accessible image in jpeg, webp, gif, or png format. This column type is used to create image datasets to train custom image classifier and vision moderation models.
|Represents a string of structured information in JSON format. This column type is used to train custom large language models using custom roles and conversation history.
|Select this column type if you would like to remove the column from the AutoML dataset.
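As an illustration of a structured file whose columns could be mapped to the Text column type, here is a minimal sketch that writes a small UTF-8 encoded .jsonl file. The column names `text` and `label`, the sample rows, and the output file name are hypothetical and chosen only for this example; they are not names required by the platform.

```python
import json

# Hypothetical rows: "text" and "label" are illustrative column names.
rows = [
    {"text": "The package arrived on time.", "label": "positive"},
    {"text": "The item was damaged.", "label": "negative"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII characters as real UTF-8 text.
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_jsonl(rows, "example_dataset.jsonl")
```

After upload, both columns in this sketch would be candidates for the Text column type, since one holds raw text and the other holds class categories.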
After creating a dataset, you can go to the dataset detail page to preview and edit it. Click the label of any piece of data to edit it, or select the trash icon to delete it. To add new data to the dataset, click the plus icon at the top left of the data preview gallery.
All datasets must be:
- One of the file types listed above.
- UTF-8 encoded.
- Composed of up to 20 files.
- Made up of files that are each 512MB or smaller.
- Smaller than 5GB in total.
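These limits can be checked locally before uploading. The following is a minimal sketch, not part of the platform: the function name `check_dataset_files` is hypothetical, the caller supplies the set of allowed extensions, and the code assumes that 512MB and 5GB are binary units, which this page does not specify.

```python
from pathlib import Path

MAX_FILES = 20
MAX_FILE_BYTES = 512 * 1024 * 1024   # assumes 512MB means 512 MiB
MAX_TOTAL_BYTES = 5 * 1024 ** 3      # assumes 5GB means 5 GiB


def check_dataset_files(paths, allowed_extensions):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    paths = [Path(p) for p in paths]

    if len(paths) > MAX_FILES:
        problems.append(f"{len(paths)} files; at most {MAX_FILES} allowed")

    total = 0
    for path in paths:
        if path.suffix.lower() not in allowed_extensions:
            problems.append(f"{path.name}: unsupported file type")
        size = path.stat().st_size
        total += size
        if size > MAX_FILE_BYTES:
            problems.append(f"{path.name}: larger than 512MB")
        try:
            # Decode in chunks so large files are not loaded into memory at once.
            with path.open("r", encoding="utf-8") as handle:
                while handle.read(1024 * 1024):
                    pass
        except UnicodeDecodeError:
            problems.append(f"{path.name}: not valid UTF-8")

    if total >= MAX_TOTAL_BYTES:
        problems.append("dataset is not smaller than 5GB in total")
    return problems
```

For example, `check_dataset_files(["train.jsonl"], {".jsonl", ".json"})` returns an empty list when the hypothetical file `train.jsonl` passes every check.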
For jsonl files, a dataset must:
- Have fewer than 20 columns.
- Have fewer than 100k rows.
- Contain no value longer than 20k characters.
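The row, column, and value limits above can be sketched as a local pre-upload check. This is an illustration under stated assumptions, not the platform's own validation: the function name `check_jsonl` is hypothetical, each line is assumed to be a JSON object whose keys are the columns, and non-string values are measured by the length of their JSON text.

```python
import json

MAX_COLUMNS = 20          # a dataset must have fewer than 20 columns
MAX_ROWS = 100_000        # a dataset must have fewer than 100k rows
MAX_VALUE_CHARS = 20_000  # no value may be longer than 20k characters


def check_jsonl(path):
    """Return a list of problems found in a JSON Lines file."""
    problems = []
    columns = set()
    rows = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_number}: not valid JSON")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: expected a JSON object")
                continue
            rows += 1
            columns.update(record)
            for key, value in record.items():
                # Assumption: non-string values are measured by their JSON text.
                text = value if isinstance(value, str) else json.dumps(value)
                if len(text) > MAX_VALUE_CHARS:
                    problems.append(f"line {line_number}: value in '{key}' is too long")
    if len(columns) >= MAX_COLUMNS:
        problems.append(f"{len(columns)} columns; must be fewer than {MAX_COLUMNS}")
    if rows >= MAX_ROWS:
        problems.append(f"{rows} rows; must be fewer than {MAX_ROWS}")
    return problems
```

The check streams the file line by line, so it does not need to hold the whole dataset in memory.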
For image columns, each image must be less than 50MB. The supported image types are those accepted by the Image column type: jpeg, webp, gif, and png.
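One way to check an image before adding its URL to a dataset is to inspect the downloaded bytes. The sketch below is an assumption-laden illustration, not the platform's validation: the function names are hypothetical, the format is detected from the standard leading bytes of each file type (the four formats named for the Image column type), and 50MB is assumed to mean 50 MiB.

```python
MAX_IMAGE_BYTES = 50 * 1024 * 1024  # assumes 50MB means 50 MiB


def sniff_image_format(data):
    """Return 'jpeg', 'png', 'gif', 'webp', or None based on leading bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def check_image(data):
    """Return a list of problems for one image's raw bytes."""
    problems = []
    if sniff_image_format(data) is None:
        problems.append("unsupported image format")
    if len(data) >= MAX_IMAGE_BYTES:
        problems.append("image is not smaller than 50MB")
    return problems
```

Checking the bytes rather than the URL's file extension avoids false results for URLs that have no extension or a misleading one.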
If you want to try out our AutoML platform but don't have your own dataset yet, you can use an example dataset as a starting point. You can find example datasets for each model type by selecting
Example at the top right corner of the dataset page.