Hive AutoML offers training for two different types of large language models (LLMs): text and chat. Text models are geared towards generating passages of writing or lines of code, whereas chat models are built for interactions with the user, often in the format of asking questions and receiving concise, factual answers. Chat models also have the added ability to incorporate conversation history into a prompt. To read about setting up a text model, see our LLMs for Text page.
Before you start building your model, you first need to upload any datasets you'll use to the Datasets section of our AutoML platform. Click on the Datasets icon on the left menu bar to open the Datasets page. To add a new dataset, click the large blue Create New Dataset button in the upper right corner of your screen.
Your data will be uploaded as a CSV file. To format it, one column (titled prompt) should contain the text prompt and a second column (titled completion) should contain the text response to that prompt. There are also three optional columns that you can include if they are useful to your particular use case:
system_prompt gives a general instruction to the model, such as "Please answer in a formal tone", that is appended to the prompt to give the model more information about the desired response. The other two optional columns, one of which is titled roles, are both in JSON format and let you include previous parts of the conversation to give the current exchange more context.
An example of this formatting is shown below. Required columns are indicated with an *.
| prompt* | completion* | system_prompt |
|---|---|---|
| How old is Le Corbusier? | Le Corbusier, born Charles-Édouard Jeanneret, was a Swiss-French architect... | Please answer in a formal tone |

The same row also carries the earlier parts of the conversation as a JSON list ([ ... ]) whose messages contain, in order:
"content": "I'm going to Paris what should I see?",
"content": "Paris, the capital of France, is known for its stunning architecture...",
"content": "Who are some french architects?",
"content": "Here are some notable French architects throughout history: 1) Le Corbusier...",
The uploaded data file must include both a "prompt" column and a "completion" column or the training will result in an error.
In the example shown above, the text prompt (prompt) contains a trivia question and the completion field for that prompt (completion) contains the answer.
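As a rough illustration of the layout described above, the following Python sketch writes a small CSV with the two required columns and the optional system_prompt column. The helper name `build_csv` and the sample row are illustrative assumptions, not part of the AutoML platform; the optional JSON columns are left out because their exact schema is not spelled out here.

```python
import csv
import io

# Column names come from the text above; everything else is illustrative.
FIELDNAMES = ["prompt", "completion", "system_prompt"]


def build_csv(rows):
    """Serialize a list of dicts to CSV text with a header row and ',' delimiter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, delimiter=",")
    writer.writeheader()
    for row in rows:
        # Keys missing from a row are written as empty cells.
        writer.writerow(row)
    return buffer.getvalue()


sample_rows = [
    {
        "prompt": "How old is Le Corbusier?",
        "completion": "Le Corbusier, born Charles-Édouard Jeanneret, was a Swiss-French architect...",
        "system_prompt": "Please answer in a formal tone",
    },
]

csv_text = build_csv(sample_rows)
print(csv_text)
```

Using the standard `csv` module rather than joining strings by hand means that cells containing commas or quotes, such as the completion above, are quoted correctly.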
In order to be processed correctly, the data file must satisfy the following requirements:
- Dataset file must be in CSV format.
- CSV must have a header row (a row of column names above the actual data).
- CSV must use a comma (,) as the delimiter. Other delimiters are not supported.
- The header row must contain columns for both the prompt and the completion, called prompt and completion.
- Each row can contain a maximum of 2048 tokens (approximately 8,000 characters). If a row exceeds this, that row will be considered invalid and will not be used during training. If any rows exceed this limit, we will let users know which ones have done so. If more than 5% of rows in the dataset are invalid, the training will fail entirely.
- Each CSV file cannot contain more than 50,000 rows.
- Each CSV must contain at least 100 rows. At least 2,000 rows is strongly recommended for optimal model quality.
- Any empty columns in the CSV will not be used during model training.
If any of the above are not satisfied, the training will fail and return an error.
We strongly recommend that the training data includes at least 2,000 rows for optimal model quality.
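Before uploading, it can help to check a file locally against the requirements above. The sketch below is one possible pre-flight check, not the platform's own validator: the function name `check_dataset` is hypothetical, and it approximates the 2048-token limit with the 8,000-character figure quoted above because the platform's tokenizer is not specified here.

```python
import csv
import io

REQUIRED_COLUMNS = {"prompt", "completion"}
MIN_ROWS = 100
MAX_ROWS = 50_000
MAX_CHARS_PER_ROW = 8_000     # assumption: character proxy for the 2048-token limit
MAX_INVALID_FRACTION = 0.05   # more than 5% invalid rows fails the training


def check_dataset(csv_text):
    """Return (errors, warnings) for CSV text, following the checklist above."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing required columns: {sorted(missing)}"], []

    total = 0
    too_long = []  # 1-based data row numbers that exceed the length proxy
    for number, row in enumerate(reader, start=1):
        total += 1
        length = sum(len(v) for v in row.values() if isinstance(v, str))
        if length > MAX_CHARS_PER_ROW:
            too_long.append(number)

    errors, warnings = [], []
    if total < MIN_ROWS:
        errors.append(f"only {total} rows; at least {MIN_ROWS} are required")
    if total > MAX_ROWS:
        errors.append(f"{total} rows; at most {MAX_ROWS} are allowed")
    if too_long:
        # Over-long rows are skipped; they only become fatal above the 5% mark.
        warnings.append(f"rows over the length limit (skipped in training): {too_long}")
        if len(too_long) / total > MAX_INVALID_FRACTION:
            errors.append("more than 5% of rows are invalid")
    return errors, warnings
```

An empty `errors` list means the file passes these approximate checks. The real token count may differ from the character estimate, so rows close to the limit are worth shortening.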
You can choose to upload a separate test dataset or split off a random section of your training dataset to use instead. If you choose to upload a separate test dataset, this dataset must also satisfy all of the file requirements listed above. If you choose to split off a section of your training dataset, you will be able to choose the percentage of that dataset that you would like to use for testing as you create your training.
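The platform performs the random split for you, so no code is needed. To illustrate what splitting off a percentage for testing means, here is a minimal sketch; the function `split_rows` and its `seed` parameter are assumptions made for this example and do not describe how AutoML implements the split.

```python
import random


def split_rows(rows, test_percent, seed=0):
    """Shuffle rows and hold out test_percent of them as a test set."""
    if not 0 < test_percent < 100:
        raise ValueError("test_percent must be between 0 and 100 (exclusive)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    test_size = round(len(shuffled) * test_percent / 100)
    return shuffled[test_size:], shuffled[:test_size]


# Hold out 20% of a 1,000-row dataset for testing.
train_rows, test_rows = split_rows(range(1000), test_percent=20)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two sets, so the test set measures performance on examples the model did not see during training.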
To start building your model, head to our AutoML platform and select the Create New Model button in the top right corner of the screen. You will be brought to a project setup page where you will be prompted to enter a project name and description. Click the box below Model Type and select Language Generative - Chat from the menu that appears. On the right side of the screen, select your training dataset by clicking the Select Dataset button, which allows you to choose from a list of datasets that you've uploaded to the AutoML platform.
After model training is complete, viewing the page for that project will provide a few metrics in order to help you evaluate the performance of your model. At the top of the page, you can view both the loss and the token accuracy for the model.
loss measures how closely the model's response matches the response from the test data, where 0 represents a perfect prediction and a higher loss signifies that the prediction is increasingly far from the actual response sequence. If the response has 10 tokens, the model predicts each of the 10 tokens in turn, given that all previous tokens match the test response. We then display the final numerical loss value, as well as its difference from the loss of the original pre-trained model before fine-tuning.
token accuracy compares the model's response to the completion from the test data, determining how many tokens are the same between the two. This is shown as a percentage indicating the probability that a predicted token exactly matches the corresponding token in the completion. As with loss, we contextualize the final result by showing the percent change relative to the original pre-trained model.
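To make the two metrics more concrete, here is a small sketch of how such values could be computed. It assumes the loss is the average negative log-probability the model assigns to each token of the test completion, which is a common definition but is not confirmed by this page; the function names and sample values are illustrative only.

```python
import math


def mean_token_loss(reference_token_probs):
    """Average negative log-probability assigned to each reference token.

    Each probability is the model's prediction for the correct next token,
    given that all previous tokens match the reference response.
    """
    return sum(-math.log(p) for p in reference_token_probs) / len(reference_token_probs)


def token_accuracy(predicted_tokens, reference_tokens):
    """Fraction of positions where the predicted token equals the reference token."""
    matches = sum(p == r for p, r in zip(predicted_tokens, reference_tokens))
    return matches / len(reference_tokens)


def percent_change(fine_tuned, baseline):
    """Change of a metric relative to the original pre-trained model, in percent."""
    return (fine_tuned - baseline) / baseline * 100


# A perfect prediction assigns probability 1.0 to every reference token.
perfect_loss = mean_token_loss([1.0] * 10)
accuracy = token_accuracy(["Le", "Cor", "bus", "ier"], ["Le", "Cor", "bu", "ier"])
print(perfect_loss, accuracy, percent_change(0.9, 0.6))
```

Under this assumption, a loss of 0 corresponds to the model being fully confident in every correct token, and lower confidence in the correct tokens raises the loss.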
You can also evaluate your model by interacting with it in what we call the playground. Here you can submit prompts directly to your model and view its responses, allowing you to evaluate the model through experimentation. The playground is available for 15 days after model training is complete and has a limit of 500 requests. If either the time limit or the request limit is reached, you can instead deploy the model and continue to use the playground without a request limit; this usage will be charged to the organization's billing account.
After a model is deployed, any prompts submitted to the playground will be sent through the API and billed as such.
If you would like to retrain your model based on these metrics, click the "Update Model" button to the left of the "Create Deployment" button to begin the training process again.
When you’re happy with your model and ready to deploy it, select the project and click the “Create Deployment” button in the top right corner. The project’s status will shift to “Deploying.” The deployment may take a few minutes.
After the deploy status shows as “Complete,” you can view the deployment by clicking on the "Deployments" tab above the metrics. This will show a list of all deployments for this model.
To view any deployment, click its name. This will open the project on Hive Data, where you will be able to upload tasks, view tasks, and access your API key as you would with any other Hive Data project. There will also be a button to "Undeploy" your project, if you wish to deactivate it at any point. Undeploying a model is not permanent — you can redeploy the project if you later choose to.
To begin using your custom-built API, click on the “API Key” button on the top right of the Hive Data project page to copy your API Key. For instructions on how to submit a task via API, either synchronously or asynchronously, see our API Reference documentation.