Hive AutoML offers training for two different types of large language models (LLMs): text and chat. Text models are geared towards generating passages of writing or lines of code, whereas chat models are built for interactions with the user, often in the format of asking questions and receiving concise, factual answers. Chat models also have the added ability to incorporate conversation history into a prompt. To read about setting up a chat model, see our LLMs for Chat page.
The full process of creating an LLM (Text) model can also be viewed as a video tutorial below:
Before you start building your model, you first need to upload any datasets you'll use to the Datasets section of our AutoML platform. Click on the Datasets icon on the left menu bar to open the Datasets page. To add a new dataset, click the large blue Create New Dataset button in the upper right corner of your screen.
Your data will be uploaded as a CSV file. To format it, one column (titled prompt) should contain the text prompt and a second column (titled completion) should contain the text response to that prompt. An example of this formatting is shown below:
| prompt | completion |
| --- | --- |
| Question: In 19th Century Florence, it was illegal for women to wear what? | Buttons |
| Question: What animal cannot stick out its tongue? | Crocodile |
| Question: What color is an airplane's black box? | Bright orange |
| Question: How long did the 100 years war last? | 116 years |
| Question: What was the drink Bloody Mary originally called? | Bucket of Blood |
The uploaded data file must include both a "prompt" column and a "completion" column or the training will result in an error.
In the example shown above, the prompt column contains a trivia question and the completion column contains the answer to that question.
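A dataset in this format can be produced with Python's standard csv module. The file name and the example rows below are illustrative; only the two column headers, prompt and completion, are required by AutoML.

```python
import csv

# Illustrative trivia-style rows; replace with your own data.
rows = [
    ("Question: What animal cannot stick out its tongue?", "Crocodile"),
    ("Question: How long did the 100 years war last?", "116 years"),
]

with open("trivia_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # the csv module uses a comma delimiter by default
    writer.writerow(["prompt", "completion"])  # required header row
    writer.writerows(rows)
```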
In order to be processed correctly, the data file must satisfy the following requirements:
- Dataset file must be in CSV format.
- CSV must have a header row (a row of column names above the actual data).
- CSV must use a comma (,) as the delimiter. Other delimiters, such as semicolons or tabs, are not supported.
- The header row must contain columns for both the prompt and the completion, named prompt and completion respectively.
- Each row can contain a maximum of 2048 tokens (approximately 8,000 characters). If a row exceeds this limit, it will be considered invalid and will not be used during training; we will let you know which rows were excluded. If more than 5% of rows in the dataset are invalid, the training will fail entirely.
- Each CSV file cannot contain more than 50,000 rows.
- Each CSV must contain at least 100 rows. At least 2,000 rows is strongly recommended for optimal model quality.
- Any empty columns in the CSV will not be used during model training.
If any of the above are not satisfied, the training will fail and return an error.
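You can catch most of these problems before uploading with a quick local check. The sketch below is not Hive's actual validator: it approximates the 2,048-token limit with the stated ~8,000-character equivalent, since the exact tokenizer is not specified here.

```python
import csv

MAX_ROWS = 50_000
MIN_ROWS = 100
MAX_CHARS = 8_000          # rough stand-in for the 2,048-token per-row limit
MAX_INVALID_FRACTION = 0.05

def validate_dataset(path):
    """Return (valid_rows, invalid_row_numbers); raise ValueError on fatal problems."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # expects a header row
        if reader.fieldnames is None or not {"prompt", "completion"} <= set(reader.fieldnames):
            raise ValueError("CSV must have 'prompt' and 'completion' header columns")
        valid, invalid = [], []
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            text = (row.get("prompt") or "") + (row.get("completion") or "")
            if len(text) > MAX_CHARS:
                invalid.append(line_no)   # over the per-row size limit
            else:
                valid.append(row)
    total = len(valid) + len(invalid)
    if not MIN_ROWS <= total <= MAX_ROWS:
        raise ValueError(f"dataset must have {MIN_ROWS}-{MAX_ROWS} rows, got {total}")
    if len(invalid) / total > MAX_INVALID_FRACTION:
        raise ValueError("more than 5% of rows exceed the per-row size limit")
    return valid, invalid
```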
You can choose to upload a separate test dataset or split off a random section of your training dataset to use instead. If you choose to upload a separate test dataset, this dataset must also satisfy all of the file requirements listed above. If you choose to split off a section of your training dataset, you will be able to choose the percentage of that dataset that you would like to use for testing as you create your training.
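If you plan to upload a separate test dataset rather than letting the platform split one off, you can reproduce the same idea locally. This is a minimal sketch of a random percentage split, not the platform's own splitting logic; the fixed seed is only there to make the split reproducible.

```python
import random

def split_dataset(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into (train, test); test_fraction mirrors the
    percentage you would choose when creating the training."""
    rows = list(rows)
    rng = random.Random(seed)   # fixed seed => reproducible split
    rng.shuffle(rows)
    cut = int(len(rows) * test_fraction)
    return rows[cut:], rows[:cut]
```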
To start building your model, head to our AutoML platform and select the Create New Model button in the top right corner of the screen. You will be brought to a project setup page where you will be prompted to enter a project name and description. Click the box below Model Type and select Language Generative - Text from the menu that appears. On the right side of the screen, add your training dataset by clicking the Select Dataset button, which allows you to choose from a list of datasets that you've already uploaded to the AutoML platform. After adding your training dataset, you can either select a test dataset or choose to split off a random section of your training dataset to use instead.
After model training is complete, viewing the page for that project will provide a few metrics in order to help you evaluate the performance of your model. At the top of the page, you can view both the loss and the token accuracy for the model.
Loss measures how closely the model's response matches the reference response from the test data: a loss of 0 represents a perfect prediction, and a higher loss means the prediction is further from the actual response sequence. For example, if the reference response has 10 tokens, the model predicts each of the 10 tokens given that all previous tokens match the reference. We then display the final numerical loss value, as well as the difference from the loss of the original pre-trained model before fine-tuning.
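A per-token loss of this kind is commonly computed as the average negative log-likelihood the model assigns to each correct token. This is a sketch of that standard formulation, not necessarily Hive's exact implementation:

```python
import math

def sequence_loss(token_probs):
    """Average negative log-likelihood of the correct tokens, where each
    probability is the model's prediction given all previous reference
    tokens. 0.0 means every token was predicted with probability 1."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Probabilities a model might assign to each correct token of a 5-token response.
perfect = sequence_loss([1.0, 1.0, 1.0, 1.0, 1.0])        # exact match -> 0.0
uncertain = sequence_loss([0.9, 0.8, 0.95, 0.7, 0.85])    # imperfect -> loss > 0
```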
You can also evaluate your model by interacting with it in what we call the playground. Here you can submit prompts directly to your model and view its responses, allowing you to evaluate the model through experimentation. The playground is available for 15 days after model training is complete, with a limit of 500 requests. If either the time or the request limit is reached, you can instead deploy the model and continue to use the playground without limits; this usage will be charged to the organization's billing account.
After a model is deployed, any prompts submitted to the playground will be sent through the API and billed as such.
If you would like to retrain your model based on these metrics, click the "Update Model" button to the left of the "Create Deployment" button to begin the training process again.
When you’re happy with your model and ready to deploy it, select the project and click the “Create Deployment” button in the top right corner. The project’s status will shift to “Deploying.” The deployment may take a few minutes.
After the deploy status shows as “Complete,” you can view the deployment by clicking on the "Deployments" tab above the metrics. This will show a list of all deployments for this model.
To view any deployment, click its name. This will open the project on Hive Data, where you will be able to upload tasks, view tasks, and access your API key as you would with any other Hive Data project. There will also be a button to "Undeploy" your project, if you wish to deactivate it at any point. Undeploying a model is not permanent — you can redeploy the project if you later choose to.
To begin using your custom-built API, click on the “API Key” button on the top right of the Hive Data project page to copy your API Key. For instructions on how to submit a task via API, either synchronously or asynchronously, see our API Reference documentation.
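In broad strokes, a synchronous task submission is an authenticated JSON POST. The endpoint URL, payload field, and token header scheme below are placeholders and assumptions, not Hive's documented API; consult the API Reference for the real endpoint and request shape.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # copied via the "API Key" button on the project page
ENDPOINT = "https://example.com/your-task-endpoint"  # placeholder: see the API Reference

# Hypothetical payload shape for a text prompt.
payload = {"prompt": "Question: What color is an airplane's black box?"}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Token {API_KEY}",  # assumed auth scheme; check the docs
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would submit the task synchronously;
# left unsent here because the endpoint above is a placeholder.
```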