Training Options
A guide to training options for AutoML
AutoML provides default training options that work well for most objectives. You can also customize each option to better suit your use case or just to experiment with different training configurations. See the full set of training options and their descriptions below.
Snapshot Options
Option | Description |
---|---|
Training Snapshot | Data to be used as the training dataset for the model. |
Snapshot Split | Use the existing split in the provided training snapshot, select a new split percentage, or select a separate validation snapshot. |
Validation Snapshot | Data to be used as the validation dataset for the model. |
Test Snapshots | Data to be used as the test dataset for the model. |
Max Invalid Row Percentage | Maximum percentage of invalid rows allowed in the training dataset before failing the training. Rows may be valid for some base models but not others, e.g. a row with 1000 tokens is valid for training Longformer but invalid for training DeBERTa. |
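The table's example (a 1000-token row that is valid for Longformer but not DeBERTa) can be sketched as a simple per-model length check. The token limits, function name, and 25% threshold below are illustrative, not the product's actual implementation:

```python
# Hypothetical sketch of the Max Invalid Row Percentage check: a row that
# exceeds a base model's token limit counts as invalid for that model.
MAX_TOKENS = {"longformer": 4096, "deberta": 512}  # illustrative limits

def invalid_row_percentage(row_token_counts, base_model):
    """Return the percentage of rows too long for the given base model."""
    limit = MAX_TOKENS[base_model]
    invalid = sum(1 for n in row_token_counts if n > limit)
    return 100.0 * invalid / len(row_token_counts)

rows = [300, 1000, 450, 2000]  # token counts per training row
pct = invalid_row_percentage(rows, "deberta")   # 1000 and 2000 exceed 512
too_many_invalid = pct > 25.0                   # hypothetical 25% threshold
```

With these numbers, half the rows are invalid for DeBERTa but none are for Longformer, so the same dataset can pass or fail the check depending on the base model.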
Training Options
Option | Description |
---|---|
Validation Strategy | Determines whether the model is evaluated against the validation set every epoch or some number of steps as defined by the evaluation steps option. An epoch is one iteration through the complete training dataset, while a step is one batch (subset) of the training dataset. |
Total Epochs/Steps | The total number of epochs or steps for model training, depending on the validation strategy. More epochs allow for more thorough training and increased accuracy. Fewer epochs reduce overfitting risk but may lead to under-training. |
Evaluation Steps | Evaluate the model against the validation set every N steps, e.g. N=10 indicates evaluation should be performed every 10 steps. |
Best Model Label | The label used to determine the optimal epoch based on the Best Model Metric. |
Best Model Metric | The accuracy metric used to select the best epoch during model training. |
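The epoch/step distinction above comes down to simple arithmetic; the dataset size, batch size, and evaluation interval below are made-up numbers for illustration:

```python
import math

# One epoch is a full pass over the training data; one step is one batch.
dataset_size = 10_000
batch_size = 32
steps_per_epoch = math.ceil(dataset_size / batch_size)  # 313 steps

# With Validation Strategy = steps and Evaluation Steps N = 100, the model
# is evaluated every 100 batches, i.e. 3 full evaluations per epoch.
evals_per_epoch = steps_per_epoch // 100
```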
Advanced Options
Option | Description |
---|---|
Apply Augmentations | Augmentations modify the training data during each epoch to prevent overfitting and help the model generalize to new data. This option is currently only supported for Image Classification v2. |
LoRA Rank | Rank determines how many parameters will be fine-tuned. For complex use cases, a higher rank can yield better accuracy. A lower rank can speed up training but risks underfitting. |
LoRA Alpha | Lower alpha values result in smaller updates, potentially leading to more stable but slower convergence. Higher alpha values can speed up training but risk overshooting optimal parameters. |
LoRA Dropout | The percentage of neurons to randomly drop during each training iteration to prevent overfitting. Higher values provide stronger regularization, while lower values maintain more model complexity but risk overfitting. |
Randomization Seed | Ensure training reproducibility by setting the random seed. |
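In the standard LoRA formulation, the rank directly sets how many parameters are trained: each adapted weight matrix gets two low-rank factors with `rank * (d_in + d_out)` parameters in total. The 4096-dimensional layer below is illustrative:

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one weight matrix: a
    (d_out x rank) and a (rank x d_in) factor replace the full
    (d_out x d_in) update."""
    return rank * (d_in + d_out)

full_update = 4096 * 4096                            # ~16.8M dense params
low_rank = lora_trainable_params(4096, 4096, rank=8)    # 65,536 params
high_rank = lora_trainable_params(4096, 4096, rank=64)  # 524,288 params
```

This is why raising the rank increases capacity (and training cost) while a low rank trains fast but may underfit complex tasks.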
Advanced Options - Optimizer
Option | Description |
---|---|
Algorithm | Algorithm used to reduce loss and increase efficiency during model training. Currently supported algorithms: Adadelta, Adagrad, Adam, AdamW, SparseAdam, Adamax, SGD, ASGD, NAdam, RAdam, RMSprop, RProp, Hive Optimizer v2 |
Learning Rate Scheduler | Static schedulers like step decay reduce the learning rate systematically, improving the likelihood of convergence. Dynamic schedulers adapt based on performance, balancing speed and stability. Currently supported schedulers: linear, cosine, cosine with restarts, polynomial, constant, constant with warmup, piecewise constant |
Learning Rate | Set the initial learning rate for the model. Learning rate determines how much to adjust the model’s parameter weights during each iteration. Higher rates can speed up training but might miss the optimal model configuration. Lower learning rates often yield more accurate models by making careful adjustments, but if the learning rate is too low the model may not learn at all. |
Learning Rate Decay | Reduces the learning rate over time to ensure more refined updates as training progresses, reducing the risk of overshooting and helping with convergence. |
Weight Decay | Weight decay penalizes large weights to prevent overfitting. Higher decay values regularize more strongly, improving generalization but possibly underfitting. Lower values preserve model capacity, but setting the value too low can lead to overfitting. |
Initial Accumulator | Sets the initial value of the accumulator, which keeps track of past gradient updates. A higher value can prevent overly large updates early in the training process. |
Momentum | Adds a fraction of the previous gradient to the current gradient, helping the optimizer accelerate in the right direction and smooth out updates. Momentum helps overcome small, noisy gradients. |
Momentum Decay | Controls how quickly momentum fades over time. Lower values reduce the impact of older updates, while higher values retain more momentum from past updates. |
Dampening | Reduces the effect of momentum over time, making momentum updates less aggressive. Useful for stabilizing updates in noisy or highly dynamic gradients. |
Max Iter | Specifies the maximum number of iterations for optimization. Increasing this allows the model more time to converge but may lead to overfitting if set too high. |
Max Eval | Defines the maximum number of function evaluations allowed during optimization. Higher values give more opportunities for convergence but can increase computation time. |
Tolerance Grad | Specifies the threshold for the gradient below which optimization is considered converged. Lower values require smaller updates for convergence, ensuring finer model adjustments. |
Tolerance Change | Sets the minimum change in the objective function required between iterations for convergence. Lower values can lead to early stopping if the model is not improving sufficiently. |
History Size | Determines how much of the gradient history is retained for updating the learning rate. A larger history size smooths out short-term fluctuations but may slow responsiveness. |
Step Size | Defines the interval at which the learning rate is adjusted. Larger step sizes lead to more aggressive changes, while smaller steps allow for more granular adjustments. |
Etas | Etas control the factor by which the learning rate is adjusted when the gradient changes direction. Higher values make larger adjustments to the learning rate. |
Eps | A small constant added to the denominator to avoid division by zero during gradient updates. Larger eps values can lead to smoother updates, while smaller values provide more precision. Updating Eps typically has no material impact on model training. |
Betas | Controls how much of the previous updates the optimizer remembers. The first beta controls momentum, and the second beta affects how aggressively the optimizer adapts the learning rate. |
Rho | Affects how quickly the optimizer forgets past gradient updates. Higher values prioritize older gradients, while lower values react more quickly to recent updates. |
Alpha | Alpha controls the initial learning rate or how aggressively the optimizer updates the weights. |
Lambda | Lambda controls the decay of the learning rate. Higher lambda values decay the learning rate more quickly, while lower values keep learning aggressive for longer. |
T0 | T0 is the point at which learning rate decay starts. Higher values delay decay, keeping the learning rate constant for longer periods. |
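To see how Lambda and Alpha interact, here is the decay schedule PyTorch's ASGD optimizer documents, `eta_t = lr / (1 + lambda * lr * t) ** alpha`, offered as a sketch of these knobs; AutoML's internal implementation may differ:

```python
def decayed_lr(lr, lambd, alpha, step):
    """ASGD-style learning-rate decay schedule (per PyTorch's ASGD docs)."""
    return lr / (1.0 + lambd * lr * step) ** alpha

lr = 0.01
# A higher lambda decays the learning rate faster at the same step:
slow_decay = decayed_lr(lr, lambd=1e-4, alpha=0.75, step=100_000)
fast_decay = decayed_lr(lr, lambd=1e-2, alpha=0.75, step=100_000)
assert fast_decay < slow_decay < lr
# With lambda = 0 the learning rate never decays:
assert decayed_lr(lr, lambd=0.0, alpha=0.75, step=100_000) == lr
```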
Advanced Options - Early Stopping
Option | Description |
---|---|
Label | The column used to determine whether the model should conclude training early based on Early Stopping Metric. |
Metric | The metric monitored for early stopping. Training stops if this metric has not improved within the patience window. |
Patience | Determines how long to wait for improvement before stopping training early. Longer patience avoids premature stopping but may waste resources. Shorter patience stops early but risks missing further improvements. Patience is based on the evaluation epochs/steps, e.g. if evaluation steps is 10 and patience is 5, training will stop if the model has not improved in 50 steps. |
Threshold | The minimum change in the selected metric that counts as an improvement. A smaller threshold allows the model to capture finer performance gains, while a larger threshold may result in quicker early stopping. |
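The interplay of patience and threshold can be sketched as a small tracker: only gains larger than the threshold reset the patience counter. Class and variable names here are illustrative, not the product's API:

```python
class EarlyStopper:
    """Minimal sketch of metric-based early stopping with patience
    and an improvement threshold (assumes higher metric is better)."""
    def __init__(self, patience, threshold):
        self.patience = patience
        self.threshold = threshold
        self.best = float("-inf")
        self.stale_evals = 0

    def should_stop(self, metric):
        if metric > self.best + self.threshold:  # counts as an improvement
            self.best = metric
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return self.stale_evals >= self.patience

stopper = EarlyStopper(patience=3, threshold=0.001)
history = [0.70, 0.75, 0.7505, 0.7506, 0.7507]
stops = [stopper.should_stop(m) for m in history]
# The last three gains are below the 0.001 threshold, so after three
# stale evaluations the final call signals early stopping.
```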