Training Options

A guide to training options for AutoML

AutoML provides default training options that work well for most objectives. You can also customize each option to better suit your use case or just to experiment with different training configurations. See the full set of training options and their descriptions below.

Snapshot Options

| Option | Description |
| --- | --- |
| Training Snapshot | Data to be used as the training dataset for the model. |
| Snapshot Split | Use the existing split in the provided training snapshot, select a new split percentage, or select a separate validation snapshot. |
| Validation Snapshot | Data to be used as the validation dataset for the model. |
| Test Snapshots | Data to be used as the test dataset for the model. |
| Max Invalid Row Percentage | Maximum percentage of invalid rows allowed in the training dataset before training fails. Rows may be valid for some base models but not others, e.g. a row with 1000 tokens is valid for training Longformer but invalid for training DeBERTa (see the sketch after this table). |
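
For illustration, the token-length validity check behind Max Invalid Row Percentage could look like the sketch below. The tokenizer, the 512-token limit, and the 1% threshold are assumptions chosen for the example, not AutoML internals.

```python
# Sketch: flag rows whose token count exceeds a base model's limit, then fail
# the run if the invalid fraction passes the configured maximum. The tokenizer,
# the 512-token limit, and the 1% threshold are illustrative assumptions.
from transformers import AutoTokenizer

MAX_INVALID_ROW_PCT = 1.0   # assumed value for "Max Invalid Row Percentage"
MODEL_MAX_TOKENS = 512      # e.g. a DeBERTa-sized context window

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

def invalid_row_percentage(texts: list[str]) -> float:
    """Percentage of rows whose token count exceeds the base model's limit."""
    invalid = sum(
        1 for t in texts if len(tokenizer(t)["input_ids"]) > MODEL_MAX_TOKENS
    )
    return 100.0 * invalid / max(len(texts), 1)

rows = ["a short example row", "another short row"]
if invalid_row_percentage(rows) > MAX_INVALID_ROW_PCT:
    raise ValueError("Too many invalid rows for the selected base model")
```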

Training Options

| Option | Description |
| --- | --- |
| Validation Strategy | Determines whether the model is evaluated against the validation set every epoch or every N steps as defined by the Evaluation Steps option. An epoch is one iteration through the complete training dataset, while a step is one batch (subset) of the training dataset (see the sketch after this table). |
| Total Epochs/Steps | The total number of epochs or steps for model training. More epochs allow for more thorough training and increased accuracy. Fewer epochs reduce overfitting risk but may lead to under-training. |
| Evaluation Steps | Evaluate the model against the validation set every N steps, e.g. N=10 indicates evaluation should be performed every 10 steps. |
| Best Model Label | The label used to determine the optimal epoch based on the Best Model Metric. |
| Best Model Metric | The accuracy metric used to select the best epoch during model training. |
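
To make the epoch/step distinction concrete, here is a minimal PyTorch sketch of a loop that evaluates every N steps as well as at each epoch boundary. The toy model, data, and `evaluate` function are placeholders for the example; this is not AutoML's implementation.

```python
# Sketch: evaluate every N optimizer steps (step-based strategy) and once per
# epoch (epoch-based strategy). Model and data are toy stand-ins.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

EVAL_STEPS = 10    # "Evaluation Steps" option
TOTAL_EPOCHS = 3   # "Total Epochs/Steps" option

inputs, labels = torch.randn(320, 16), torch.randint(0, 2, (320,))
train_loader = DataLoader(TensorDataset(inputs, labels), batch_size=32)
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def evaluate(checkpoint: str) -> None:
    # Placeholder for scoring the model against a validation snapshot.
    print(f"evaluating after {checkpoint}")

global_step = 0
for epoch in range(TOTAL_EPOCHS):          # an epoch is one full pass over the data
    for x, y in train_loader:              # a step is one batch of the data
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        global_step += 1
        if global_step % EVAL_STEPS == 0:  # step-based validation strategy
            evaluate(f"step {global_step}")
    evaluate(f"epoch {epoch}")             # epoch-based validation strategy
```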

Advanced Options

| Option | Description |
| --- | --- |
| Apply Augmentations | Augmentations modify the training data during each epoch to prevent overfitting and help the model generalize to new data. This option is currently only supported for Image Classification v2. |
| LoRA Rank | Rank determines how many parameters will be fine-tuned. For complex use cases, a higher rank can yield better accuracy. A lower rank can speed up training but risks underfitting (see the sketch after this table). |
| LoRA Alpha | Lower alpha values result in smaller updates, potentially leading to more stable but slower convergence. Higher alpha values can speed up training but risk overshooting optimal parameters. |
| LoRA Dropout | The percentage of neurons to randomly drop during each training iteration to prevent overfitting. Higher values provide stronger regularization, while lower values maintain more model complexity but risk overfitting. |
| Randomization Seed | Ensure training reproducibility by setting the random seed. |
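
As one way to picture the LoRA options, the sketch below expresses rank, alpha, dropout, and the randomization seed using the Hugging Face peft library. The base model, target modules, and values are assumptions for the example; AutoML's internal implementation may differ.

```python
# Sketch: LoRA rank, alpha, and dropout expressed with the Hugging Face `peft`
# library. The base model, target modules, and values are illustrative
# assumptions, not AutoML defaults.
import torch
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

torch.manual_seed(42)  # "Randomization Seed": fix the seed for reproducibility

base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

lora_config = LoraConfig(
    r=8,               # "LoRA Rank": how many parameters are fine-tuned
    lora_alpha=16,     # "LoRA Alpha": scales the size of each LoRA update
    lora_dropout=0.1,  # "LoRA Dropout": fraction of neurons dropped per iteration
    target_modules=["query", "value"],  # assumed attention projections to adapt
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```

With a rank of 8 on a BERT-sized model, only a small fraction of the weights end up trainable, which is why lower ranks train faster but can underfit more complex tasks.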

Advanced Options - Optimizer

| Option | Description |
| --- | --- |
| Algorithm | Algorithm used to reduce loss and increase efficiency during model training. Currently supported algorithms: Adadelta, Adagrad, Adam, AdamW, SparseAdam, Adamax, SGD, ASGD, NAdam, RAdam, RMSprop, RProp, Hive Optimizer v2. |
| Learning Rate Scheduler | Static schedulers like step decay reduce the learning rate systematically, improving the likelihood of convergence. Dynamic schedulers adapt based on performance, balancing speed and stability. Currently supported schedulers: linear, cosine, cosine with restarts, polynomial, constant, constant with warmup, piecewise constant. |
| Learning Rate | Sets the initial learning rate for the model. Learning rate determines how much to adjust the model's parameter weights during each iteration. Higher rates can speed up training but might miss the optimal model configuration. Lower learning rates often yield more accurate models by making careful adjustments, but if the learning rate is too low the model may not learn at all (see the sketch after this table). |
| Learning Rate Decay | Reduces the learning rate over time to ensure more refined updates as training progresses, reducing the risk of overshooting and helping with convergence. |
| Weight Decay | Weight decay penalizes large weights to prevent overfitting. Higher decay values regularize more strongly, improving generalization but possibly underfitting. Lower values reduce the risk of underfitting, but too little decay can lead to overfitting. |
| Initial Accumulator | Sets the initial value of the accumulator, which keeps track of past gradient updates. A higher value can prevent overly large updates early in the training process. |
| Momentum | Adds a fraction of the previous gradient to the current gradient, helping the optimizer accelerate in the right direction and smooth out updates. Momentum helps overcome small, noisy gradients. |
| Momentum Decay | Controls how quickly momentum fades over time. Lower values reduce the impact of older updates, while higher values retain more momentum from past updates. |
| Dampening | Reduces the effect of momentum over time, making momentum updates less aggressive. Useful for stabilizing updates in noisy or highly dynamic gradients. |
| Max Iter | Specifies the maximum number of iterations for optimization. Increasing this allows the model more time to converge but may lead to overfitting if set too high. |
| Max Eval | Defines the maximum number of function evaluations allowed during optimization. Higher values give more opportunities for convergence but can increase computation time. |
| Tolerance Grad | Specifies the threshold for the gradient below which optimization is considered converged. Lower values require smaller updates for convergence, ensuring finer model adjustments. |
| Tolerance Change | Sets the minimum change in the objective function required between iterations for convergence. Lower values can lead to early stopping if the model is not improving sufficiently. |
| History Size | Determines how much of the gradient history is retained for updating the learning rate. A larger history size smooths out short-term fluctuations but may slow responsiveness. |
| Step Size | Defines the interval at which the learning rate is adjusted. Larger step sizes lead to more aggressive changes, while smaller steps allow for more granular adjustments. |
| Etas | Etas control the factor by which the learning rate is adjusted when the gradient changes direction. Higher values make larger adjustments to the learning rate. |
| Eps | A small constant added to the denominator to avoid division by zero during gradient updates. Larger eps values can lead to smoother updates, while smaller values provide more precision. Updating eps typically does not have a material impact on model training. |
| Betas | Controls how much of the previous updates the optimizer remembers. The first beta controls momentum, and the second beta affects how aggressively the optimizer adapts the learning rate. |
| Rho | Affects how quickly the optimizer forgets past gradient updates. Higher values prioritize older gradients, while lower values react more quickly to recent updates. |
| Alpha | Alpha controls the initial learning rate or how aggressively the optimizer updates the weights. |
| Lambda | Lambda controls the decay of the learning rate. Lower lambda values slow learning down more quickly, while higher values keep updates more aggressive. |
| T0 | T0 is the point at which learning rate decay starts. Higher values delay decay, keeping the learning rate constant for longer periods. |
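
Many of these options correspond directly to arguments of standard optimizer implementations. The sketch below uses plain PyTorch (AdamW plus a step-decay scheduler) to show where learning rate, weight decay, betas, eps, and step size plug in; the values are illustrative, not AutoML defaults, and the Hive Optimizer v2 is not shown.

```python
# Sketch: how several optimizer options map onto plain PyTorch arguments.
# Values are illustrative, not AutoML defaults; Hive Optimizer v2 is not shown.
import torch
from torch import nn

model = nn.Linear(16, 2)  # stand-in model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,             # "Learning Rate"
    weight_decay=0.01,   # "Weight Decay"
    betas=(0.9, 0.999),  # "Betas": momentum and adaptive-rate memory
    eps=1e-8,            # "Eps": small constant that avoids division by zero
)

# "Learning Rate Scheduler": a static step-decay schedule that multiplies the
# learning rate by gamma every step_size epochs ("Step Size" / "Learning Rate Decay").
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(10):
    # ... one epoch of training would run here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```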

Advanced Options - Early Stopping

| Option | Description |
| --- | --- |
| Label | The column used to determine whether the model should conclude training early based on the Early Stopping Metric. |
| Metric | Stop model training if the selected metric has not improved for N epochs. |
| Patience | Determines how long to wait for improvement before stopping training early. Longer patience avoids premature stopping but may waste resources. Shorter patience stops early but risks missing further improvements. Patience is based on the evaluation epochs/steps, e.g. if evaluation steps is 10 and patience is 5, training will stop if the model has not improved in 50 steps (see the sketch after this table). |
| Threshold | The minimum change in the selected metric that counts as an improvement. A smaller threshold allows the model to capture finer performance gains, while a larger threshold may result in quicker early stopping. |
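
The patience and threshold mechanics can be summarized in a few lines. The sketch below is an illustrative implementation, not AutoML's internal logic: it stops training once the monitored metric has failed to improve by at least the threshold for `patience` consecutive evaluations.

```python
# Sketch of early-stopping bookkeeping: stop once the monitored metric fails to
# improve by at least `threshold` for `patience` consecutive evaluations.
# This is an illustrative implementation, not AutoML's internal logic.
class EarlyStopping:
    def __init__(self, patience: int = 5, threshold: float = 0.001):
        self.patience = patience    # "Patience": evaluations to wait
        self.threshold = threshold  # "Threshold": minimum change that counts
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, metric: float) -> bool:
        if metric > self.best + self.threshold:  # improvement large enough
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

# With evaluation every 10 steps and patience 5, training stops after 50 steps
# without a sufficient improvement in the monitored metric.
stopper = EarlyStopping(patience=5, threshold=0.001)
for eval_idx, val_f1 in enumerate([0.70, 0.72, 0.72, 0.721, 0.72, 0.719, 0.72, 0.72]):
    if stopper.should_stop(val_f1):
        print(f"stopping early at evaluation {eval_idx}")
        break
```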