*Comparing accuracy in Ludwig with and without pretrained weights on the Kaggle Dogs vs Cats dataset.*

The figure above compares a randomly initialized ResNet encoder (available in Ludwig v0.6) against the same model with weights pretrained on ImageNet. Even using the pretrained model without fine-tuning the weights (trainable=false), we observed a 20% increase in accuracy on the held-out test set over 20 epochs of training (95% accuracy with pretrained weights, 75% accuracy without). As with Ludwig's text encoders, users can set a single trainable parameter to switch between adjusting the weights of the pretrained model and keeping them fixed while training a series of dense layers on top. After only 2 epochs of training, the pretrained run was already at around 90% accuracy.

Ludwig v0.7 also introduces image augmentation, artificially increasing the size of the training dataset by applying a randomized set of transformations to each batch of images during training. You can customize your image augmentation pipeline with full control over steps and parameters, as in the sketch below.
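To make that concrete, here is a minimal sketch of an image input feature that combines a pretrained encoder with an augmentation pipeline. The specific option names shown (use_pretrained, random_horizontal_flip, random_rotate, degree) are illustrative assumptions rather than a definitive schema:

```yaml
input_features:
  - name: image_path
    type: image
    encoder:
      type: resnet
      use_pretrained: true   # start from ImageNet weights (assumed parameter name)
      trainable: false       # keep pretrained weights fixed; train only downstream layers
    augmentation:            # randomized transforms applied per batch during training
      - type: random_horizontal_flip
      - type: random_rotate
        degree: 15

output_features:
  - name: label
    type: binary
```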
For the complete list of options, see the Image Augmentation documentation.

## 50x Faster Fine-Tuning

As mentioned above, Ludwig currently supports two variations on fine-tuning, configured via the trainable encoder parameter:

- Modifying the weights of the pretrained encoder to adapt them to the downstream task (trainable=true).
- Keeping the pretrained encoder weights fixed and training a stack of dense layers that sit downstream as the combiner and decoder modules (trainable=false). This is sometimes distinguished as transfer learning.

Both of these options were available to users in Ludwig v0.6, but were slow to train without an expensive multi-GPU setup to take advantage of distributed training. In v0.7, we've introduced a number of significant performance improvements that collectively improve training throughput by 2x when trainable=true and over 50x when trainable=false:

- Automatic mixed precision (AMP) training, which is available for both trainable=true and trainable=false.
- Cached encoder embeddings, which are only available when trainable=false.
- Approximate training set evaluation (evaluate_training_set=false), which computes the reported training set metrics at the end of each epoch as a running aggregation of the metrics observed during training, rather than as a separate pass over the training set. Though this makes training metrics appear "noisy" in the early epochs of training, it generally results in a 33% speedup in training time, and as such we've made it the default behavior in v0.7.
- batch_size=auto as the new default, allowing Ludwig to automatically tune the batch size to maximize training throughput. Because this can result in training with very large batch sizes, we've also introduced ghost batch normalization for most of our encoders, combiners, and decoders, allowing you to train with very large batch sizes without degrading model convergence.

*Example of an AutoML config that includes new optimizations for fine-tuning text models.*
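Here is a minimal sketch of a fine-tuning config that combines the optimizations above. The column names are made up, and the placement of evaluate_training_set under trainer is our assumption, so defer to the Fine-Tuning Guide for the precise schema:

```yaml
input_features:
  - name: review            # hypothetical text column
    type: text
    encoder:
      type: bert
      trainable: false      # keep pretrained weights fixed for the 50x path

output_features:
  - name: sentiment         # hypothetical output column
    type: category

trainer:
  batch_size: auto              # auto-tune batch size for throughput (new default)
  evaluate_training_set: false  # approximate training metrics (new default)
  use_mixed_precision: true     # AMP training
```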
Check out our new Fine-Tuning Guide in the Ludwig documentation for more details.

## Mixed Precision Training

Automatic Mixed Precision (AMP) is a method for speeding up training by using float16 parameters where it makes sense. In our tests, we found it typically gave anywhere from a 2x to 2.5x speedup in training time. Mixed precision training is not always reliable in achieving the same convergence in model quality as float32 training, but its effect on large pretrained models like BERT is well understood, and as such we recommend enabling it for any fine-tuning task. See PyTorch's blog for more details on the technique.

In the future, we may make mixed precision the default in Ludwig when training on GPU, but for now, you can enable it by setting use_mixed_precision in the trainer section of your config:
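```yaml
trainer:
  use_mixed_precision: true
```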
## Cached Encoder Embeddings

When the embeddings are cached in the preprocessed data, we replace the respective encoder in the ECD model at training time with a shim "Skip Encoder" that passes the input tensor data directly into the downstream combiner for concatenation with other input features. At prediction time and when exporting the model to TorchScript, the Skip Encoder is replaced with the original encoder and its pretrained weights.

While embedding caching is useful for a single model training run because it pays the cost of the forward pass on the large pretrained model for only a single epoch, it can be even more valuable in subsequent experiments. Because Ludwig caches preprocessed data and reuses it for new model training experiments (when preprocessing parameters are the same), any subsequent training runs for the same dataset will require no forward passes on the pretrained model whatsoever.
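As a closing example, here is a minimal sketch of enabling embedding caching for a frozen text encoder. The cache_encoder_embeddings flag and its placement under the feature's preprocessing section are our reading of the Ludwig docs, so verify them against the Fine-Tuning Guide:

```yaml
input_features:
  - name: review                      # hypothetical text column
    type: text
    encoder:
      type: bert
      trainable: false                # caching applies only to frozen encoders
    preprocessing:
      cache_encoder_embeddings: true  # assumed flag: compute embeddings once at preprocessing time
```

Because the cached embeddings live in the preprocessed data, re-running training against the same dataset with unchanged preprocessing parameters reuses them directly, skipping the pretrained model's forward pass entirely.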