
What is Foundation Model Training?
Foundation Model Training lets you tune or further train a foundation model using the Azure Databricks API or UI. Because you start from an existing model, you can build your own model with far less data, time, and compute than training a model from scratch.
What Can You Do with Foundation Model Training?
Train a model with your own custom data; the results are stored in MLflow and you retain full control over the trained model (a minimal API sketch follows this list).
Automatically register the model to Unity Catalog for easy deployment with Model Serving.
Further train a previously trained model by loading its weights, or continue training a custom model.
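For example, assuming the databricks_genai package exposes a foundation_model.create() function as described in the Databricks documentation, starting a training run from Python might look like the sketch below; the model name, data path, and Unity Catalog schema are illustrative placeholders, and parameter names may differ in your installed version.

# Minimal sketch: start a Foundation Model Training run from Python.
# Assumes databricks_genai is installed (pip install databricks_genai) and that
# it exposes foundation_model.create(); verify names against your version.
from databricks.model_training import foundation_model as fm

run = fm.create(
    model="meta-llama/Llama-2-7b-chat-hf",                 # base model to tune
    train_data_path="main.my_schema.chat_training_data",   # hypothetical Delta table or .jsonl path
    register_to="main.my_schema",                          # Unity Catalog schema for the trained model
    task_type="CHAT_COMPLETION",                           # or "INSTRUCTION_FINETUNE" / "CONTINUED_PRETRAIN"
    training_duration="3ep",                               # e.g. three epochs
)

print(run.name, run.status)  # run details and checkpoints also appear in MLflow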
Databricks recommends using Foundation Model Training in the following cases:
You have tried few-shot learning and want better results.
You have tried prompt engineering on an existing model and want better results.
You want full ownership of a specific model for data privacy.
You are sensitive to latency or cost and want to use a smaller, cheaper model trained on your specific data.
Supported Tasks
Foundation Model Training supports the following use cases:
Chat completion: Train on chat logs between a user and an AI assistant. This format works both for actual chat logs and for question-answer and conversational text (example records follow this list).
Supervised fine-tuning: Train on structured prompt-response data. Used to adapt the model to a new task, change response style, or add instruction-following capabilities.
Continued pre-training: Train the model with additional text data. Used to add new knowledge to the model or to focus it on a specific domain.
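The expected training data differs per task. As an illustration, and assuming the field names described in the Databricks documentation (verify them against your workspace version), chat completion data is a JSONL file of role/content message lists, supervised fine-tuning data is a JSONL file of prompt/response pairs, and continued pre-training takes raw text files:

# Illustrative training records; field names follow the Databricks docs but
# should be verified against your workspace version.
import json

# Chat completion: one conversation per line, as a list of role/content messages.
chat_example = {
    "messages": [
        {"role": "user", "content": "What is Unity Catalog?"},
        {"role": "assistant", "content": "Unity Catalog is the Databricks governance layer for data and AI assets."},
    ]
}

# Supervised fine-tuning: structured prompt/response pairs.
sft_example = {
    "prompt": "Classify the sentiment of this review: 'Great battery life.'",
    "response": "positive",
}

# Write one JSON object per line (JSONL). Continued pre-training instead uses
# raw text, for example .txt files stored in a Unity Catalog volume.
with open("chat_train.jsonl", "w") as f:
    f.write(json.dumps(chat_example) + "\n")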
Requirements
Requirements for Foundation Model Training:
A Databricks workspace in one of the following Azure regions: centralus, eastus, eastus2, northcentralus, westcentralus, westus, westus3.
Install the Foundation Model Training APIs: pip install databricks_genai.
Databricks Runtime 12.2 LTS ML or above, if your data is in a Delta table.
Data Required for Model Training
Supervised fine-tuning and chat completion: You must provide enough tokens to fill the model's full context length, for example 4096 tokens for meta-llama/Llama-2-7b-chat-hf or 32768 tokens for mistralai/Mistral-7B-v0.1 (see the token-count sketch after this list).
Continued pre-training: A minimum of 1.5 million examples is recommended.
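As a rough check that a dataset meets the context-length requirement, you can count its tokens with the base model's tokenizer. The sketch below uses the Hugging Face transformers library purely as an illustration; it is not part of the Foundation Model Training API.

# Rough token count for a JSONL chat dataset against the model's context length.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
context_length = 32768  # context length of mistralai/Mistral-7B-v0.1

total_tokens = 0
with open("chat_train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        text = " ".join(message["content"] for message in record["messages"])
        total_tokens += len(tokenizer(text)["input_ids"])

print(f"{total_tokens} tokens in dataset; model context length is {context_length}")
if total_tokens < context_length:
    print("Not enough tokens to fill the model's context length.")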
Supported Models
Use the get_models() function in the databricks_genai API to list the latest supported models and their context lengths.
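For example, assuming the same databricks_genai import used above (the exact return format may vary by package version):

# List currently supported base models and their context lengths.
from databricks.model_training import foundation_model as fm

print(fm.get_models())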
Limitations
Large datasets (10B+ tokens) are not supported due to compute availability.
PrivateLink is not supported.
Continued pre-training workloads are limited to files of 60-256 MB.
Databricks strives to make the latest state-of-the-art models available for building custom models with Foundation Model Training. As new models become available, older models may be removed from the API or UI or may no longer be supported, with at least three months' notice.
For more information and demos, see the Databricks documentation and try the feature out in your own workspace.