Time to Complete: 1-2 hours
This guide provides comprehensive steps for integrating your Azure Databricks LLM development with GitHub, setting up CI/CD pipelines, and leveraging GitHub Copilot for enhanced development.
- Authentication Configuration
- CI/CD Pipelines with GitHub Actions
- Databricks Asset Bundles (DABs)
- Using GitHub Copilot for LLM Development
- Workflow Examples
For secure integration between GitHub and Azure Databricks, we'll use Azure service principals.
- Create a Microsoft Entra ID (formerly Azure Active Directory) service principal:
az ad sp create-for-rbac --name "databricks-github-integration" --role contributor --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>
Take note of the following from the output:
- `appId` (client ID)
- `password` (client secret)
- `tenant` (tenant ID)
- Add these credentials to GitHub repository secrets:
- Navigate to your GitHub repository
- Go to Settings > Secrets and Variables > Actions
- Add the following secrets:
- `AZURE_CLIENT_ID`: Service principal client ID
- `AZURE_CLIENT_SECRET`: Service principal client secret
- `AZURE_TENANT_ID`: Azure tenant ID
- `AZURE_SUBSCRIPTION_ID`: Your subscription ID
- `DATABRICKS_HOST`: Your Databricks workspace URL
- `DATABRICKS_TOKEN`: Your Databricks personal access token
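Deployment scripts that depend on these secrets should fail fast when one is missing from the environment rather than failing midway through a deploy. A minimal sketch, assuming the secrets are exposed as environment variables as in the workflows below (the helper name `require_env` is illustrative):

```python
import os

# The secrets configured above, as exposed to workflow steps via `env:`.
REQUIRED_VARS = [
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_TENANT_ID",
    "AZURE_SUBSCRIPTION_ID",
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
]

def require_env(names):
    """Return a dict of the named environment variables, raising if any is unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}
```

Calling `require_env(REQUIRED_VARS)` at the top of a deployment script turns a misconfigured repository secret into an immediate, clearly worded failure.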
- In your Databricks workspace, click on your profile icon (top right)
- Select "User Settings"
- Go to the "Access Tokens" tab
- Click "Generate New Token"
- Provide a name (e.g., "GitHub Integration") and set an expiration
- Copy the token and store it as `DATABRICKS_TOKEN` in GitHub secrets
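Scripts in the workflows below authenticate to the Databricks REST API by sending this token as a bearer token. A small sketch of the header construction (the function name is illustrative; real calls would pass these headers to an HTTP client such as `requests`):

```python
import os

def databricks_auth_headers(token=None):
    """Build the Authorization header used for Databricks REST API requests."""
    token = token or os.environ.get("DATABRICKS_TOKEN")
    if not token:
        raise ValueError("No Databricks token provided (set DATABRICKS_TOKEN)")
    return {"Authorization": f"Bearer {token}"}
```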
Create the following GitHub Actions workflow files in your repository:
name: Continuous Integration
on:
  push:
    branches: [ main, development ]
  pull_request:
    branches: [ main, development ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest
      - name: Run tests
        run: |
          pytest
      - name: Lint code
        run: |
          pip install flake8
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

name: Deploy to Databricks
on:
  push:
    branches: [ main ]
    paths:
      - 'notebooks/**'
      - 'src/**'
      - 'infrastructure/databricks/**'
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install Databricks CLI
        run: |
          pip install databricks-cli
      - name: Configure Databricks CLI
        run: |
          echo "[DEFAULT]" > ~/.databrickscfg
          echo "host = ${{ secrets.DATABRICKS_HOST }}" >> ~/.databrickscfg
          echo "token = ${{ secrets.DATABRICKS_TOKEN }}" >> ~/.databrickscfg
      - name: Deploy Notebooks
        run: |
          # Deploy notebooks to Databricks
          databricks workspace import_dir notebooks /Shared/Deployment/notebooks -o
      - name: Deploy Source Code
        run: |
          # Package and deploy source code
          zip -r src.zip src
          databricks fs cp src.zip dbfs:/FileStore/deployment/src.zip --overwrite

name: Deploy ML Model
on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Name of the model to deploy'
        required: true
      model_version:
        description: 'Version of the model to deploy'
        required: true
      environment:
        description: 'Environment to deploy to'
        required: true
        default: 'staging'
        type: choice
        options:
          - development
          - staging
          - production
jobs:
  deploy-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install mlflow azure-identity
      - name: Deploy model
        env:
          AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          AZURE_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
          AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
          MODEL_NAME: ${{ github.event.inputs.model_name }}
          MODEL_VERSION: ${{ github.event.inputs.model_version }}
          ENVIRONMENT: ${{ github.event.inputs.environment }}
        run: |
          python scripts/deploy/deploy_model.py

Databricks Asset Bundles (DABs) provide a way to package and deploy various Databricks assets like notebooks, ML models, and jobs.
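The model-deployment workflow invokes `scripts/deploy/deploy_model.py`, which is not shown in this guide. A hedged sketch of its core: reading the dispatch inputs from the environment and validating them into a deployment plan. A real script would then hand these values to the MLflow model registry (e.g. via `mlflow.tracking.MlflowClient`) or a serving endpoint update; that part is omitted here.

```python
import os

VALID_ENVIRONMENTS = {"development", "staging", "production"}

def build_deployment_plan(env=None):
    """Describe the requested deployment from the workflow's environment variables.

    Illustrative sketch only: the validated plan would be passed on to MLflow /
    Databricks serving in a real deploy_model.py.
    """
    env = env if env is not None else os.environ
    plan = {
        "model_name": env["MODEL_NAME"],
        "model_version": env["MODEL_VERSION"],
        "environment": env["ENVIRONMENT"],
    }
    if plan["environment"] not in VALID_ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {plan['environment']}")
    return plan

if __name__ == "__main__":
    print(build_deployment_plan())
```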
- Install the Databricks CLI (version 0.205 or later, which includes the `bundle` commands):
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
- Create a `dbundle.yaml` file at the root of your repository:
# dbundle.yaml
bundle:
  name: llm-mlops-bundle
  target: dev

workspace:
  host: ${DATABRICKS_HOST}

resources:
  notebooks:
    path: notebooks
    target_path: /Shared/llm-mlops-bundle
  jobs:
    my_training_job:
      name: "LLM Training Job"
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 13.3.x-gpu-ml-scala2.12
            node_type_id: Standard_NC6s_v3
            num_workers: 0  # single-node: required by the singleNode profile below
            spark_conf:
              spark.databricks.cluster.profile: singleNode
              spark.master: local[*]
      tasks:
        - task_key: train_model
          job_cluster_key: main
          notebook_task:
            notebook_path: /Shared/llm-mlops-bundle/development/train_model
            base_parameters:
              model_name: "llm-finetuned"
              epochs: 3
  models:
    llm_model:
      name: llm-finetuned
      serving_mode: rag

- Create a GitHub Action to deploy using DABs:
# .github/workflows/deploy_dab.yml
name: Deploy Databricks Assets
on:
  push:
    branches: [ main ]
    paths:
      - 'dbundle.yaml'
      - 'notebooks/**'
jobs:
  deploy-bundle:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main
      - name: Deploy DAB
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks bundle deploy

GitHub Copilot provides AI-assisted development that can significantly accelerate your LLM projects.
- Ensure GitHub Copilot is enabled for your GitHub account
- Install the GitHub Copilot extension in your IDE:
- VS Code: Search for "GitHub Copilot" in the extensions marketplace
- PyCharm: Install via "Plugins" in settings
- Vim/Neovim: Follow GitHub's plugin installation guides
When working with Copilot on LLM projects, use these prompting patterns:
-
Contextual Comments: Start with detailed comments about what you're trying to achieve
# Create a RAG pipeline that:
# 1. Takes a user query
# 2. Embeds it using Azure OpenAI embeddings
# 3. Retrieves relevant chunks from a vector store
# 4. Augments the original prompt with context
# 5. Sends to Azure OpenAI GPT-4 for completion
-
Function Signatures: Define function signatures and let Copilot complete them
def process_llm_response(response, metadata=None):
    """Process the raw LLM response and extract relevant information.

    Args:
        response: The raw response from the LLM
        metadata: Optional metadata about the query context

    Returns:
        A dictionary containing processed response data
    """
    # Copilot will suggest implementation
-
Test-Driven Development: Write test cases first, then let Copilot generate the implementation
def test_rag_retrieval():
    query = "What is the capital of France?"
    result = rag_pipeline.retrieve(query, top_k=3)
    assert len(result) == 3
    assert all("text" in doc for doc in result)
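Given a test like the one above, Copilot can propose an implementation for you to review. As a point of reference, here is a minimal in-memory sketch that would satisfy that test; the class and its word-overlap scoring are purely illustrative, not a real vector store or embedding model:

```python
class InMemoryRagPipeline:
    """Toy retriever: ranks documents by word overlap with the query."""

    def __init__(self, documents):
        # documents: list of {"text": ...} dicts standing in for indexed chunks
        self.documents = documents

    def retrieve(self, query, top_k=3):
        query_words = set(query.lower().split())

        def score(doc):
            return len(query_words & set(doc["text"].lower().split()))

        # Python's sort is stable, so equally scored docs keep insertion order
        ranked = sorted(self.documents, key=score, reverse=True)
        return ranked[:top_k]

rag_pipeline = InMemoryRagPipeline([
    {"text": "Paris is the capital of France."},
    {"text": "Berlin is the capital of Germany."},
    {"text": "The Eiffel Tower is in Paris."},
    {"text": "Madrid is the capital of Spain."},
])
```

In a real project the overlap scoring would be replaced by embedding similarity against a vector store, but the retrieve(query, top_k) contract the test pins down stays the same.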
# .github/workflows/train_register_model.yml
name: Train and Register Model
on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Name for the model'
        required: true
        default: 'llm-finetuned'
      dataset:
        description: 'Dataset to use for training'
        required: true
      epochs:
        description: 'Number of training epochs'
        required: true
        default: '3'
jobs:
  train-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install databricks-cli
      - name: Configure Databricks CLI
        run: |
          echo "[DEFAULT]" > ~/.databrickscfg
          echo "host = ${{ secrets.DATABRICKS_HOST }}" >> ~/.databrickscfg
          echo "token = ${{ secrets.DATABRICKS_TOKEN }}" >> ~/.databrickscfg
      - name: Submit training job
        run: |
          databricks jobs submit --json '{
            "name": "Train LLM Model",
            "new_cluster": {
              "spark_version": "13.3.x-gpu-ml-scala2.12",
              "node_type_id": "Standard_NC6s_v3",
              "num_workers": 0,
              "spark_conf": {
                "spark.databricks.cluster.profile": "singleNode",
                "spark.master": "local[*]"
              }
            },
            "notebook_task": {
              "notebook_path": "/Shared/Deployment/notebooks/training/fine_tune_llm",
              "base_parameters": {
                "model_name": "${{ github.event.inputs.model_name }}",
                "dataset": "${{ github.event.inputs.dataset }}",
                "epochs": "${{ github.event.inputs.epochs }}"
              }
            }
          }'

# .github/workflows/mlops_ci.yml
name: MLOps CI
on:
  pull_request:
    branches: [ main ]
    paths:
      - 'src/model_training/**'
      - 'src/evaluation/**'
      - 'src/deployment/**'
      - 'tests/**'
jobs:
  validate-mlops:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest pytest-cov black flake8
      - name: Lint with flake8
        run: |
          flake8 src tests
      - name: Format check with black
        run: |
          black --check src tests
      - name: Run tests
        run: |
          pytest tests/ --cov=src --cov-report=xml
      - name: Upload test coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          fail_ci_if_error: false

These workflow examples provide a foundation that you can customize for your specific LLM development and deployment needs on Azure Databricks.
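One practical refinement: the inline JSON passed to `databricks jobs submit` in the training workflow is easy to break with a stray quote or comma. Building the payload in Python and serializing it keeps it valid by construction. A hedged sketch (the function name and default values are illustrative, mirroring the workflow above):

```python
import json

def build_training_run_payload(model_name, dataset, epochs):
    """Build the one-time run payload used with `databricks jobs submit --json`."""
    return {
        "name": "Train LLM Model",
        "new_cluster": {
            "spark_version": "13.3.x-gpu-ml-scala2.12",
            "node_type_id": "Standard_NC6s_v3",
            "num_workers": 0,
            "spark_conf": {
                "spark.databricks.cluster.profile": "singleNode",
                "spark.master": "local[*]",
            },
        },
        "notebook_task": {
            "notebook_path": "/Shared/Deployment/notebooks/training/fine_tune_llm",
            "base_parameters": {
                # Databricks notebook parameters are strings
                "model_name": model_name,
                "dataset": dataset,
                "epochs": str(epochs),
            },
        },
    }

# Serialize for the CLI (or write to a file and pass --json-file):
payload_json = json.dumps(build_training_run_payload("llm-finetuned", "my-dataset", 3))
```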