
🔄 GitHub Platform Integration

Time to Complete: 1-2 hours

This guide provides comprehensive steps for integrating your Azure Databricks LLM development with GitHub, setting up CI/CD pipelines, and leveraging GitHub Copilot for enhanced development.

Table of Contents

  • Authentication Configuration
  • CI/CD Pipelines with GitHub Actions
  • Databricks Asset Bundles (DABs)
  • Using GitHub Copilot for LLM Development
  • Workflow Examples

Setting Up Service Principals

For secure integration between GitHub and Azure Databricks, we'll use Azure service principals.

  1. Create an Azure Active Directory service principal:

```bash
az ad sp create-for-rbac --name "databricks-github-integration" --role contributor --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>
```

  2. Take note of the following from the output:

    • appId (client ID)
    • password (client secret)
    • tenant (tenant ID)
  3. Add these credentials to GitHub repository secrets:

    • Navigate to your GitHub repository
    • Go to Settings > Secrets and Variables > Actions
    • Add the following secrets:
      • AZURE_CLIENT_ID: Service principal client ID
      • AZURE_CLIENT_SECRET: Service principal client secret
      • AZURE_TENANT_ID: Azure tenant ID
      • AZURE_SUBSCRIPTION_ID: Your subscription ID
      • DATABRICKS_HOST: Your Databricks workspace URL
      • DATABRICKS_TOKEN: Your Databricks personal access token

Creating a Databricks Personal Access Token

  1. In your Databricks workspace, click on your profile icon (top right)
  2. Select "User Settings"
  3. Go to the "Access Tokens" tab
  4. Click "Generate New Token"
  5. Provide a name (e.g., "GitHub Integration") and set an expiration
  6. Copy the token and store it as DATABRICKS_TOKEN in GitHub secrets
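
Once the token is stored as a secret, CI scripts authenticate to the Databricks REST API by sending it as a bearer token. Below is a minimal sketch of that pattern; the helper names (`build_databricks_headers`, `workspace_url`) are illustrative, while the environment variable names match the secrets configured above.

```python
import os

def build_databricks_headers() -> dict:
    """Build the auth headers the Databricks REST API expects.

    Reads DATABRICKS_TOKEN from the environment, which GitHub Actions
    populates from the repository secret of the same name.
    """
    token = os.environ["DATABRICKS_TOKEN"]
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }

def workspace_url(path: str) -> str:
    """Join the workspace host (DATABRICKS_HOST) with an API path,
    normalizing any stray slashes between the two."""
    host = os.environ["DATABRICKS_HOST"].rstrip("/")
    return f"{host}/{path.lstrip('/')}"
```

A script would then call, for example, `requests.get(workspace_url("/api/2.0/clusters/list"), headers=build_databricks_headers())`.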

CI/CD Pipelines with GitHub Actions

Setting Up CI/CD for LLM Projects

Create the following GitHub Actions workflow files in your repository:

1. Basic CI Pipeline .github/workflows/ci.yml:

```yaml
name: Continuous Integration

on:
  push:
    branches: [ main, development ]
  pull_request:
    branches: [ main, development ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install pytest
    - name: Run tests
      run: |
        pytest
    - name: Lint code
      run: |
        pip install flake8
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
```
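
The pytest step above assumes a test suite exists in the repository. As a hypothetical example of the kind of unit test this pipeline would pick up, here is a test for a simple text-chunking helper (a common RAG preprocessing step; `chunk_text` is illustrative, not part of any library):

```python
# tests/test_chunking.py — example unit test the CI pipeline would run
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size chunks with optional overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def test_chunks_cover_whole_text():
    text = "abcdefghij"
    chunks = chunk_text(text, chunk_size=4, overlap=1)
    assert chunks[0] == "abcd"
    assert chunks[1] == "defg"
    # Re-joining with the overlap removed reproduces the input
    assert chunks[0] + "".join(c[1:] for c in chunks[1:]) == text
```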

2. Databricks Deployment Pipeline .github/workflows/deploy.yml:

```yaml
name: Deploy to Databricks

on:
  push:
    branches: [ main ]
    paths:
      - 'notebooks/**'
      - 'src/**'
      - 'infrastructure/databricks/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install Databricks CLI
      run: |
        pip install databricks-cli
    - name: Configure Databricks CLI
      run: |
        echo "[DEFAULT]" > ~/.databrickscfg
        echo "host = ${{ secrets.DATABRICKS_HOST }}" >> ~/.databrickscfg
        echo "token = ${{ secrets.DATABRICKS_TOKEN }}" >> ~/.databrickscfg
    - name: Deploy Notebooks
      run: |
        # Deploy notebooks to Databricks
        databricks workspace import_dir notebooks /Shared/Deployment/notebooks -o
    - name: Deploy Source Code
      run: |
        # Package and deploy source code
        zip -r src.zip src
        databricks fs cp src.zip dbfs:/FileStore/deployment/src.zip --overwrite
```

3. Model Deployment Workflow .github/workflows/model_deploy.yml:

```yaml
name: Deploy ML Model

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Name of the model to deploy'
        required: true
      model_version:
        description: 'Version of the model to deploy'
        required: true
      environment:
        description: 'Environment to deploy to'
        required: true
        default: 'staging'
        type: choice
        options:
          - development
          - staging
          - production

jobs:
  deploy-model:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install mlflow azure-identity
    - name: Deploy model
      env:
        AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
        AZURE_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
        AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        MODEL_NAME: ${{ github.event.inputs.model_name }}
        MODEL_VERSION: ${{ github.event.inputs.model_version }}
        ENVIRONMENT: ${{ github.event.inputs.environment }}
      run: |
        python scripts/deploy/deploy_model.py
```
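
The workflow delegates the actual work to scripts/deploy/deploy_model.py. One possible shape for that script is sketched below; the endpoint-naming convention and the choice to record the target environment as a model-version tag are assumptions of this sketch, not a prescribed implementation.

```python
# scripts/deploy/deploy_model.py (illustrative sketch)
import os

def endpoint_name(model_name: str, environment: str) -> str:
    """Derive a serving endpoint name from the model and target
    environment, e.g. llm-finetuned-staging. The naming convention
    is an assumption of this sketch."""
    return f"{model_name}-{environment}".lower()

def main() -> None:
    model = os.environ["MODEL_NAME"]
    version = os.environ["MODEL_VERSION"]
    env = os.environ["ENVIRONMENT"]

    # Imported lazily so the naming helper stays testable without mlflow.
    import mlflow
    mlflow.set_registry_uri("databricks")  # use the workspace model registry
    client = mlflow.tracking.MlflowClient()
    # Record which environment this model version was deployed to.
    client.set_model_version_tag(model, version, "deployed_env", env)
    print(f"Requested deployment of {model} v{version} to {endpoint_name(model, env)}")

# Only run when invoked by the workflow with its env vars present.
if __name__ == "__main__" and "MODEL_NAME" in os.environ:
    main()
```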

Databricks Asset Bundles (DABs)

Databricks Asset Bundles (DABs) provide a way to package and deploy various Databricks assets like notebooks, ML models, and jobs.

Setting Up Databricks Asset Bundles

  1. Install the Databricks CLI (version 0.205 or later, which includes the bundle commands):

```bash
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```

  2. Create a databricks.yml file at the root of your repository:

```yaml
# databricks.yml
bundle:
  name: llm-mlops-bundle
  target: dev

workspace:
  host: ${DATABRICKS_HOST}

resources:
  notebooks:
    path: notebooks
    target_path: /Shared/llm-mlops-bundle

  jobs:
    my_training_job:
      name: "LLM Training Job"
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 13.3.x-gpu-ml-scala2.12
            node_type_id: Standard_NC6s_v3
            num_workers: 0
            spark_conf:
              spark.databricks.cluster.profile: singleNode
              spark.master: local[*]
      tasks:
        - task_key: train_model
          job_cluster_key: main
          notebook_task:
            notebook_path: /Shared/llm-mlops-bundle/development/train_model
            base_parameters:
              model_name: "llm-finetuned"
              epochs: 3

  models:
    llm_model:
      name: llm-finetuned
      serving_mode: rag
```

  3. Create a GitHub Action to deploy using DABs:

```yaml
# .github/workflows/deploy_dab.yml
name: Deploy Databricks Assets

on:
  push:
    branches: [ main ]
    paths:
      - 'databricks.yml'
      - 'notebooks/**'

jobs:
  deploy-bundle:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Install Databricks CLI
      uses: databricks/setup-cli@main
    - name: Deploy DAB
      env:
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      run: |
        databricks bundle deploy
```

Using GitHub Copilot for LLM Development

GitHub Copilot provides AI-assisted development that can significantly accelerate your LLM projects.

Setting Up Copilot

  1. Ensure GitHub Copilot is enabled for your GitHub account
  2. Install the GitHub Copilot extension in your IDE:
    • VS Code: Search for "GitHub Copilot" in the extensions marketplace
    • PyCharm: Install via "Plugins" in settings
    • Vim/Neovim: Follow GitHub's plugin installation guides

Effective Use of Copilot for LLM Development

Prompting Strategies for LLM Code

When working with Copilot on LLM projects, use these prompting patterns:

  1. Contextual Comments: Start with detailed comments about what you're trying to achieve

```python
# Create a RAG pipeline that:
# 1. Takes a user query
# 2. Embeds it using Azure OpenAI embeddings
# 3. Retrieves relevant chunks from a vector store
# 4. Augments the original prompt with context
# 5. Sends to Azure OpenAI GPT-4 for completion
```

  2. Function Signatures: Define function signatures and let Copilot complete them

```python
def process_llm_response(response, metadata=None):
    """Process the raw LLM response and extract relevant information.

    Args:
        response: The raw response from the LLM
        metadata: Optional metadata about the query context

    Returns:
        A dictionary containing processed response data
    """
    # Copilot will suggest implementation
```

  3. Test-Driven Development: Write test cases first, then let Copilot generate the implementation

```python
def test_rag_retrieval():
    query = "What is the capital of France?"
    result = rag_pipeline.retrieve(query, top_k=3)
    assert len(result) == 3
    assert all("text" in doc for doc in result)
```
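
Given a test like that, Copilot typically proposes a matching implementation. A minimal stub that would satisfy it is shown below; the `SimpleRAGPipeline` class and its word-overlap scoring are illustrative stand-ins for a real embedding model and vector store.

```python
class SimpleRAGPipeline:
    """Toy retriever: ranks documents by word overlap with the query.
    A real pipeline would use embeddings and a vector store instead."""

    def __init__(self, documents: list[str]):
        self.documents = documents

    def retrieve(self, query: str, top_k: int = 3) -> list[dict]:
        # Normalize words by lowercasing and stripping punctuation.
        q = {w.strip(".,!?") for w in query.lower().split()}
        scored = sorted(
            self.documents,
            key=lambda d: len(q & {w.strip(".,!?") for w in d.lower().split()}),
            reverse=True,
        )
        return [{"text": d} for d in scored[:top_k]]

rag_pipeline = SimpleRAGPipeline([
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is in Paris.",
    "Madrid is the capital of Spain.",
])

def test_rag_retrieval():
    query = "What is the capital of France?"
    result = rag_pipeline.retrieve(query, top_k=3)
    assert len(result) == 3
    assert all("text" in doc for doc in result)
```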

Workflow Examples

Example 1: Training and Registering an LLM Model

```yaml
# .github/workflows/train_register_model.yml
name: Train and Register Model

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Name for the model'
        required: true
        default: 'llm-finetuned'
      dataset:
        description: 'Dataset to use for training'
        required: true
      epochs:
        description: 'Number of training epochs'
        required: true
        default: '3'

jobs:
  train-model:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install databricks-cli
    - name: Configure Databricks CLI
      run: |
        echo "[DEFAULT]" > ~/.databrickscfg
        echo "host = ${{ secrets.DATABRICKS_HOST }}" >> ~/.databrickscfg
        echo "token = ${{ secrets.DATABRICKS_TOKEN }}" >> ~/.databrickscfg
    - name: Submit training job
      run: |
        databricks runs submit --json '{
          "run_name": "Train LLM Model",
          "new_cluster": {
            "spark_version": "13.3.x-gpu-ml-scala2.12",
            "node_type_id": "Standard_NC6s_v3",
            "num_workers": 0,
            "spark_conf": {
              "spark.databricks.cluster.profile": "singleNode",
              "spark.master": "local[*]"
            }
          },
          "notebook_task": {
            "notebook_path": "/Shared/Deployment/notebooks/training/fine_tune_llm",
            "base_parameters": {
              "model_name": "${{ github.event.inputs.model_name }}",
              "dataset": "${{ github.event.inputs.dataset }}",
              "epochs": "${{ github.event.inputs.epochs }}"
            }
          }
        }'
```

Example 2: Continuous Integration for MLOps Code

```yaml
# .github/workflows/mlops_ci.yml
name: MLOps CI

on:
  pull_request:
    branches: [ main ]
    paths:
      - 'src/model_training/**'
      - 'src/evaluation/**'
      - 'src/deployment/**'
      - 'tests/**'

jobs:
  validate-mlops:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install pytest pytest-cov black flake8
    - name: Lint with flake8
      run: |
        flake8 src tests
    - name: Format check with black
      run: |
        black --check src tests
    - name: Run tests
      run: |
        pytest tests/ --cov=src --cov-report=xml
    - name: Upload test coverage
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage.xml
        fail_ci_if_error: false
```

These workflow examples provide a foundation that you can customize for your specific LLM development and deployment needs on Azure Databricks.