
🔄 GitHub Platform Integration

Time to Complete: 1-2 hours

This guide provides comprehensive steps for integrating your Azure Databricks LLM development with GitHub, setting up CI/CD pipelines, and leveraging GitHub Copilot for enhanced development.

Table of Contents

  • Authentication Configuration
  • CI/CD Pipelines with GitHub Actions
  • Databricks Asset Bundles (DABs)
  • Using GitHub Copilot for LLM Development
  • Workflow Examples

Setting Up Service Principals

For secure integration between GitHub and Azure Databricks, we'll use Azure service principals.

  1. Create an Azure Active Directory service principal:

```bash
az ad sp create-for-rbac --name "databricks-github-integration" --role contributor --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>
```

  2. Take note of the following from the output:

    • appId (client ID)
    • password (client secret)
    • tenant (tenant ID)
  3. Add these credentials to GitHub repository secrets:

    • Navigate to your GitHub repository
    • Go to Settings > Secrets and Variables > Actions
    • Add the following secrets:
      • AZURE_CLIENT_ID: Service principal client ID
      • AZURE_CLIENT_SECRET: Service principal client secret
      • AZURE_TENANT_ID: Azure tenant ID
      • AZURE_SUBSCRIPTION_ID: Your subscription ID
      • DATABRICKS_HOST: Your Databricks workspace URL
      • DATABRICKS_TOKEN: Your Databricks personal access token

Creating a Databricks Personal Access Token

  1. In your Databricks workspace, click on your profile icon (top right)
  2. Select "User Settings"
  3. Go to the "Access Tokens" tab
  4. Click "Generate New Token"
  5. Provide a name (e.g., "GitHub Integration") and set an expiration
  6. Copy the token and store it as DATABRICKS_TOKEN in GitHub secrets
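
Once the token is stored as a secret, CI scripts authenticate to the Databricks REST API by sending it as a bearer token. Below is a minimal sketch of that pattern; the helper names (`build_databricks_headers`, `workspace_url`) are illustrative, while the environment variable names match the secrets configured above.

```python
import os

def build_databricks_headers() -> dict:
    """Build the auth headers the Databricks REST API expects.

    Reads DATABRICKS_TOKEN from the environment, which GitHub Actions
    populates from the repository secret of the same name.
    """
    token = os.environ["DATABRICKS_TOKEN"]
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }

def workspace_url(path: str) -> str:
    """Join the workspace host (DATABRICKS_HOST) with an API path,
    normalizing any stray slashes between the two."""
    host = os.environ["DATABRICKS_HOST"].rstrip("/")
    return f"{host}/{path.lstrip('/')}"
```

A script would then call, for example, `requests.get(workspace_url("/api/2.0/clusters/list"), headers=build_databricks_headers())`.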

CI/CD Pipelines with GitHub Actions

Setting Up CI/CD for LLM Projects

Create the following GitHub Actions workflow files in your repository:

1. Basic CI Pipeline .github/workflows/ci.yml:

```yaml
name: Continuous Integration

on:
  push:
    branches: [ main, development ]
  pull_request:
    branches: [ main, development ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install pytest
    - name: Run tests
      run: |
        pytest
    - name: Lint code
      run: |
        pip install flake8
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
```
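
The pytest step above assumes a test suite exists in the repository. As a hypothetical example of the kind of unit test this pipeline would pick up, here is a test for a simple text-chunking helper (a common RAG preprocessing step; `chunk_text` is illustrative, not part of any library):

```python
# tests/test_chunking.py — example unit test the CI pipeline would run
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size chunks with optional overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def test_chunks_cover_whole_text():
    text = "abcdefghij"
    chunks = chunk_text(text, chunk_size=4, overlap=1)
    assert chunks[0] == "abcd"
    assert chunks[1] == "defg"
    # Re-joining with the overlap removed reproduces the input
    assert chunks[0] + "".join(c[1:] for c in chunks[1:]) == text
```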

2. Databricks Deployment Pipeline .github/workflows/deploy.yml:

```yaml
name: Deploy to Databricks

on:
  push:
    branches: [ main ]
    paths:
      - 'notebooks/**'
      - 'src/**'
      - 'infrastructure/databricks/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install Databricks CLI
      run: |
        pip install databricks-cli
    - name: Configure Databricks CLI
      run: |
        echo "[DEFAULT]" > ~/.databrickscfg
        echo "host = ${{ secrets.DATABRICKS_HOST }}" >> ~/.databrickscfg
        echo "token = ${{ secrets.DATABRICKS_TOKEN }}" >> ~/.databrickscfg
    - name: Deploy Notebooks
      run: |
        # Deploy notebooks to Databricks
        databricks workspace import_dir notebooks /Shared/Deployment/notebooks -o
    - name: Deploy Source Code
      run: |
        # Package and deploy source code
        zip -r src.zip src
        databricks fs cp src.zip dbfs:/FileStore/deployment/src.zip --overwrite
```

3. Model Deployment Workflow .github/workflows/model_deploy.yml:

```yaml
name: Deploy ML Model

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Name of the model to deploy'
        required: true
      model_version:
        description: 'Version of the model to deploy'
        required: true
      environment:
        description: 'Environment to deploy to'
        required: true
        default: 'staging'
        type: choice
        options:
          - development
          - staging
          - production

jobs:
  deploy-model:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install mlflow azure-identity
    - name: Deploy model
      env:
        AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
        AZURE_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
        AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        MODEL_NAME: ${{ github.event.inputs.model_name }}
        MODEL_VERSION: ${{ github.event.inputs.model_version }}
        ENVIRONMENT: ${{ github.event.inputs.environment }}
      run: |
        python scripts/deploy/deploy_model.py
```
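
The workflow delegates the actual work to scripts/deploy/deploy_model.py. One possible shape for that script is sketched below; the endpoint-naming convention and the choice to record the target environment as a model-version tag are assumptions of this sketch, not a prescribed implementation.

```python
# scripts/deploy/deploy_model.py (illustrative sketch)
import os

def endpoint_name(model_name: str, environment: str) -> str:
    """Derive a serving endpoint name from the model and target
    environment, e.g. llm-finetuned-staging. The naming convention
    is an assumption of this sketch."""
    return f"{model_name}-{environment}".lower()

def main() -> None:
    model = os.environ["MODEL_NAME"]
    version = os.environ["MODEL_VERSION"]
    env = os.environ["ENVIRONMENT"]

    # Imported lazily so the naming helper stays testable without mlflow.
    import mlflow
    mlflow.set_registry_uri("databricks")  # use the workspace model registry
    client = mlflow.tracking.MlflowClient()
    # Record which environment this model version was deployed to.
    client.set_model_version_tag(model, version, "deployed_env", env)
    print(f"Requested deployment of {model} v{version} to {endpoint_name(model, env)}")

# Only run when invoked by the workflow with its env vars present.
if __name__ == "__main__" and "MODEL_NAME" in os.environ:
    main()
```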

Databricks Asset Bundles (DABs)

Databricks Asset Bundles (DABs) provide a way to package and deploy various Databricks assets like notebooks, ML models, and jobs.

Setting Up Databricks Asset Bundles

  1. Install the Databricks CLI (version 0.205 or later, which includes the bundle commands):

```bash
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```

  2. Create a databricks.yml file at the root of your repository:

```yaml
# databricks.yml
bundle:
  name: llm-mlops-bundle
  target: dev

workspace:
  host: ${DATABRICKS_HOST}

resources:
  notebooks:
    path: notebooks
    target_path: /Shared/llm-mlops-bundle

  jobs:
    my_training_job:
      name: "LLM Training Job"
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 13.3.x-gpu-ml-scala2.12
            node_type_id: Standard_NC6s_v3
            num_workers: 0
            spark_conf:
              spark.databricks.cluster.profile: singleNode
              spark.master: local[*]
      tasks:
        - task_key: train_model
          job_cluster_key: main
          notebook_task:
            notebook_path: /Shared/llm-mlops-bundle/development/train_model
            base_parameters:
              model_name: "llm-finetuned"
              epochs: 3

  models:
    llm_model:
      name: llm-finetuned
      serving_mode: rag
```

  3. Create a GitHub Action to deploy using DABs:

```yaml
# .github/workflows/deploy_dab.yml
name: Deploy Databricks Assets

on:
  push:
    branches: [ main ]
    paths:
      - 'databricks.yml'
      - 'notebooks/**'

jobs:
  deploy-bundle:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Install Databricks CLI
      uses: databricks/setup-cli@main
    - name: Deploy DAB
      env:
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      run: |
        databricks bundle deploy
```

Using GitHub Copilot for LLM Development

GitHub Copilot provides AI-assisted development that can significantly accelerate your LLM projects.

Setting Up Copilot

  1. Ensure GitHub Copilot is enabled for your GitHub account
  2. Install the GitHub Copilot extension in your IDE:
    • VS Code: Search for "GitHub Copilot" in the extensions marketplace
    • PyCharm: Install via "Plugins" in settings
    • Vim/Neovim: Follow GitHub's plugin installation guides

Effective Use of Copilot for LLM Development

Prompting Strategies for LLM Code

When working with Copilot on LLM projects, use these prompting patterns:

  1. Contextual Comments: Start with detailed comments about what you're trying to achieve

```python
# Create a RAG pipeline that:
# 1. Takes a user query
# 2. Embeds it using Azure OpenAI embeddings
# 3. Retrieves relevant chunks from a vector store
# 4. Augments the original prompt with context
# 5. Sends to Azure OpenAI GPT-4 for completion
```

  2. Function Signatures: Define function signatures and let Copilot complete them

```python
def process_llm_response(response, metadata=None):
    """Process the raw LLM response and extract relevant information.

    Args:
        response: The raw response from the LLM
        metadata: Optional metadata about the query context

    Returns:
        A dictionary containing processed response data
    """
    # Copilot will suggest implementation
```

  3. Test-Driven Development: Write test cases first, then let Copilot generate the implementation

```python
def test_rag_retrieval():
    query = "What is the capital of France?"
    result = rag_pipeline.retrieve(query, top_k=3)
    assert len(result) == 3
    assert all("text" in doc for doc in result)
```
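
Given a test like that, Copilot typically proposes a matching implementation. A minimal stub that would satisfy it is shown below; the `SimpleRAGPipeline` class and its word-overlap scoring are illustrative stand-ins for a real embedding model and vector store.

```python
class SimpleRAGPipeline:
    """Toy retriever: ranks documents by word overlap with the query.
    A real pipeline would use embeddings and a vector store instead."""

    def __init__(self, documents: list[str]):
        self.documents = documents

    def retrieve(self, query: str, top_k: int = 3) -> list[dict]:
        # Normalize words by lowercasing and stripping punctuation.
        q = {w.strip(".,!?") for w in query.lower().split()}
        scored = sorted(
            self.documents,
            key=lambda d: len(q & {w.strip(".,!?") for w in d.lower().split()}),
            reverse=True,
        )
        return [{"text": d} for d in scored[:top_k]]

rag_pipeline = SimpleRAGPipeline([
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is in Paris.",
    "Madrid is the capital of Spain.",
])

def test_rag_retrieval():
    query = "What is the capital of France?"
    result = rag_pipeline.retrieve(query, top_k=3)
    assert len(result) == 3
    assert all("text" in doc for doc in result)
```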

Workflow Examples

Example 1: Training and Registering an LLM Model

```yaml
# .github/workflows/train_register_model.yml
name: Train and Register Model

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Name for the model'
        required: true
        default: 'llm-finetuned'
      dataset:
        description: 'Dataset to use for training'
        required: true
      epochs:
        description: 'Number of training epochs'
        required: true
        default: '3'

jobs:
  train-model:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install databricks-cli
    - name: Configure Databricks CLI
      run: |
        echo "[DEFAULT]" > ~/.databrickscfg
        echo "host = ${{ secrets.DATABRICKS_HOST }}" >> ~/.databrickscfg
        echo "token = ${{ secrets.DATABRICKS_TOKEN }}" >> ~/.databrickscfg
    - name: Submit training job
      run: |
        databricks runs submit --json '{
          "run_name": "Train LLM Model",
          "new_cluster": {
            "spark_version": "13.3.x-gpu-ml-scala2.12",
            "node_type_id": "Standard_NC6s_v3",
            "num_workers": 0,
            "spark_conf": {
              "spark.databricks.cluster.profile": "singleNode",
              "spark.master": "local[*]"
            }
          },
          "notebook_task": {
            "notebook_path": "/Shared/Deployment/notebooks/training/fine_tune_llm",
            "base_parameters": {
              "model_name": "${{ github.event.inputs.model_name }}",
              "dataset": "${{ github.event.inputs.dataset }}",
              "epochs": "${{ github.event.inputs.epochs }}"
            }
          }
        }'
```

Example 2: Continuous Integration for MLOps Code

```yaml
# .github/workflows/mlops_ci.yml
name: MLOps CI

on:
  pull_request:
    branches: [ main ]
    paths:
      - 'src/model_training/**'
      - 'src/evaluation/**'
      - 'src/deployment/**'
      - 'tests/**'

jobs:
  validate-mlops:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install pytest pytest-cov black flake8
    - name: Lint with flake8
      run: |
        flake8 src tests
    - name: Format check with black
      run: |
        black --check src tests
    - name: Run tests
      run: |
        pytest tests/ --cov=src --cov-report=xml
    - name: Upload test coverage
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage.xml
        fail_ci_if_error: false
```

These workflow examples provide a foundation that you can customize for your specific LLM development and deployment needs on Azure Databricks.