Customizing Schema and Data

How to Use Your Own Data

Files processed by the solution are mapped and transformed into schemas — strongly typed Pydantic class definitions that represent a standardized output for each document type. For example, the accelerator includes an AutoInsuranceClaimForm schema with fields like policy_number, date_of_loss, and vehicle_information.

Using AI, the processing pipeline extracts content from each document (text, images, tables), then maps the extracted data into the schema fields using GPT-5.1 with structured JSON output — field descriptions in the schema class act as extraction guidance for the LLM.

Create schemas that are specific to your business and domain requirements. Many schemas are common across industries, but defining your own lets you capture the variations your use case needs.

Schema & SchemaSet Structure

Before processing documents, schemas must be registered in the system and grouped into schema sets. The diagram below shows the three-step preparation flow and how schemas are used at runtime:

flowchart TB
    subgraph Step1["<b>Step 1: Register Schemas</b> (one per document type)<br/>POST /schemavault/ × N"]
        S1["🗎 AutoInsuranceClaimForm<br/><i>autoclaim.py</i><br/>Schema ID: abc123"]
        S2["🗎 PoliceReportDocument<br/><i>policereport.py</i><br/>Schema ID: def456"]
        S3["🗎 RepairEstimateDocument<br/><i>repairestimate.py</i><br/>Schema ID: ghi789"]
        S4["🗎 ...<br/><i>more schemas</i>"]
    end

    subgraph Step2["<b>Step 2: Create SchemaSet</b><br/>POST /schemasetvault/"]
        SS["📂 SchemaSet: FNOL Auto Claims<br/>SchemaSet ID: ss-001<br/>Schemas: [ ] <i>(empty)</i>"]
    end

    subgraph Step3["<b>Step 3: Add Schemas to SchemaSet</b><br/>POST /schemasetvault/{ss-001}/schemas × N"]
        SS_FULL["📂 SchemaSet: FNOL Auto Claims (ss-001)"]
        SS_S1["🗎 AutoInsuranceClaimForm<br/>(abc123)"]
        SS_S2["🗎 PoliceReportDocument<br/>(def456)"]
        SS_S3["🗎 RepairEstimateDocument<br/>(ghi789)"]
        SS_FULL --- SS_S1
        SS_FULL --- SS_S2
        SS_FULL --- SS_S3
    end

    subgraph Runtime["<b>Runtime — Pipeline Map Step</b>"]
        R1["1. Look up Schema metadata<br/>from Cosmos DB"]
        R2["2. Download .py class file<br/>from Blob Storage"]
        R3["3. Dynamically load Pydantic class<br/>→ generate JSON Schema"]
        R4["4. Embed JSON Schema in<br/>GPT-5.1 prompt"]
        R5["5. Validate response with<br/>Pydantic → confidence scoring"]
        R1 --> R2 --> R3 --> R4 --> R5
    end

    S1 & S2 & S3 --> Step2
    Step2 --> Step3
    Step3 -->|"Claim created<br/>with SchemaSet"| Runtime

    style Step1 fill:#e8f4fd,stroke:#2196F3
    style Step2 fill:#fff3e0,stroke:#FF9800
    style Step3 fill:#e8f5e9,stroke:#4CAF50
    style Runtime fill:#f3e5f5,stroke:#9C27B0

Data Model

flowchart LR
    Claim["🗂️ Claim"] -->|"assigned to"| SchemaSet["📂 SchemaSet"]
    SchemaSet -->|"contains"| Schema["🗎 Schema"]
    Schema -->|"stores .py file"| Blob["💾 Blob Storage"]

Schema — one per document type. Metadata in Cosmos DB, .py class file in Blob Storage.
SchemaSet — a named group that holds references to one or more Schemas. Assigned to a Claim at creation time.
A Schema can belong to multiple SchemaSets or none at all.
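The relationships above can be sketched with plain dataclasses. This is only an illustration of the cardinalities, not the accelerator's storage model: the field names, the second schema set (`ss-002`), and the unassigned schema (`zzz999`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    schema_id: str
    class_name: str
    file_name: str  # name of the .py class file kept in Blob Storage


@dataclass
class SchemaSet:
    schemaset_id: str
    name: str
    schema_ids: list[str] = field(default_factory=list)  # references, not copies


@dataclass
class Claim:
    claim_id: str
    schemaset_id: str  # assigned when the claim is created


def sets_containing(schema_id: str, schema_sets: list[SchemaSet]) -> list[str]:
    """Return the names of every schema set that references the given schema."""
    return [s.name for s in schema_sets if schema_id in s.schema_ids]


police_report = Schema("def456", "PoliceReportDocument", "policereport.py")
draft = Schema("zzz999", "DraftSchema", "draft.py")  # hypothetical, in no set

fnol = SchemaSet("ss-001", "FNOL Auto Claims", ["abc123", "def456"])
theft = SchemaSet("ss-002", "Theft Claims", ["def456"])  # hypothetical second set
claim = Claim("claim-1", fnol.schemaset_id)

shared = sets_containing(police_report.schema_id, [fnol, theft])
unused = sets_containing(draft.schema_id, [fnol, theft])
```

Here `shared` lists both sets and `unused` is empty, which mirrors the rule that a Schema can belong to multiple SchemaSets or none at all.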

Step 1: Create Schema Class (.py)

Create a new class that defines the schema as a strongly typed Python class inheriting from Pydantic BaseModel.

Schema Folder: /src/ContentProcessorAPI/samples/schemas/ — All schema classes should be placed into this folder

Sample Schemas: The accelerator ships with 4 sample schemas — use any as a starting template:

Schema	File	Class Name	Auto-registered
Auto Insurance Claim Form	autoclaim.py	`AutoInsuranceClaimForm`	✅
Police Report	policereport.py	`PoliceReportDocument`	✅
Repair Estimate	repairestimate.py	`RepairEstimateDocument`	✅
Damaged Vehicle Image	damagedcarimage.py	`DamagedVehicleImageAssessment`	✅

Note: All 4 schemas are automatically registered during deployment (via azd up or the register_schema.py script) and grouped into the "Auto Claim" schema set.

Duplicate one of these files and update with a class definition that represents your document type.

Tip: You can use GitHub Copilot to generate a schema. Example prompt:

Generate a Schema Class based on the following autoclaim.py schema definition, which has been built and derived from Pydantic BaseModel class. The generated Schema Class should be called "Freight Shipment Bill of Lading" schema file. Please define the entities based on standard bill of lading documents in the logistics industry.

Class Structure

Each schema .py file must include:

from pydantic import BaseModel, Field
from typing import List, Optional

class SubModel(BaseModel):
    """Description of this sub-entity — used as LLM context."""
    
    field_name: Optional[str] = Field(
        description="What this field represents, e.g. Consignee company name"
    )

class MyDocumentSchema(BaseModel):
    """Top-level description of the document type."""
    
    some_field: Optional[str] = Field(description="...")
    sub_entity: Optional[SubModel] = Field(description="...")
    
    @staticmethod
    def example() -> "MyDocumentSchema":
        """Returns an empty instance of this schema."""
        # SubModel defines no example() method, so build the empty sub-entity directly.
        return MyDocumentSchema(some_field="", sub_entity=SubModel(field_name=""))
    
    @staticmethod
    def from_json(json_str: str) -> "MyDocumentSchema":
        """Creates an instance from a JSON string."""
        return MyDocumentSchema.model_validate_json(json_str)
    
    def to_dict(self) -> dict:
        """Converts this instance to a dictionary."""
        return self.model_dump()

Key Rules

Element	Requirement
Inheritance	All classes must inherit from `pydantic.BaseModel`
Field descriptions	Every field must have a `description=` — this is the prompt text the LLM uses for extraction. Include examples for better accuracy (e.g., `"Date of loss, e.g. 01/15/2026"`)
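Optional vs Required	Use `Optional[str]` for fields that may not be present in every document. In Pydantic v2, an `Optional` field without a default must still appear in the input, but its value may be `null`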
Subclasses	Use nested `BaseModel` classes for complex entities (address, line items, etc.)
Required methods	`example()`, `from_json()`, `to_dict()` — all three must be present
Class docstring	Include a description — it's used as context during mapping
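One way to catch rule violations before registering a file is a small static check. The sketch below is a hypothetical helper, not part of the accelerator: it parses the source with the standard-library `ast` module (so Pydantic does not need to be installed) and only recognizes classes that list `BaseModel` directly as a base.

```python
import ast

REQUIRED_METHODS = {"example", "from_json", "to_dict"}


def check_schema_source(source: str, top_level_class: str) -> list[str]:
    """Return a list of Key Rules violations found in a schema .py source string."""
    problems = []
    tree = ast.parse(source)
    classes = [node for node in tree.body if isinstance(node, ast.ClassDef)]
    for cls in classes:
        base_names = {ast.unparse(base) for base in cls.bases}
        if not base_names & {"BaseModel", "pydantic.BaseModel"}:
            problems.append(f"{cls.name}: does not inherit from BaseModel")
        if ast.get_docstring(cls) is None:
            problems.append(f"{cls.name}: missing class docstring")
        for stmt in cls.body:
            # Annotated assignments are the schema fields.
            if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name):
                value = stmt.value
                has_description = isinstance(value, ast.Call) and any(
                    kw.arg == "description" for kw in value.keywords
                )
                if not has_description:
                    problems.append(f"{cls.name}.{stmt.target.id}: no Field description")
        if cls.name == top_level_class:
            methods = {n.name for n in cls.body if isinstance(n, ast.FunctionDef)}
            for missing in sorted(REQUIRED_METHODS - methods):
                problems.append(f"{cls.name}: missing method {missing}()")
    if top_level_class not in {cls.name for cls in classes}:
        problems.append(f"class {top_level_class} not found")
    return problems


# A deliberately incomplete schema used to exercise the checks.
SAMPLE = '''
from pydantic import BaseModel, Field
from typing import Optional

class Shipment(BaseModel):
    carrier: Optional[str] = Field(description="Carrier name, e.g. Contoso Freight")
    weight_kg: Optional[float] = None

    def to_dict(self) -> dict:
        return self.model_dump()
'''

problems = check_schema_source(SAMPLE, "Shipment")
for line in problems:
    print(line)
```

For the sample above the check reports the missing docstring, the `weight_kg` field without a description, and the missing `example()` and `from_json()` methods.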

Step 2: Register Schemas

After creating your .py class files, register each schema in the system. Registration uploads the class file to Blob Storage and stores metadata in Cosmos DB.

Option A: Register via API (individual)

Endpoint: POST /schemavault/ (multipart/form-data)

Part	Type	Description
`schema_info`	JSON string	`{"ClassName": "MyDocumentSchema", "Description": "My Document"}`
`file`	File upload	The `.py` class file (max 1 MB)

Sample requests for the REST Client extension: /src/ContentProcessorAPI/test_http/invoke_APIs.http

Note: Install the REST Client extension for VS Code to execute .http files directly in the editor.

The response returns a Schema Id — save this for Step 3.
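As a rough illustration of the request shape described in the table above, the sketch below builds the multipart body with the Python standard library only. The base URL `http://localhost:8000`, the file name `mydocument.py`, the `text/x-python` content type, and the reading of "1 MB" as 1,048,576 bytes are assumptions for illustration. The request is built but not sent.

```python
import json
import urllib.request
import uuid

MAX_FILE_BYTES = 1024 * 1024  # assumed interpretation of the 1 MB limit


def build_register_request(
    base_url: str, class_name: str, description: str,
    file_name: str, file_bytes: bytes,
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /schemavault/."""
    if len(file_bytes) > MAX_FILE_BYTES:
        raise ValueError("schema file exceeds the 1 MB upload limit")
    boundary = uuid.uuid4().hex
    schema_info = json.dumps({"ClassName": class_name, "Description": description})
    body = b"".join([
        # Part 1: schema_info, sent as a JSON string.
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="schema_info"\r\n\r\n',
        schema_info.encode("utf-8"), b"\r\n",
        # Part 2: the .py class file.
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode(),
        b"Content-Type: text/x-python\r\n\r\n",
        file_bytes, b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/schemavault/",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


request = build_register_request(
    "http://localhost:8000",  # hypothetical API base URL
    "MyDocumentSchema",
    "My Document",
    "mydocument.py",
    b"class MyDocumentSchema: ...\n",  # placeholder file content
)
# To send it against a running API: urllib.request.urlopen(request)
```

Building the body by hand keeps the example dependency-free; an HTTP client library with multipart support would usually be the more convenient choice.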

Option B: Register via script (batch)

Note: The default sample schemas are registered automatically during azd up via the post-provisioning hook. You only need to run the script manually if you are adding custom schemas or if automatic registration was skipped.

For bulk registration, use the provided script with a JSON manifest. The script performs three steps automatically:

1. Registers individual schema files via /schemavault/
2. Creates a schema set via /schemasetvault/
3. Adds each registered schema into the schema set

Manifest file (schema_info.json):

{
  "schemas": [
    { "File": "autoclaim.py",       "ClassName": "AutoInsuranceClaimForm",       "Description": "Auto Insurance Claim Form" },
    { "File": "damagedcarimage.py", "ClassName": "DamagedVehicleImageAssessment","Description": "Damaged Vehicle Image Assessment" },
    { "File": "policereport.py",    "ClassName": "PoliceReportDocument",         "Description": "Police Report Document" },
    { "File": "repairestimate.py",  "ClassName": "RepairEstimateDocument",       "Description": "Repair Estimate Document" }
  ],
  "schemaset": {
    "Name": "Auto Claim",
    "Description": "Claim schema set for auto claims processing"
  }
}

Run the script:

cd src/ContentProcessorAPI/samples/schemas
python register_schema.py <API_BASE_URL> schema_info.json

The script checks for existing schemas and schema sets to avoid duplicates, and outputs the registered Schema IDs and Schema Set ID.
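The duplicate-avoiding flow can be sketched as follows. This is not the code of register_schema.py: the response field names (`Id`, `ClassName`, `Name`) are assumptions, the multipart upload is replaced by a plain dict, and `FakeApi` is a hypothetical in-memory stand-in so the flow can run without a server.

```python
from typing import Any, Callable

# send(method, path, payload) returns the parsed JSON response. It is injected
# so the flow can be exercised without a running API.
Transport = Callable[[str, str, dict], Any]


def register_manifest(send: Transport, manifest: dict) -> dict:
    """Register schemas, create the schema set, and link them, skipping duplicates."""
    # Step 1: register each schema unless its class name is already known.
    existing = {s["ClassName"]: s["Id"] for s in send("GET", "/schemavault/", {})}
    schema_ids = []
    for entry in manifest["schemas"]:
        schema_id = existing.get(entry["ClassName"])
        if schema_id is None:
            # The real endpoint expects a multipart upload; a dict stands in here.
            schema_id = send("POST", "/schemavault/", entry)["Id"]
        schema_ids.append(schema_id)

    # Step 2: reuse a schema set with the same name, or create one.
    wanted = manifest["schemaset"]
    sets = {s["Name"]: s["Id"] for s in send("GET", "/schemasetvault/", {})}
    set_id = sets.get(wanted["Name"])
    if set_id is None:
        set_id = send("POST", "/schemasetvault/", wanted)["Id"]

    # Step 3: add only the schemas that are not already members.
    members_path = f"/schemasetvault/{set_id}/schemas"
    members = {s["Id"] for s in send("GET", members_path, {})}
    for schema_id in schema_ids:
        if schema_id not in members:
            send("POST", members_path, {"SchemaId": schema_id})
    return {"schema_ids": schema_ids, "schemaset_id": set_id}


class FakeApi:
    """In-memory stand-in for the API, used only to demonstrate the flow."""

    def __init__(self) -> None:
        self.schemas = [{"Id": "abc123", "ClassName": "AutoInsuranceClaimForm"}]
        self.sets: list[dict] = []
        self.members: dict[str, list[str]] = {}
        self.posts = 0

    def __call__(self, method: str, path: str, payload: dict) -> Any:
        self.posts += method == "POST"
        if path == "/schemavault/":
            if method == "GET":
                return list(self.schemas)
            record = {"Id": f"id-{len(self.schemas)}", "ClassName": payload["ClassName"]}
            self.schemas.append(record)
            return record
        if path == "/schemasetvault/":
            if method == "GET":
                return list(self.sets)
            record = {"Id": "ss-001", "Name": payload["Name"]}
            self.sets.append(record)
            self.members[record["Id"]] = []
            return record
        set_id = path.split("/")[2]  # /schemasetvault/{set_id}/schemas
        if method == "GET":
            return [{"Id": schema_id} for schema_id in self.members[set_id]]
        self.members[set_id].append(payload["SchemaId"])
        return {"Id": set_id}


manifest = {
    "schemas": [
        {"File": "autoclaim.py", "ClassName": "AutoInsuranceClaimForm",
         "Description": "Auto Insurance Claim Form"},
        {"File": "policereport.py", "ClassName": "PoliceReportDocument",
         "Description": "Police Report Document"},
    ],
    "schemaset": {"Name": "Auto Claim",
                  "Description": "Claim schema set for auto claims processing"},
}

api = FakeApi()
first = register_manifest(api, manifest)
second = register_manifest(api, manifest)  # a second run creates nothing new
```

Running the flow twice returns the same IDs both times, and the second run issues no POST requests, which is the behavior the duplicate check is meant to provide.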

Schema API Reference

Method	Endpoint	Purpose
`GET`	`/schemavault/`	List all registered schemas
`POST`	`/schemavault/`	Register a new schema (multipart upload)
`PUT`	`/schemavault/`	Update an existing schema
`DELETE`	`/schemavault/`	Delete a schema by ID
`GET`	`/schemavault/schemas/{schema_id}`	Get a schema by ID (includes `.py` file)

Step 3: Create SchemaSet and Add Schemas

A SchemaSet groups your registered schemas together for claim processing. When a claim is created, it is assigned a SchemaSet — the Web UI presents the schemas within the set as available document types for upload.

3a. Create a SchemaSet

Endpoint: POST /schemasetvault/

{
  "Name": "FNOL Auto Claims",
  "Description": "Schemas for auto insurance FNOL claim processing"
}

The response returns a SchemaSet Id — use this in the next step.

3b. Add Schemas to the SchemaSet

Endpoint: POST /schemasetvault/{schemaset_id}/schemas

For each schema registered in Step 2, add it to the set:

{
  "SchemaId": "abc123"
}

Repeat for each schema. The SchemaSet now holds references to all your document type schemas.

SchemaSet API Reference

Method	Endpoint	Purpose
`GET`	`/schemasetvault/`	List all schema sets
`POST`	`/schemasetvault/`	Create a new schema set
`GET`	`/schemasetvault/{schemaset_id}`	Get a schema set by ID
`DELETE`	`/schemasetvault/{schemaset_id}`	Delete a schema set
`GET`	`/schemasetvault/{schemaset_id}/schemas`	List schemas in a set
`POST`	`/schemasetvault/{schemaset_id}/schemas`	Add a schema to a set
`DELETE`	`/schemasetvault/{schemaset_id}/schemas/{schema_id}`	Remove a schema from a set

How Schemas Are Used at Runtime

Once schemas are registered and grouped into a SchemaSet, the pipeline uses them automatically during the Map step:

1. Schema lookup: The Map handler reads the Schema_Id from the processing queue message, then fetches the schema metadata from Cosmos DB
2. Dynamic class loading: Downloads the .py file from Blob Storage and dynamically loads the Pydantic class
3. JSON Schema generation: Calls model_json_schema() on the class to produce a full JSON Schema with all field descriptions
4. LLM extraction: Embeds the JSON Schema into the GPT-5.1 system prompt with response_format for structured JSON output (temperature=0.1 keeps results close to deterministic, though not strictly so)
5. Validation & scoring: Parses the GPT response back into the Pydantic class, then computes per-field confidence scores using log-probabilities
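Dynamic class loading and JSON Schema generation can be sketched with `importlib` and Pydantic v2. This is an illustration under assumptions, not the accelerator's loader: the `BillOfLading` class and the `load_schema_class` helper are hypothetical, and a temporary file stands in for the download from Blob Storage.

```python
import importlib.util
import tempfile
from pathlib import Path

# Stand-in for a schema file that would normally be downloaded from Blob Storage.
SCHEMA_SOURCE = '''
from typing import Optional
from pydantic import BaseModel, Field

class BillOfLading(BaseModel):
    """Freight shipment bill of lading."""

    consignee: Optional[str] = Field(
        description="Consignee company name, e.g. Contoso Freight"
    )
'''


def load_schema_class(py_path: Path, class_name: str):
    """Import a .py file from disk and return the named class."""
    spec = importlib.util.spec_from_file_location(py_path.stem, py_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, class_name)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "billoflading.py"
    path.write_text(SCHEMA_SOURCE, encoding="utf-8")
    schema_cls = load_schema_class(path, "BillOfLading")

# The class docstring and every field description end up in the JSON Schema.
json_schema = schema_cls.model_json_schema()
consignee_description = json_schema["properties"]["consignee"]["description"]
print(json_schema["description"])
print(consignee_description)
```

Because loading a schema file executes its top-level code, only files from trusted sources should be registered.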

This means your field descriptions in the schema class directly influence extraction quality — write clear, specific descriptions with examples for best results.
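The scoring step listed above relies on token log-probabilities. The document does not give the accelerator's formula, so the sketch below uses one common choice as an assumption: the geometric mean of the token probabilities that make up a field's value, which equals exp of the mean log-probability. The field names come from the AutoInsuranceClaimForm example and the probabilities are made up.

```python
import math


def field_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of token probabilities, i.e. exp(mean log-probability)."""
    if not token_logprobs:
        return 0.0  # no tokens means no evidence for the value
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities for the tokens of two extracted values.
field_token_logprobs = {
    "policy_number": [math.log(0.9), math.log(0.9)],
    "date_of_loss": [math.log(0.5), math.log(0.8)],
}

scores = {
    name: round(field_confidence(logprobs), 3)
    for name, logprobs in field_token_logprobs.items()
}
print(scores)
```

A single uncertain token pulls the field score down, so low scores are a reasonable signal for routing a field to human review.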
