Files processed by the solution are mapped and transformed into schemas — strongly typed Pydantic class definitions that represent a standardized output for each document type. For example, the accelerator includes an AutoInsuranceClaimForm schema with fields like policy_number, date_of_loss, and vehicle_information.
Using AI, the processing pipeline extracts content from each document (text, images, tables), then maps the extracted data into the schema fields using GPT-5.1 with structured JSON output — field descriptions in the schema class act as extraction guidance for the LLM.
Schemas must be created to match your business and domain requirements. Many document types are broadly common across industries, but defining your own schemas lets you capture variations specific to your use case.
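To illustrate how field descriptions drive extraction, here is a minimal, hypothetical Pydantic model (not the shipped `AutoInsuranceClaimForm`): calling `model_json_schema()` shows that each `description` travels verbatim into the JSON Schema that the pipeline embeds in the model prompt.

```python
from typing import Optional
from pydantic import BaseModel, Field


class MiniClaimForm(BaseModel):
    """Simplified auto claim form (illustrative only)."""

    policy_number: Optional[str] = Field(
        description="Policy number, e.g. POL-2024-001234"
    )
    date_of_loss: Optional[str] = Field(
        description="Date the loss occurred, e.g. 01/15/2026"
    )


# The generated JSON Schema carries each description verbatim;
# this is the per-field guidance the LLM sees during mapping.
props = MiniClaimForm.model_json_schema()["properties"]
print(props["policy_number"]["description"])
```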
Before processing documents, schemas must be registered in the system and grouped into schema sets. The diagram below shows the three-step preparation flow and how schemas are used at runtime:
```mermaid
flowchart TB
    subgraph Step1["<b>Step 1: Register Schemas</b> (one per document type)<br/>POST /schemavault/ × N"]
        S1["🗎 AutoInsuranceClaimForm<br/><i>autoclaim.py</i><br/>Schema ID: abc123"]
        S2["🗎 PoliceReportDocument<br/><i>policereport.py</i><br/>Schema ID: def456"]
        S3["🗎 RepairEstimateDocument<br/><i>repairestimate.py</i><br/>Schema ID: ghi789"]
        S4["🗎 ...<br/><i>more schemas</i>"]
    end
    subgraph Step2["<b>Step 2: Create SchemaSet</b><br/>POST /schemasetvault/"]
        SS["📂 SchemaSet: FNOL Auto Claims<br/>SchemaSet ID: ss-001<br/>Schemas: [ ] <i>(empty)</i>"]
    end
    subgraph Step3["<b>Step 3: Add Schemas to SchemaSet</b><br/>POST /schemasetvault/{ss-001}/schemas × N"]
        SS_FULL["📂 SchemaSet: FNOL Auto Claims (ss-001)"]
        SS_S1["🗎 AutoInsuranceClaimForm<br/>(abc123)"]
        SS_S2["🗎 PoliceReportDocument<br/>(def456)"]
        SS_S3["🗎 RepairEstimateDocument<br/>(ghi789)"]
        SS_FULL --- SS_S1
        SS_FULL --- SS_S2
        SS_FULL --- SS_S3
    end
    subgraph Runtime["<b>Runtime — Pipeline Map Step</b>"]
        R1["1. Look up Schema metadata<br/>from Cosmos DB"]
        R2["2. Download .py class file<br/>from Blob Storage"]
        R3["3. Dynamically load Pydantic class<br/>→ generate JSON Schema"]
        R4["4. Embed JSON Schema in<br/>GPT-5.1 prompt"]
        R5["5. Validate response with<br/>Pydantic → confidence scoring"]
        R1 --> R2 --> R3 --> R4 --> R5
    end
    S1 & S2 & S3 --> Step2
    Step2 --> Step3
    Step3 -->|"Claim created<br/>with SchemaSet"| Runtime
    style Step1 fill:#e8f4fd,stroke:#2196F3
    style Step2 fill:#fff3e0,stroke:#FF9800
    style Step3 fill:#e8f5e9,stroke:#4CAF50
    style Runtime fill:#f3e5f5,stroke:#9C27B0
```
```mermaid
flowchart LR
    Claim["🗂️ Claim"] -->|"assigned to"| SchemaSet["📂 SchemaSet"]
    SchemaSet -->|"contains"| Schema["🗎 Schema"]
    Schema -->|"stores .py file"| Blob["💾 Blob Storage"]
```
- Schema — one per document type. Metadata in Cosmos DB, `.py` class file in Blob Storage.
- SchemaSet — a named group that holds references to one or more Schemas. Assigned to a Claim at creation time.
- A Schema can belong to multiple SchemaSets or none at all.
Create a new class that defines the schema as a strongly typed Python class inheriting from Pydantic's `BaseModel`.
Schema Folder: `/src/ContentProcessorAPI/samples/schemas/` — place all schema classes in this folder.
Sample Schemas: The accelerator ships with 4 sample schemas — use any as a starting template:
| Schema | File | Class Name | Auto-registered |
|---|---|---|---|
| Auto Insurance Claim Form | `autoclaim.py` | `AutoInsuranceClaimForm` | ✅ |
| Police Report | `policereport.py` | `PoliceReportDocument` | ✅ |
| Repair Estimate | `repairestimate.py` | `RepairEstimateDocument` | ✅ |
| Damaged Vehicle Image | `damagedcarimage.py` | `DamagedVehicleImageAssessment` | ✅ |
Note: All 4 schemas are automatically registered during deployment (via `azd up` or the `register_schema.py` script) and grouped into the "Auto Claim" schema set.
Duplicate one of these files and update with a class definition that represents your document type.
Tip: You can use GitHub Copilot to generate a schema. Example prompt:
Generate a schema class based on the following autoclaim.py schema definition, which is built on the Pydantic BaseModel class. The generated class should define a "Freight Shipment Bill of Lading" schema file. Define the entities based on standard bill of lading documents in the logistics industry.
Each schema `.py` file must include:

```python
from pydantic import BaseModel, Field
from typing import List, Optional


class SubModel(BaseModel):
    """Description of this sub-entity — used as LLM context."""

    field_name: Optional[str] = Field(
        description="What this field represents, e.g. Consignee company name"
    )

    @staticmethod
    def example() -> "SubModel":
        """Returns an empty instance of this sub-entity."""
        return SubModel(field_name="")


class MyDocumentSchema(BaseModel):
    """Top-level description of the document type."""

    some_field: Optional[str] = Field(description="...")
    sub_entity: Optional[SubModel] = Field(description="...")

    @staticmethod
    def example() -> "MyDocumentSchema":
        """Returns an empty instance of this schema."""
        return MyDocumentSchema(some_field="", sub_entity=SubModel.example())

    @staticmethod
    def from_json(json_str: str) -> "MyDocumentSchema":
        """Creates an instance from a JSON string."""
        return MyDocumentSchema.model_validate_json(json_str)

    def to_dict(self) -> dict:
        """Converts this instance to a dictionary."""
        return self.model_dump()
```

| Element | Requirement |
|---|---|
| Inheritance | All classes must inherit from pydantic.BaseModel |
| Field descriptions | Every field must have a description= — this is the prompt text the LLM uses for extraction. Include examples for better accuracy (e.g., "Date of loss, e.g. 01/15/2026") |
| Optional vs Required | Use Optional[str] for fields that may not be present in every document |
| Subclasses | Use nested BaseModel classes for complex entities (address, line items, etc.) |
| Required methods | example(), from_json(), to_dict() — all three must be present |
| Class docstring | Include a description — it's used as context during mapping |
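The requirements above can be checked before registration. Below is a hypothetical lint helper (not part of the accelerator) that flags missing field descriptions, docstrings, and required methods on a Pydantic v2 class, using its `model_fields` metadata:

```python
from typing import Optional
from pydantic import BaseModel, Field


def check_schema_class(cls: type[BaseModel]) -> list[str]:
    """Hypothetical pre-registration lint: flag gaps against the
    requirements table (docstring, per-field descriptions, methods)."""
    problems = []
    if not cls.__doc__:
        problems.append("missing class docstring")
    for name, field in cls.model_fields.items():
        if not field.description:
            problems.append(f"field '{name}' has no description")
    for method in ("example", "from_json", "to_dict"):
        if not callable(getattr(cls, method, None)):
            problems.append(f"missing required method '{method}'")
    return problems


class BadSchema(BaseModel):
    """Demo schema that violates several requirements."""

    amount: Optional[str] = None  # no description= — flagged
    payee: Optional[str] = Field(description="Payee name")


# Flags the missing description plus all three missing methods.
print(check_schema_class(BadSchema))
```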
After creating your .py class files, register each schema in the system. Registration uploads the class file to Blob Storage and stores metadata in Cosmos DB.
Endpoint: POST /schemavault/ (multipart/form-data)
| Part | Type | Description |
|---|---|---|
| `schema_info` | JSON string | `{"ClassName": "MyDocumentSchema", "Description": "My Document"}` |
| `file` | File upload | The `.py` class file (max 1 MB) |
Example using the REST Client extension:
Note: Install the REST Client extension for VS Code to execute `.http` files directly in the editor.
Sample requests: /src/ContentProcessorAPI/test_http/invoke_APIs.http
The response returns a Schema Id — save this for Step 3.
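If you prefer to register from a script, the multipart request can be built with the standard library alone. This is a sketch based on the two parts documented above; the shape of the JSON response (where the Schema Id lives) depends on your deployment, so inspect it rather than assuming a field name:

```python
import json
import uuid
import urllib.request


def build_multipart(schema_info: str, filename: str, file_bytes: bytes):
    """Build a multipart/form-data body with the two parts the
    /schemavault/ endpoint expects: schema_info (a JSON string) and file."""
    boundary = uuid.uuid4().hex
    lines = [
        f"--{boundary}",
        'Content-Disposition: form-data; name="schema_info"',
        "",
        schema_info,
        f"--{boundary}",
        f'Content-Disposition: form-data; name="file"; filename="{filename}"',
        "Content-Type: text/x-python",
        "",
        file_bytes.decode("utf-8"),
        f"--{boundary}--",
        "",
    ]
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def register_schema(api_base: str, py_path: str, class_name: str, description: str):
    """POST the schema class file; returns the parsed JSON response,
    which carries the Schema Id needed for Step 3."""
    info = json.dumps({"ClassName": class_name, "Description": description})
    with open(py_path, "rb") as f:
        body, content_type = build_multipart(info, py_path, f.read())
    req = urllib.request.Request(
        f"{api_base}/schemavault/", data=body, method="POST",
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```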
Note: The default sample schemas are registered automatically during `azd up` via the post-provisioning hook. You only need to run the script manually if you are adding custom schemas or if automatic registration was skipped.
For bulk registration, use the provided script with a JSON manifest. The script performs three steps automatically:
- Registers individual schema files via `/schemavault/`
- Creates a schema set via `/schemasetvault/`
- Adds each registered schema into the schema set
Manifest file (schema_info.json):
```json
{
  "schemas": [
    { "File": "autoclaim.py", "ClassName": "AutoInsuranceClaimForm", "Description": "Auto Insurance Claim Form" },
    { "File": "damagedcarimage.py", "ClassName": "DamagedVehicleImageAssessment", "Description": "Damaged Vehicle Image Assessment" },
    { "File": "policereport.py", "ClassName": "PoliceReportDocument", "Description": "Police Report Document" },
    { "File": "repairestimate.py", "ClassName": "RepairEstimateDocument", "Description": "Repair Estimate Document" }
  ],
  "schemaset": {
    "Name": "Auto Claim",
    "Description": "Claim schema set for auto claims processing"
  }
}
```

Run the script:

```shell
cd src/ContentProcessorAPI/samples/schemas
python register_schema.py <API_BASE_URL> schema_info.json
```

The script checks for existing schemas and schema sets to avoid duplicates, and outputs the registered Schema IDs and Schema Set ID.
| Method | Endpoint | Purpose |
|---|---|---|
| `GET` | `/schemavault/` | List all registered schemas |
| `POST` | `/schemavault/` | Register a new schema (multipart upload) |
| `PUT` | `/schemavault/` | Update an existing schema |
| `DELETE` | `/schemavault/` | Delete a schema by ID |
| `GET` | `/schemavault/schemas/{schema_id}` | Get a schema by ID (includes the `.py` file) |
A SchemaSet groups your registered schemas together for claim processing. When a claim is created, it is assigned a SchemaSet — the Web UI presents the schemas within the set as available document types for upload.
Endpoint: POST /schemasetvault/
```json
{
  "Name": "FNOL Auto Claims",
  "Description": "Schemas for auto insurance FNOL claim processing"
}
```

The response returns a SchemaSet Id — use this in the next step.
Endpoint: POST /schemasetvault/{schemaset_id}/schemas
For each schema registered in Step 2, add it to the set:
```json
{
  "SchemaId": "abc123"
}
```

Repeat for each schema. The SchemaSet now holds references to all your document type schemas.
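Because the add call must be repeated per schema, it can help to plan the full call sequence up front. The sketch below (hypothetical helper names) produces the ordered list of `(url, payload)` pairs for the create-then-add flow; the `{schemaset_id}` placeholder is filled in once the create call returns its SchemaSet Id:

```python
import json
import urllib.request


def schema_set_requests(api_base, name, description, schema_ids):
    """Plan the calls as (url, payload) pairs: one POST to create the
    set, then one POST per registered schema to add it to the set.
    '{schemaset_id}' is substituted after the create call returns."""
    calls = [(f"{api_base}/schemasetvault/",
              {"Name": name, "Description": description})]
    for sid in schema_ids:
        calls.append((f"{api_base}/schemasetvault/{{schemaset_id}}/schemas",
                      {"SchemaId": sid}))
    return calls


def post_json(url, payload):
    """POST a JSON body and parse the JSON response."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```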
| Method | Endpoint | Purpose |
|---|---|---|
| `GET` | `/schemasetvault/` | List all schema sets |
| `POST` | `/schemasetvault/` | Create a new schema set |
| `GET` | `/schemasetvault/{schemaset_id}` | Get a schema set by ID |
| `DELETE` | `/schemasetvault/{schemaset_id}` | Delete a schema set |
| `GET` | `/schemasetvault/{schemaset_id}/schemas` | List schemas in a set |
| `POST` | `/schemasetvault/{schemaset_id}/schemas` | Add a schema to a set |
| `DELETE` | `/schemasetvault/{schemaset_id}/schemas/{schema_id}` | Remove a schema from a set |
Once schemas are registered and grouped into a SchemaSet, the pipeline uses them automatically during the Map step:
- Schema lookup — The Map handler reads the `Schema_Id` from the processing queue message, then fetches metadata from Cosmos DB
- Dynamic class loading — Downloads the `.py` file from Blob Storage and dynamically loads the Pydantic class
- JSON Schema generation — Calls `model_json_schema()` on the class to produce a full JSON Schema with all field descriptions
- LLM extraction — Embeds the JSON Schema into the GPT-5.1 system prompt with `response_format` for structured JSON output (temperature=0.1 for deterministic results)
- Validation & scoring — Parses the GPT response back into the Pydantic class, then computes per-field confidence scores using log-probabilities
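The dynamic-loading and JSON Schema steps can be sketched with the standard `importlib` machinery. This is an illustration of the technique under stated assumptions, not the accelerator's actual Map handler code:

```python
import importlib.util
from pydantic import BaseModel


def load_schema_class(py_path: str, class_name: str) -> type[BaseModel]:
    """Sketch of the runtime's dynamic-load step: import a downloaded
    .py file by path and pull the named Pydantic class out of it."""
    spec = importlib.util.spec_from_file_location("schema_module", py_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the schema file
    return getattr(module, class_name)


# The JSON Schema embedded in the prompt is then simply, e.g.:
# cls = load_schema_class("autoclaim.py", "AutoInsuranceClaimForm")
# json_schema = cls.model_json_schema()
```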
This means your field descriptions in the schema class directly influence extraction quality — write clear, specific descriptions with examples for best results.
- Modifying System Processing Prompts — Customize extraction and mapping prompts
- Gap Analysis Ruleset Guide — Define gap rules that reference your document types
- Processing Pipeline Approach — 4-stage extraction pipeline (Extract → Map → Evaluate → Save)
- API Documentation — Full API endpoint reference
- Claim Processing Workflow — End-to-end workflow architecture
