Tuesday, August 05, 2025

Steps for training a custom document extraction model on Azure AI

Below is a step-by-step guide to using bounding boxes to train custom document models in Azure Vision + Document AI:

Step 1: Prepare Your Document Samples

Collect at least five sample documents (5-10 is a good starting point) that are representative of the type you want the model to learn. Ensure the documents contain the fields or visual elements you want to extract (e.g., invoice numbers, tables, checkboxes).

Step 2: Upload Documents to Azure Document Intelligence Studio or AI Foundry Portal

Navigate to Azure Document Intelligence Studio or the AI Foundry portal. Create a new custom model project and upload your sample documents there.
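
If your training documents live in Azure Blob Storage (the custom-model training APIs read from a blob container via a SAS URL, and a Studio project can be connected to one), a minimal upload sketch using the azure-storage-blob package might look like the following. The connection string, container name, and local folder are placeholders you would replace with your own values.

```python
# pip install azure-storage-blob
from pathlib import Path
from azure.storage.blob import ContainerClient

# Placeholder connection string and container name -- substitute your own.
container = ContainerClient.from_connection_string(
    conn_str="<your-storage-connection-string>",
    container_name="training-docs",
)

# Upload every sample PDF from a local folder into the container that the
# Document Intelligence Studio project will point at.
for pdf in Path("samples").glob("*.pdf"):
    with pdf.open("rb") as data:
        container.upload_blob(name=pdf.name, data=data, overwrite=True)
        print(f"uploaded {pdf.name}")
```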

Step 3: Annotate the Documents with Bounding Boxes

Open each document in the annotation tool. Use the interface to draw bounding boxes around each field or element you want your model to detect. For example, draw a rectangle around the "Invoice Number" field or the table area. Assign a meaningful label/tag to each bounding box (e.g., "InvoiceNumber," "TotalAmount," "Table").
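
Under the hood, the Studio stores each document's annotations in a companion label file (typically named `<document>.labels.json`) in your storage container. The exact schema can vary between Studio and API versions, so treat the sketch below as an illustrative approximation: the field names and coordinates are made up, and each bounding box is eight numbers (four x,y corner points) normalized to the page width and height.

```python
import json

# Illustrative approximation of a label file for one annotated document.
# "InvoiceNumber" and "TotalAmount" are hypothetical field names; each
# bounding box is [x1, y1, x2, y2, x3, y3, x4, y4], normalized to 0-1.
labels = {
    "document": "invoice_001.pdf",
    "labels": [
        {
            "label": "InvoiceNumber",
            "value": [
                {
                    "page": 1,
                    "text": "INV-1042",
                    "boundingBoxes": [
                        [0.62, 0.08, 0.74, 0.08, 0.74, 0.10, 0.62, 0.10]
                    ],
                }
            ],
        },
        {
            "label": "TotalAmount",
            "value": [
                {
                    "page": 1,
                    "text": "$1,250.00",
                    "boundingBoxes": [
                        [0.70, 0.85, 0.80, 0.85, 0.80, 0.87, 0.70, 0.87]
                    ],
                }
            ],
        },
    ],
}

with open("invoice_001.pdf.labels.json", "w") as f:
    json.dump(labels, f, indent=2)
```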

Step 4: Review and Adjust Annotations

Carefully review each bounding box for accuracy and completeness. Adjust sizes and positions as needed so each box tightly encloses the relevant text or visual element.
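
Most of this review happens in the Studio UI, but if you download the label files from your container (and assuming the approximate format sketched in Step 3), a quick sanity check from code can catch missing fields or out-of-range boxes before training:

```python
import json
from pathlib import Path

# Hypothetical set of fields every document is expected to have labeled.
REQUIRED_LABELS = {"InvoiceNumber", "TotalAmount"}

# Walk the downloaded label files and flag obvious problems: missing fields
# and bounding-box coordinates outside the normalized 0-1 range.
for label_file in Path("samples").glob("*.labels.json"):
    data = json.loads(label_file.read_text())
    seen = {entry["label"] for entry in data.get("labels", [])}
    missing = REQUIRED_LABELS - seen
    if missing:
        print(f"{label_file.name}: missing labels {sorted(missing)}")
    for entry in data.get("labels", []):
        for value in entry.get("value", []):
            for box in value.get("boundingBoxes", []):
                if any(not 0.0 <= coord <= 1.0 for coord in box):
                    print(f"{label_file.name}: {entry['label']} has out-of-range box {box}")
```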

Step 5: Train the Custom Model

Once all documents are annotated, start the training process. The AI will learn to recognize visually similar regions and extract text or data associated with each labeled bounding box.
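
If you prefer to start training from code rather than the Studio UI, here is a minimal sketch using the azure-ai-formrecognizer Python SDK (v3.3; the newer azure-ai-documentintelligence package exposes a similar but not identical surface). The endpoint, key, SAS URL, and model ID are placeholders.

```python
# pip install azure-ai-formrecognizer
from azure.ai.formrecognizer import DocumentModelAdministrationClient, ModelBuildMode
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key -- substitute your own resource values.
admin_client = DocumentModelAdministrationClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# The container must hold the documents plus the label files produced in Step 3.
poller = admin_client.begin_build_document_model(
    build_mode=ModelBuildMode.TEMPLATE,  # or ModelBuildMode.NEURAL for the custom neural model
    blob_container_url="<sas-url-to-training-container>",
    model_id="my-invoice-model",
    description="Custom invoice extraction model",
)
model = poller.result()
print(model.model_id, model.created_on)
```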

Step 6: Evaluate the Model

Test the model using a set of new, unseen documents. Review the extracted fields to check accuracy and completeness. If necessary, add more labeled documents or refine annotations and retrain.
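
A quick way to spot-check the trained model from code (same SDK assumption as in Step 5, with placeholder endpoint, key, and model ID) is to run an unseen document through it and print each extracted field with its confidence score:

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Analyze an unseen test document with the model trained in Step 5.
with open("test_invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document(model_id="my-invoice-model", document=f)
result = poller.result()

# Inspect each extracted field and its confidence score.
for doc in result.documents:
    for name, field in doc.fields.items():
        print(f"{name}: {field.content!r} (confidence {field.confidence:.2f})")
```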

Step 7: Deploy and Use the Model

When satisfied with the model’s performance, deploy it via the Azure portal. You can now integrate the model through APIs or SDKs to automate document processing in your applications.
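
For integration without an SDK, the same model can also be called over the raw REST API. The sketch below assumes the 2023-07-31 GA api-version and the formrecognizer URL path; newer api-versions use a slightly different route, so check which version your resource supports. Endpoint, key, and model ID are placeholders.

```python
import time
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"
MODEL_ID = "my-invoice-model"

# Submit a document for analysis with the custom model.
with open("incoming_invoice.pdf", "rb") as f:
    resp = requests.post(
        f"{ENDPOINT}/formrecognizer/documentModels/{MODEL_ID}:analyze",
        params={"api-version": "2023-07-31"},
        headers={
            "Ocp-Apim-Subscription-Key": KEY,
            "Content-Type": "application/pdf",
        },
        data=f.read(),
    )
resp.raise_for_status()

# The service processes asynchronously: poll the Operation-Location URL.
operation_url = resp.headers["Operation-Location"]
while True:
    status = requests.get(
        operation_url, headers={"Ocp-Apim-Subscription-Key": KEY}
    ).json()
    if status["status"] in ("succeeded", "failed"):
        break
    time.sleep(2)

print(status["status"])
# On success, the extracted fields live under analyzeResult -> documents[0] -> fields.
```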

This bounding-box annotation process is crucial for training effective custom document AI models in Azure Vision + Document AI, ensuring the system understands exactly where and what information to extract from documents.

Azure Vision + Document AI supports two main types of custom models:

  1. Custom Template Model (formerly Custom Form Model): Best for documents with a consistent and static layout or visual template (e.g., questionnaires, structured forms, applications). Extracts labeled key-value pairs, selection marks (checkboxes), tables, signature fields, and regions from documents with little variation in structure.
  2. Custom Neural Model (also called Custom Document Model): Designed for documents with more layout variation, including structured, semi-structured, or unstructured document types (e.g., invoices, receipts, purchase orders). Uses a deep learning model pre-trained on a large, diverse corpus of documents and fine-tuned on your labeled dataset. Recommended for higher accuracy and advanced extraction scenarios when documents vary in layout or complexity.

The custom neural model in Azure Vision + Document AI is based on Microsoft's proprietary deep learning architecture designed specifically for document understanding. It is a deep learning model trained on a large collection of documents and then fine-tuned on your labeled dataset to recognize key-value pairs, tables, selection marks, and signatures in structured, semi-structured, and unstructured documents.

Behind the scenes, the architecture likely combines convolutional neural networks (traditional computer-vision CNNs, in the spirit of detectors like YOLO) for layout/visual understanding with transformer-based LMMs (large multimodal models) or sequence models for text and contextual understanding. This hybrid use of vision and language models is what enables the service to process multimodal inputs (visual layout plus text) effectively.
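
From the developer's point of view, the choice between these two custom model types boils down to a single training parameter. Assuming the azure-ai-formrecognizer SDK used in Step 5, it is just the build mode you pass when building the model:

```python
from azure.ai.formrecognizer import ModelBuildMode

# Template model: consistent, static layouts (forms, questionnaires).
build_mode = ModelBuildMode.TEMPLATE

# Neural model: variable or semi-structured layouts; typically higher accuracy
# at the cost of longer training time.
build_mode = ModelBuildMode.NEURAL

# Pass `build_mode` to begin_build_document_model() as shown in Step 5.
```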

Important Note: Before you embark on creating a custom fine-tuned neural model, check whether your use case can be satisfied by the prebuilt models (which will be true for roughly 90% of use cases).

Many use cases can be covered simply by using the layout analysis model with the optional query string parameter features=keyValuePairs enabled.
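
In the Python SDK (again assuming azure-ai-formrecognizer v3.3+), that REST query parameter corresponds to the features argument on begin_analyze_document. A rough sketch, with placeholder endpoint and key:

```python
from azure.ai.formrecognizer import AnalysisFeature, DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Run the prebuilt layout model with key-value pair extraction enabled,
# the SDK equivalent of features=keyValuePairs on the REST API.
with open("form.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        document=f,
        features=[AnalysisFeature.KEY_VALUE_PAIRS],
    )
result = poller.result()

for kv in result.key_value_pairs:
    key = kv.key.content if kv.key else ""
    value = kv.value.content if kv.value else ""
    print(f"{key}: {value}")
```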

