Troubles with multipage table data recognition

TFranchina · ‎09-27-2022

Hi,

I have created a model in order to extract 2 fields and one table of 2 columns. (537d99bf-3e12-428e-87d7-29061712abd4)

This table can range from a single line to spanning 4-5 pages. In the latter case, the layout of the first page is different from that of the following ones. I am having trouble extracting information beyond page 2.

As part of the current training, I have created collections of documents with tables spanning 1, 2, 3, and 4+ pages.

When tagging the table on pages 3 and 4, I get a notification in the model that tagging the multipage table on more than 2 pages can affect its capacity. Could that be related to the difficulty in extracting information?

In addition, this model is supposed to be linked to a flow that manipulates the extracted table and sends it via email. At the moment, we are trying to wrap the model in a Do-until loop to force it to process each page and aggregate the results in an array variable, but this is creating some issues in the flow as well.

Would you have any recommendations on model creation, set-up, or training to ensure it processes all pages?

Thank you in advance for your support

Thomas Franchina

JoeF-MSFT · ‎09-27-2022

Hi @TFranchina - thank you for sharing this.

Today, multipage table extraction is an experimental feature, and we're aware of some cases where the extraction might not work well, which seems to be your case. Processing page by page using this template, as you have done, is our recommendation: https://learn.microsoft.com/en-us/ai-builder/form-processing-multipage#use-a-cloud-flow-to-process-a...
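As a rough illustration of the page-by-page pattern (not the template itself, which is a Power Automate cloud flow), the aggregation logic can be sketched in Python. Everything named here is an assumption made for the example: `extract_page` is a hypothetical stand-in for running the model on a single page, `fake_extract_page` returns canned rows instead of calling any real service, and the `Item`/`Qty` column names are invented.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def aggregate_table_rows(page_count: int,
                         extract_page: Callable[[int], List[Row]]) -> List[Row]:
    """Run the extractor once per page and concatenate the rows in page order."""
    all_rows: List[Row] = []
    for page in range(1, page_count + 1):  # pages are numbered from 1
        all_rows.extend(extract_page(page))
    return all_rows


def fake_extract_page(page: int) -> List[Row]:
    """Hypothetical stand-in for a per-page model call; returns canned rows."""
    canned = {
        1: [{"Item": "A", "Qty": "1"}],
        2: [{"Item": "B", "Qty": "2"}, {"Item": "C", "Qty": "3"}],
    }
    return canned.get(page, [])  # a page with no table rows yields an empty list


rows = aggregate_table_rows(3, fake_extract_page)
print(len(rows))  # prints 3
```

Looping over a known page count, rather than looping until a condition is met, keeps the aggregation independent of what any single page returns, so an empty page does not end the run early.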

The good news is that in the second half of October we'll be releasing an improved version of multipage tables that should give you better results. See "Extract content from multipage tables in document processing" on Microsoft Learn. Stay tuned. 🙂

TFranchina · ‎01-04-2023

Hi @JoeF-MSFT ,

Best wishes for the new year. I'd like to revive this discussion, as I'm still facing issues with multipage table processing in AI Builder.

The improvements released in Oct-Nov 2022 really improved the stability of the system (👍), but I'm still having difficulty reaching the extraction quality I need for my use case.

I'm now at v5 of my model (ID: 2c786418-ba41-4c16-afbf-c419eadb57ce), trained with 30+ examples of structured documents, from which I want to extract:

one date field (99% accuracy)
one text field (99% accuracy)
2-column table with variable length -> that's where the trouble starts

I can't predict in advance the length of the table in the file. It can be 2 lines (1 header, 1 content row) or up to X pages full of rows with repeated headers. Usually X is 2 to 4, but it can go up to 9...

From the tests I've made, the table is recognized correctly up to 3-4 pages, but beyond that I lose significant accuracy and extraction quality.

Today I added to the collection the latest example I received, a table spanning 9 pages, tagging all lines manually. I launched the model training over the lunch break and, when I got back, checked the performance with a quick test on the exact same file. Only the first 3 pages of the table were recognized, nothing more, as if it had stopped... even though the collection contains many examples of tables spanning 3+ pages.

Any ideas for improvements I could try?

As I can't transfer a collection from one model to another, I'd like to avoid creating a v6 and starting the collection tagging all over again.

Thanks in advance for the support

T. Franchina

JoeF-MSFT · ‎01-04-2023

Hi @TFranchina - thank you for the nice wishes. Cheers to a fresh start and endless possibilities in 2023!

Let's try the following: edit your existing model and set the document type in the first step to Unstructured documents. This uses a newer AI technology that might give you better results. It works for both structured and unstructured document types.

Let us know how this goes!

TFranchina · ‎01-05-2023

Hi @JoeF-MSFT ,

Thank you for the prompt feedback. I modified the model accordingly and trained it again.

After several "quick tests" with the examples that had given me the worst results so far, this does indeed seem to solve the main issue, which was undetected cell content. 👍

I will continue to monitor the tests but, so far, no more empty cells. If anything, it's now the opposite: the results include the repeated table headers, which are not needed...

I'll keep you posted on the progress.
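For what it's worth, repeated header rows like these can be filtered out after extraction. The Python sketch below is only an illustration of the idea, not an AI Builder feature: it assumes each extracted row is available as a list of cell strings, and the function name and sample values are invented. In a flow, the equivalent would be a filter step on the extracted rows.

```python
from typing import List, Sequence


def drop_repeated_headers(rows: Sequence[Sequence[str]],
                          header: Sequence[str]) -> List[List[str]]:
    """Return the rows that do not match the header, ignoring case and outer spaces."""
    def normalize(cells: Sequence[str]) -> List[str]:
        return [cell.strip().casefold() for cell in cells]

    target = normalize(header)
    return [list(row) for row in rows if normalize(row) != target]


# Invented sample: the header reappears at the top of the second page.
extracted = [
    ["Item", "Qty"],
    ["A", "1"],
    [" item ", "QTY"],
    ["B", "2"],
]
cleaned = drop_repeated_headers(extracted, ["Item", "Qty"])
print(cleaned)  # prints [['A', '1'], ['B', '2']]
```

This removes every row that matches the header, including the first one, so it only suits cases where a genuine data row can never be identical to the header text.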

All the best

T.Franchina