Runner55552
Helper V

Character recognition poor in 1930s vintage forms

I attempted to train a model on some old, typewritten oil well permits from the 1930s. They are scanned PDF forms from a government agency, with typed answers generally in similar locations, so locating the fields works well enough. However, many of the answers were typed almost directly on the dotted lines of the forms instead of slightly above them. That seems to have caused poor character recognition, and a poor overall document processing model training score of 46%.

Any suggestions for improving this, such as removing all horizontal dotted lines within the PDF document? Or are there simply some applications that need a person to extract the information?

I tried using "Enhance" in Adobe Pro, but that did not help.

Any suggestions are appreciated.

Runner55552_0-1657569238873.png

 

JoeF-MSFT
Power Apps

Hi @Runner55552 - this seems to be a fascinating project, bringing AI to documents from the 1930s! 🙂

 

I'd recommend training a new model and selecting Unstructured and free-form documents as the type of document. It might perform better with this type of document, and it uses a more up-to-date version of our OCR engine, which will also be available for structured and semi-structured documents by September.

 

JoeFMSFT_0-1657572363299.png

 

With the sample screenshot you provided, I can already see some improvements when selecting Unstructured and free-form documents.

 

When selecting Unstructured and free-form documents:

JoeFMSFT_3-1657572565475.png

 

When selecting structured and semi-structured documents:

JoeFMSFT_2-1657572549588.png

 

 

Runner55552
Helper V

Thanks for testing and suggesting this. I will give this a try.  We do have some tables in our files also, and I see the unstructured option does not yet have the ability to model tables. However, I think I can work around that for now, since most tables only have up to 4 rows, so I can "flatten" the records and consider them as individual fields for now.
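
To make the "flatten" workaround concrete, here is a minimal Python sketch of the naming scheme it implies: each cell of a short table becomes its own field, named by column and row number. The column names (`depth`, `formation`) and the `column_row` naming pattern are made up for illustration; they are not taken from the permits or from AI Builder.

```python
def flatten_rows(rows, max_rows=4):
    """Turn a short list of table rows into one flat dict of individual fields.

    Each cell becomes a field named "<column>_<row number>", e.g. "depth_1".
    Rows beyond max_rows are ignored, matching the "up to 4 rows" assumption.
    """
    flat = {}
    for index, row in enumerate(rows[:max_rows], start=1):
        for column, value in row.items():
            flat[f"{column}_{index}"] = value
    return flat


# Hypothetical two-row table with made-up column names.
table = [
    {"depth": "1250", "formation": "Sand"},
    {"depth": "1400", "formation": "Lime"},
]
fields = flatten_rows(table)
# fields now holds depth_1, formation_1, depth_2 and formation_2 as separate entries.
```

With this scheme a 4-row, 2-column table needs 8 tagged fields, which stays manageable only while the tables remain small.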

One other item I noticed while selecting/tagging text, in both the structured and unstructured model set-up: even when I highlight only the desired text with the mouse, the text that appears as the value is the whole word/phrase that the model has already recognized as text. So I cannot remove extraneous text, tick marks, etc. at the end of a word by making the selection box smaller. Will this be a feature for future enhancement?

Update: I re-trained the model using the unstructured document option. The accuracy of FINDING the location of the fields improved greatly. However, there were still many issues with the actual character recognition. The biggest problems were:

1. Typewritten data was on or near a dotted line, producing phantom dots that were interpreted as periods or various other characters.

2. The ' and " tick marks used in the form for feet and inches were often misinterpreted as ! or 1 or other characters.

3. Many extra spaces were inserted in various locations, although I could probably use Power Automate to remove the white space.

4. Number values were often wrong, again due in part to the dotted lines or other marks.

 

If there were the ability to manually train by telling the model during training that Oil equals Oil instead of 011, that would be very useful.
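
Until something like that exists in training, one possible workaround is to correct values after extraction, for fields that can only contain a small set of known words. The Python sketch below is only an illustration under that assumption: it reduces both the OCR value and each allowed word to a "skeleton" in which easily confused characters are treated as the same, then looks the value up. The confusable pairs and the vocabulary (`Oil`, `Gas`, `Dry`) are examples I made up, not a list from the OCR engine, and this should not be applied to numeric fields.

```python
# Characters that typewritten OCR commonly confuses, mapped to one representative.
# This table is an illustrative assumption, not an exhaustive or official list.
_CONFUSABLE = str.maketrans({"o": "0", "i": "1", "l": "1", "!": "1"})


def skeleton(value: str) -> str:
    """Lower-case the value and collapse confusable characters together."""
    return value.strip().lower().translate(_CONFUSABLE)


def correct_from_vocabulary(ocr_value: str, vocabulary: list[str]) -> str:
    """Return the vocabulary word whose skeleton matches the OCR value.

    If nothing matches, the OCR value is returned unchanged.
    """
    lookup = {skeleton(term): term for term in vocabulary}
    return lookup.get(skeleton(ocr_value), ocr_value)


# Hypothetical closed vocabulary for a "well type" field.
WELL_TYPES = ["Oil", "Gas", "Dry"]
corrected = correct_from_vocabulary("011", WELL_TYPES)  # "Oil"
```

Values that match nothing are passed through untouched, so they can still be flagged for a person to review.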

 

Thanks again for the helpful suggestions.

JoeF-MSFT
Power Apps

Hi @Runner55552, thanks a lot for taking the time to provide back this update! Great to hear that when using unstructured documents as an option the finding of the location improved greatly.

 

Today there is no option to adjust the text selection down to the character level, nor an option to give feedback during tagging to correct wrongly detected characters. As you suggested, one option is to do post-processing in a Power Automate cloud flow, using expressions to remove white space, dots, and tick marks.
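
As a rough illustration of that kind of post-processing, here is a short Python sketch of the cleanup logic. It is not Power Automate expression syntax, and the rules are assumptions based on the problems listed in this thread: runs of dots are treated as dotted-line residue, repeated spaces are collapsed, and stray dots or tick marks at either end of the value are trimmed. It is meant for text fields only, since a trailing ' or " is meaningful in a feet-and-inches value.

```python
import re


def clean_text_field(raw: str) -> str:
    """Tidy a text value returned by OCR (not intended for measurements)."""
    # Two or more dots in a row, with or without spaces between them,
    # are assumed to come from the form's dotted line rather than the answer.
    text = re.sub(r"(?:\.\s*){2,}", " ", raw)
    # Collapse any run of white space into a single space.
    text = re.sub(r"\s+", " ", text)
    # Trim leftover spaces, dots, and tick marks from both ends.
    return text.strip(" .'\"`")


cleaned = clean_text_field("  John   Smith . . . .")  # "John Smith"
```

A single dot inside a value, such as a decimal point, is left alone because only runs of two or more dots are removed.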
