OCR input requirements

JWall · 07-29-2022

Hello,

I am attempting to create an AI model to recognise custom document information, and I am running into multiple problems with OCR correctly identifying text. I realise this is a known problem that is being addressed with ongoing updates to the OCR model, as per the Power Automate community forum post 'Problem with Model recognising Zero and letter O'.

I noticed that one of the input requirements for OCR is a font size of at least ~8pt for text to be read. When analysing the document type I am using to train the AI model, at standard size, the font is ~8pt.

Now, I can see where this could be a problem with a document that uses raster imaging, in which the number of pixels in an image is fixed, so the document appears to have lower resolution when you zoom in. The document type that I am attempting to analyse, however, appears to use vector imaging, in which shapes are defined by a set of geometric equations and stay sharp however far you zoom in.
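To put a rough number on the raster case: a point is 1/72 of an inch, so the pixel height of text depends on the resolution at which a page is rendered. The sketch below is only that arithmetic. Whether AI Builder rasterises a vector PDF before running OCR, and at what DPI, is not stated anywhere in this thread, so the DPI values are illustrative assumptions, and `text_height_px` is a made-up helper name.

```python
POINTS_PER_INCH = 72  # 1 pt is defined as 1/72 inch


def text_height_px(font_size_pt: float, dpi: float) -> float:
    """Approximate pixel height of a font's em box when rendered at `dpi`."""
    return font_size_pt * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    # Illustrative render resolutions only (assumed, not taken from AI Builder).
    for dpi in (72, 96, 150, 300):
        height = text_height_px(8, dpi)
        print(f"8 pt text at {dpi} dpi is roughly {height:.1f} px tall")
```

If the page is rasterised at a low resolution at any point, ~8pt text ends up only a handful of pixels tall, regardless of whether the source PDF was vector.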

My question: does the imaging technique used for a document affect the ability of OCR to correctly identify text? The issue is less prevalent with larger letters within the same document (which use the same imaging technique as the smaller letters).

Appreciate the help!

JoeF-MSFT · 07-30-2022

Hi @JWall - thanks for the question and the detailed analysis.

A few things we can try, to see whether any of them has an impact:

1. Navigate to the homepage of AI Builder (https://aka.ms/tryaibuilder) --> select Text recognition / Extract all the text in photos and PDF documents (OCR) --> Upload new. Do you get the data extracted as you would expect? The text recognition model has recently been updated with OCR improvements.
2. If you print the original PDF as a new PDF, and run the newly generated PDF through the text recognition model as in the previous step, do you see better results?
3. If you convert one of the pages of the PDF document into an image (for example, by taking a screenshot) and try text recognition again, do you notice any improvement?

JWall · 08-01-2022

Hi @JoeF-MSFT

Thanks for the reply!

TL;DR - Trying the different methods appeared to have no impact on the results. I am not sure if there are any other methods/variables to test. One feature I would suggest is allowing manual entries in AI Builder: you would still highlight the field you want the model to read, but if the model is unable to read the text correctly, there would be an option to manually edit the value read for the field. Feeding those corrections back into the MS OCR recognition model, so that it improves along with end-user AI models, would greatly help the robustness and flexibility of AI Builder.

Unfortunately, I am unable to upload as detailed a report as last time due to sensitive information; however, I did run through the scenarios you suggested. Using the default 'Extract all the text in photos and PDF documents (OCR)' model, I uploaded my original document, a version of the document that was printed as a new PDF (Print -> Save As), and a version that was taken as a screenshot and saved as a JPEG. After seeing the results, I also tried a version that was taken as a screenshot and saved as a PDF.

From a character count perspective, the results of what the AI read are as follows:

LEN(.pdforiginal)	799
LEN(.jpgss)	920
LEN(.pdfprint)	799
LEN(.pdfss)	500

The original PDF and the print -> Save As PDF yielded the same results. Interestingly, the screenshot -> JPEG had the highest character count, while the screenshot -> PDF had the lowest.

Comparing against the actual data, I am unable to get an exact character count for the original PDF without meticulously counting it myself. What I can tell is that none of the AI read results extracted the data as I would expect. For example, the sample document has a total of 18 'A's in a table (similar to that of my previous post). None of the AI read results showed any consistency in 1. detecting a 'word', or 2. correctly identifying the 'word'. Surprisingly, the JPEG version performed best at this specific task, correctly identifying ~8 'A's, but that is still not an adequate result. The original PDF appeared to correctly identify the most characters overall; the higher character count of the .jpg version is at least partly explained by false detections, a prime example being the .jpg version identifying a column line as an 'l'.
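Since raw character count rewards false detections, a comparison against a small hand-typed reference is a more telling check. The following is a minimal sketch using only the Python standard library; `detection_rate` and `similarity` are hypothetical helper names, and the sample strings are placeholders rather than output from the real documents.

```python
from difflib import SequenceMatcher


def detection_rate(ocr_text: str, token: str, expected_count: int) -> float:
    """Share of the expected occurrences of `token` that appear as
    whitespace-separated tokens in the OCR output (capped at 1.0)."""
    found = ocr_text.split().count(token)
    return min(found, expected_count) / expected_count


def similarity(ocr_text: str, ground_truth: str) -> float:
    """Character-level similarity in [0, 1] between OCR output and a
    hand-typed reference, via difflib's matching-blocks ratio."""
    return SequenceMatcher(None, ocr_text, ground_truth).ratio()


if __name__ == "__main__":
    # Placeholder strings for illustration only.
    reference = "A B A . A B"
    ocr_output = "A B l A . B"
    print("rate of 'A' detected:", detection_rate(ocr_output, "A", 3))
    print("similarity to reference:", round(similarity(ocr_output, reference), 3))
```

With the figures above, roughly 8 of 18 'A's detected corresponds to a detection rate of about 0.44, even though the same output had the highest character count.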

I am not sure if there are any other methods I could try to troubleshoot or test, other than patiently waiting for improvements to the character recognition AI model. One suggestion I would make, though, is to allow manual entries in AI Builder: you would still highlight the field you want the model to read, but if the model is unable to read the text correctly, there would be an option to manually edit the value read for the field. Feeding those corrections back into the MS OCR recognition model, so that it improves along with end-user AI models, would greatly help the robustness and flexibility of AI Builder.

Appreciate the help, and let me know if you have any more thoughts. Thanks!

JoeF-MSFT · 08-02-2022

Hi @JWall - I really appreciate the detailed investigation! And thanks for the suggestion to allow users to correct the detected words while tagging documents. This is indeed something we don't have today.

I'm curious about those 'A's that are not detected. I understand that the documents contain sensitive information. Would it be possible to share just a screenshot of a word where an 'A' is not detected? Or maybe a partial screenshot of that word?

JWall · 08-03-2022

Hi @JoeF-MSFT - Sure.

For this specific example, I have an array of letters in a table. The letters aren't always 'A', nor are they always laid out in a straight line. The first screenshot shows an example of a letter not being detected: all 'B's are detected by the OCR software except for the 'B' highlighted in red. The other 'B's that are either missing from or misplaced in the table on the right are correctly identified as text by OCR, and can easily be fixed by moving the column line. You may also notice that the array has '.' in some of the fields. These are detected with varying degrees of success. I am not too concerned about this, as '.' can be treated as a blank in my use case, but you may find it interesting for your purposes.

This is a slightly different example where the OCR is detecting a column line as text ('|'). Sometimes OCR will detect column lines as 'l'.
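As a stopgap for the column-line artefacts, the OCR tokens could be filtered after extraction. This is a rough sketch that assumes, based on the table described in this thread, that every legitimate cell is a single uppercase letter or a '.' placeholder; `clean_cells` is a hypothetical helper, and the rule would wrongly discard any genuine lowercase or multi-character cell values.

```python
import re

# Assumption: a valid cell holds exactly one uppercase letter or a '.'.
VALID_CELL = re.compile(r"[A-Z.]")


def clean_cells(tokens, blank_dots=True):
    """Drop tokens that cannot be table cells (e.g. '|' or 'l' read from a
    column line) and optionally turn '.' placeholders into blanks."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not VALID_CELL.fullmatch(token):
            continue  # not a single uppercase letter or '.', so skip it
        if blank_dots and token == ".":
            token = ""  # '.' is treated as an empty cell in this use case
        cleaned.append(token)
    return cleaned


if __name__ == "__main__":
    # Made-up token stream for illustration.
    print(clean_cells(["A", "|", "B", "l", ".", "B"]))
```

This only removes false detections. It does nothing for letters the OCR misses entirely, such as the 'B' highlighted in the first screenshot.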

Again, I want to highlight that I presumed this problem might be due to the OCR not being accurate with font sizes <= ~8pt, but I was curious whether the imaging type (vector vs raster) would have an effect on that. The PDFs used here were vector.

On a separate note, I am going to attempt to work around the need for OCR. The documents I was trying to use with OCR are all standard tables internal to our company. The thing is that they are 1. versions of pivot tables laid out to be easier for a human to read, and 2. in PDF format, and thus not as easy for a computer to read (the problem I was trying to solve with OCR). I am working with some people in my company to gather more information, but I would think there is some rawer data driving these PDF documents (strings in an array, a tabular format, or something like that), which a computer may have an easier time reading. If this data exists and I can get access to it, then I should be able to skip the process of running OCR and creating an AI model to recognise custom document information.

Anyway, I appreciate the help, and hopefully I was able to help with your inquiry.