I am encountering the error below when attempting to train a Form model, using multipage PDF files as the source. These files have already been optimized to reduce the data file size. Each of the 5 PDFs used in the attempted training contains no more than 50 pages.
"Fields could not be loaded for this document. It looks like this PDF document has many pages to process. We recommend that you split the PDF with only the pages you need to extract data from and upload the reduced document instead."
Does manually extracting the pages with the desired data for the training model negatively impact the real-world use of the model, which won't have pages manually extracted for future documents?
Hi @oldhamjr.
Thank you for posting this question, and apologies that you are experiencing this.
Unfortunately, a PDF with 50 pages might be too many pages to train a model today. Do you need the model to extract data from all of those pages? If not, you can generate 5 sample PDFs for training that contain only the pages with the data you want to extract.
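If it helps, here is a rough sketch of how the reduced training samples could be produced outside of AI Builder. It is not part of AI Builder itself: it assumes the third-party `pypdf` package is installed, and the function names and file names are hypothetical.

```python
from typing import Iterable, List


def to_zero_based(pages: Iterable[int], total_pages: int) -> List[int]:
    """Validate 1-based page numbers and convert them to 0-based indexes."""
    indexes = []
    for page in pages:
        if not 1 <= page <= total_pages:
            raise ValueError(f"page {page} is outside 1..{total_pages}")
        indexes.append(page - 1)
    return indexes


def extract_pages(src_path: str, dst_path: str, pages: Iterable[int]) -> None:
    """Write a new PDF containing only the requested 1-based pages."""
    # Assumption: pypdf is installed. It is imported here so that the
    # validation helper above can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in to_zero_based(pages, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as out_file:
        writer.write(out_file)
```

For example, `extract_pages("invoice_full.pdf", "invoice_sample.pdf", [2, 7])` would write a two-page sample for training, where both file names are placeholders.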
Once the model is trained, you can use the model in a cloud flow in Power Automate and you will be able to specify which pages to process as described here. This will also help reduce the cost, as the cost of using AI Builder is per page.
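To illustrate the per-page point, here is a minimal sketch that counts how many pages a page range covers. It assumes the range is written as a single page such as `2` or a single range such as `3-5`; the helper name is hypothetical and is not an AI Builder function.

```python
def pages_in_range(page_range: str) -> int:
    """Count the pages covered by a value like "2" or "3-5" (assumed syntax)."""
    first, separator, last = page_range.strip().partition("-")
    start = int(first)
    # A single page number has no separator, so it starts and ends on itself.
    end = int(last) if separator else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_range!r}")
    return end - start + 1


# Processing only pages 3-5 of a 50-page file means 3 pages instead of 50.
full_document = pages_in_range("1-50")
reduced = pages_in_range("3-5")
print(f"pages processed: {reduced} instead of {full_document}")
```

Since usage is counted per page, the ratio between the two numbers gives a rough idea of the saving.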
We're actively working so that, by the beginning of next year, PDFs with 50 pages or more won't be an issue when training a form processing model.
Thank you for the quick response! Your comments regarding the training model make sense. Unfortunately, I won't know which pages contain the data, since these files are provided by many different third-party sources. Single-page training with single-page evaluation by the Flow works well. Single-page training with a multi-page Flow doesn't produce usable results in my table. Performing the upfront manual work to extract single pages for the Flow to scan seems like the only option with this tool. This doesn't seem to provide an easy, low-effort solution for my end users. Is there an option for the AI model to evaluate a .txt file, if the data from the PDF is converted to text?
Thank you @oldhamjr for the feedback.
For documents with as many as 50 pages, the recommendation today is to use the page range option in the flow. We are working to enable processing of up to 500 pages at once by the beginning of next year.