Jay-Encodian

Extract data from documents with Microsoft Flow

Many of us have worked on projects or solutions where data must be extracted from documents. Common examples are processing a scanned document or processing documents sent from an external source, as is typical in invoice processing.

This step-by-step guide details how to configure a Microsoft Flow that extracts data from a document and adds it to the document as metadata.

For the purposes of this guide we'll use this simple scenario:

The finance department generates invoices using a third-party application which uploads the documents to a SharePoint library for storage. To enable invoice reporting, tracking and related activities, we need to extract data from each invoice and add it to the document as metadata.

The SharePoint library is configured as follows:

[Image: SharePoint library configuration]

Let's create the Flow:

1. Create a new Flow using the 'Automated -- from blank' option

1.png

2. Enter a name for the Flow, select the SharePoint 'When a file is created in a folder' trigger, and click 'Create'

2.png

3. Configure the 'When a file is created in a folder' trigger, setting the 'Site Address' and 'Folder Id' fields to the location where documents will be added.

3.png

NOTE: For this demo, documents will already be in PDF format. However, should you need to extract data from a Word document, PowerPoint file, CAD drawing, etc., simply convert it to PDF first using the Encodian 'Convert to PDF' action.

4. Add the Encodian 'Extract Text Regions' action

4.b. Filename: Select the 'File name' property from the 'When a file is created in a folder' action

5.png

4.c. File Content: Select the 'File Content' property from the 'When a file is created in a folder' action

6.png

To complete the configuration of the 'Extract Text Regions' action we need to provide the coordinates of the data on the source document, i.e. zonal extraction.

So how do we get the coordinates? Easy: simply use the 'Text Region Generator' utility found in the Encodian administration portal.

4.d. Upload a sample PDF document

7.png

4.e. Drag and move the area selector to the target area of the document

8.png

4.f. Define a name for the region and then click 'Add to JSON'

9.png

4.g. Repeat this process for all target regions of the document.

4.h. Copy the generated JSON data into your clipboard

10.png

4.i. Go back to Microsoft Flow. On the 'Extract Text Regions' action, click the 'Switch to input entire array' icon

11.png

4.j. Paste the JSON data copied in step 4.h into the 'Text Regions' field

12.png
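For orientation, the JSON pasted into 'Text Regions' is an array with one object per region. The Python sketch below builds such an array and checks that each rectangle is well formed before it is pasted into the action. The property names (`Name`, `LowerLeftXCoordinate` and so on), the region names and the coordinate values are all assumptions made for illustration; the JSON produced by the 'Text Region Generator' is the authoritative format and may differ.

```python
import json

# Hypothetical regions. The property names and numbers are assumptions for
# illustration, not output copied from the 'Text Region Generator'.
text_regions = [
    {
        "Name": "InvoiceNumber",
        "LowerLeftXCoordinate": 400.0,
        "LowerLeftYCoordinate": 700.0,
        "UpperRightXCoordinate": 550.0,
        "UpperRightYCoordinate": 720.0,
        "PageNumber": 1,
    },
    {
        "Name": "InvoiceDate",
        "LowerLeftXCoordinate": 400.0,
        "LowerLeftYCoordinate": 670.0,
        "UpperRightXCoordinate": 550.0,
        "UpperRightYCoordinate": 690.0,
        "PageNumber": 1,
    },
]

REQUIRED_KEYS = (
    "Name",
    "LowerLeftXCoordinate",
    "LowerLeftYCoordinate",
    "UpperRightXCoordinate",
    "UpperRightYCoordinate",
    "PageNumber",
)


def validate_region(region):
    """Return a list of problems found in one region (an empty list means OK)."""
    missing = [key for key in REQUIRED_KEYS if key not in region]
    if missing:
        return ["missing keys: " + ", ".join(missing)]
    problems = []
    if not str(region["Name"]).strip():
        problems.append("name is empty")
    if region["LowerLeftXCoordinate"] >= region["UpperRightXCoordinate"]:
        problems.append("lower-left X must be less than upper-right X")
    if region["LowerLeftYCoordinate"] >= region["UpperRightYCoordinate"]:
        problems.append("lower-left Y must be less than upper-right Y")
    if region["PageNumber"] < 1:
        problems.append("page number must be 1 or greater")
    return problems


def validate_regions(regions):
    """Validate every region and flag duplicate names, keyed by region name."""
    report = {}
    seen = set()
    for region in regions:
        name = str(region.get("Name", "<unnamed>"))
        problems = validate_region(region)
        if name in seen:
            problems.append("duplicate region name")
        seen.add(name)
        if problems:
            report[name] = problems
    return report


report = validate_regions(text_regions)
if report:
    print("Fix these regions first:", report)
else:
    # Safe to paste into the 'Text Regions' field
    print(json.dumps(text_regions, indent=2))
```

A check like this is only a convenience for hand-edited JSON. If the array comes straight from the generator it should not be needed.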

5. We now need to obtain a sample of the JSON data returned by the action, which will enable us to add further actions that parse and use it.

5.a. Test the Flow using your preferred method, click 'Save & Test'

13.png

5.b. For this example I selected 'I'll perform the trigger action', which I invoked by manually uploading a PDF invoice to the SharePoint library configured in the trigger (step 3)

5.c. Once the Flow has executed, open the 'Extract Text Regions' action and copy the 'Simple Text Region Results' JSON returned.

14.png

NOTE: If you have submitted a large file, Flow may display the outputs differently and prompt you to download the output manually. See the example below:

16.png

Should this occur, you'll need to download the payload manually and locate the 'Simple Text Region Results' property. You'll also need to remove any escape characters ('\') using either a text/code editor or an online service such as https://www.freeformatter.com/json-escape.html
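If you would rather use a script than an online service, the escape characters can be removed by decoding the JSON twice: once for the downloaded payload and once for the string stored inside it. This is a minimal Python sketch. The payload shape, the key name `simpleTextRegionResults` and the sample values are assumptions, so adjust them to match the file you actually downloaded.

```python
import json

# Hypothetical excerpt of a downloaded action output. The structure and the
# key name are assumptions; check them against your own payload.
downloaded_payload = r'''
{
  "body": {
    "simpleTextRegionResults": "{\"InvoiceNumber\":\"INV-0042\",\"InvoiceDate\":\"01/03/2020\"}"
  }
}
'''


def extract_results(payload_text, key="simpleTextRegionResults"):
    """Return the region results as a dict, with the escaping removed."""
    outer = json.loads(payload_text)   # first decode: the payload itself
    escaped = outer["body"][key]       # still a string containing JSON
    return json.loads(escaped)         # second decode: strips the backslashes


results = extract_results(downloaded_payload)

# Clean JSON, ready to paste into 'Use sample payload to generate schema'
print(json.dumps(results, indent=2))
```

Decoding with a JSON parser avoids the mistakes that creep in when backslashes are deleted by hand, for example removing one that belongs to the data.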

6. Add a 'Parse JSON' action

6.a. Content: Select the 'Simple Text Region Results' property from the 'Extract Text Regions' action

17.png

6.b. Click 'Use sample payload to generate schema'

18.png

6.c. Paste the 'Simple Text Region Results' obtained in step 5.c into the text-area control, click 'Done'

19.png
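To see roughly what 'Use sample payload to generate schema' produces, the Python sketch below infers a minimal JSON Schema from a sample object. It illustrates the idea only and is not how Microsoft Flow implements the feature; the region names and values in the sample are hypothetical.

```python
import json


def infer_schema(value):
    """Infer a minimal JSON Schema fragment from a sample value."""
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(item) for key, item in value.items()},
        }
    if isinstance(value, list):
        schema = {"type": "array"}
        if value:
            # Assume every element looks like the first one
            schema["items"] = infer_schema(value[0])
        return schema
    if isinstance(value, bool):  # must be tested before int: bool is an int subclass
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if value is None:
        return {"type": "null"}
    return {"type": "string"}


# Hypothetical 'Simple Text Region Results' sample: one key per named region
sample = {
    "InvoiceNumber": "INV-0042",
    "InvoiceDate": "01/03/2020",
    "Total": "1,250.00",
}

schema = infer_schema(sample)
print(json.dumps(schema, indent=2))
```

Because every extracted region is text, each property in this sample is inferred as a string. The property names in the schema are what later appear as dynamic content from the 'Parse JSON' action.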

7. Add a 'Get file metadata using path' action

7.a. Site Address: Set as per step 3.

7.b. File Path: Select the 'File path' property from the 'When a file is created in a folder' action.

20.png

8. Add an 'Update File Properties' action

8.a. Site Address: Set as per step 3.

8.b. Library Name: Set as per the library name contained within the 'Folder Id' property of step 3.

8.c. Id: Select the 'ItemId' property from the 'Get file metadata using path' action

21.png

8.d. Map data from the 'Parse JSON' action to the relevant fields

22.png
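Extracted values arrive as text, so if the library uses Date or Number columns the values may need tidying before they are mapped. The Python sketch below shows the kind of clean-up involved. The column names, the `dd/mm/yyyy` date format and the sample values are assumptions, and in a real Flow the equivalent logic would be written with Flow expressions rather than Python.

```python
from datetime import datetime
from decimal import Decimal


def clean_text(raw):
    """Collapse line breaks and repeated spaces left over from extraction."""
    return " ".join(raw.split())


def to_iso_date(raw, fmt="%d/%m/%Y"):
    """Convert an extracted date string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(clean_text(raw), fmt).date().isoformat()


def to_amount(raw):
    """Drop currency codes and thousands separators, keep digits, '.' and '-'."""
    digits = "".join(ch for ch in raw if ch.isdigit() or ch in ".-")
    return Decimal(digits)


# Hypothetical output of the 'Parse JSON' action
parsed = {
    "InvoiceNumber": " INV-0042\n",
    "InvoiceDate": "01/03/2020",
    "Total": "GBP 1,250.00",
}

# Hypothetical library columns mapped to cleaned values
properties = {
    "InvoiceNumber": clean_text(parsed["InvoiceNumber"]),
    "InvoiceDate": to_iso_date(parsed["InvoiceDate"]),
    "Total": to_amount(parsed["Total"]),
}

print(properties)
```

If all target columns are 'Single line of text', the values can be mapped directly without any conversion.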

9. Test the Flow by using data from the previous run

23.png

10. Validate the flow run has successfully executed

24.png

11. Validate data has been extracted and added as document metadata correctly

25.png

While this example has focused on extracting document data and then setting SharePoint document metadata, once the data has been extracted you can use it in any other action that Microsoft Flow supports.

Thanks for reading!

Comments
Anonymous

Hi Jay,

 

Thanks for this helpful post. However, in step 4, I cannot find File Name or File Content in the dynamic content drop-down. Have those names changed since your post?

 

Thanks

Hadi

Hi Hadi,

No, the dynamic data in Step 4 is obtained from the SharePoint trigger action.

Can you either create a post or contact Encodian support providing full details of your Flow so I can help?

Thanks Jay

@Jay-Encodian  You don't mention with a single word in your opening post that the method you describe requires a paid third party tool. Which your company builds. 

 

So, this is advertising.

 

Not useful. Rather annoying.

@teylyn1 The Encodian connector is available under a free plan, so you can do this without incurring any cost unless you have a higher throughput requirement which would require a paid plan. 

Encodian have thousands of M365 tenants utilizing the free plan so it must be useful for some

Fair call, that this could / should be highlighted at the start of the post

If anyone wants to extract data from a PDF or image without training a model for select documents, try this new GPT data extraction method: https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-G...

 

It doesn’t require specifying certain document areas, wordings, styles, etc. It just OCRs the file, converts it to a replica text (txt), and passes it to a GPT prompt where you can ask GPT to do whatever you want with the document data.

About the Author
  • Encodian Owner / Founder - Ex Microsoft Consulting Services - Architect / Developer - 20 years in SharePoint - PowerPlatform Fan