Anonymous
Not applicable

Need Help Automating PDF Text Extraction

This may not be the correct board. If so, please let me know and I'll make adjustments as necessary.

 

I work in an office that receives monthly invoices for our company's billing. We get our invoices as PDFs formatted as continuous tables. The data is laid out roughly like this:

 

-member ID#, NAME - new line

-Claim number, member ID# again - new line

-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...

 

Currently I'm reviewing these documents by hand with my little human eyeballs and fingers. It's not terribly slow, but I have literally hundreds of these and each can contain upwards of 50 claims.

 

Specifically, my goal is this: (somehow) scan the PDF document with RegEx (or something to similar effect) and extract the text from the three lines demonstrated above while filtering out the "noise" (everything else). The desired data range will always start on page 2 and end three pages before the last page of the PDF.
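To make the goal concrete, here is a minimal Python sketch of the idea, not the poster's actual solution. The real invoices are confidential, so the sample text, the field formats (numeric member ID, upper-case name, MM/DD/YYYY dates, space-separated values) and the names `SAMPLE`, `RECORD` and `extract_records` are all assumptions made up for illustration.

```python
import re

# Invented sample standing in for text already extracted from a PDF.
# Every value and the exact layout are assumptions, not real invoice data.
SAMPLE = """\
Some header noise
12345, SMITTY MCGEE
C-0001, 12345
01/23/2022 01/23/2022 A1 A1 A1 987 2 100.98 90.00 10.00 80.00
Page footer noise
"""

# One match covers the three consecutive lines of a claim record.
RECORD = re.compile(
    r"^(?P<member_id>\d+),\s*(?P<name>[A-Z][A-Z .'-]*)\n"      # member ID, name
    r"(?P<claim>[\w-]+),\s*(?P=member_id)\n"                   # claim no., same ID
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P=date)\s+(?P<rest>.+)$",  # date twice, rest
    re.MULTILINE,
)

def extract_records(text):
    """Return one dict per three-line record; everything else is ignored."""
    records = []
    for m in RECORD.finditer(text):
        fields = m.group("rest").split()
        if len(fields) != 9:
            continue  # not the layout this sketch assumes
        records.append({
            "member_id": m.group("member_id"),
            "name": m.group("name"),
            "claim": m.group("claim"),
            "date": m.group("date"),
            "type_code": fields[0],
            "units": fields[4],
            "requested": fields[5],
            "authorized": fields[6],
            "deducted": fields[7],
            "paid": fields[8],
        })
    return records
```

Calling `extract_records(SAMPLE)` yields a single record for the invented member; the header and footer lines never match, which is the "noise filtering" part.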

 

I'm using FileCenter to manage my documents. FileCenter Pro provides a neat tool for OCR text extraction similar to what I want. I managed to set up a little "Demo" module to prove the concept, and it worked in microcosm, but I couldn't get it to export the data into a file or into a program.

 

Ultimately these extracted data need to be placed into an Excel document so I can do some other data related operations on them.

 

The target PDFs contain sensitive data so I can't share them.

 

Any ideas how to go from selecting a PDF to dropping specific text from the PDF into an Excel doc while pruning all the unnecessary stuff? (or a list object in PAD?)

1 ACCEPTED SOLUTION

Accepted Solutions
Anonymous
Not applicable

Update:

 

I found the solution through RegEx and some highly specialized RegEx patterns developed by a friend for this particular case. For anyone with a similar need to take apart the contents of a PDF, then create a new variable out of the combined index value of each list of variables, please continue reading.

 

Step 1. Get all files in the designated target folder.

 

Step 1a. Make sure your page ranges are correct. I had to set mine to start +1 page from page 1 (target page 2) and end -3 pages from the last page (target page n-3, not the last three pages).
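The page arithmetic in Step 1a is easy to get off by one, so here is a small Python sketch of it. The function name `data_page_range` and its parameters are hypothetical; only the "+1 from page 1, -3 from the last page" rule comes from the post.

```python
def data_page_range(page_count, skip_first=1, skip_last=3):
    """Return (first_page, last_page), 1-based and inclusive.

    With the defaults this targets page 2 through page n-3,
    where n is the total number of pages in the PDF.
    """
    first_page = 1 + skip_first
    last_page = page_count - skip_last
    if last_page < first_page:
        raise ValueError("PDF is too short to contain a data range")
    return first_page, last_page
```

For a 10-page PDF this gives pages 2 through 7, so pages 8, 9 and 10 are left out.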

 

Step 2. Create a list for your top-level unique value; mine is the member's name.

->Create new List: %YourListVar%

 

Step 3. Start a For Each loop, For %CurrentItem% in %Files%

 

Step 3a. Assign the PageStart# and PageEnd# to your page range variables. This is now the entire range of your data.

 

Step 4. Extract text from PDF inside the For Each loop from %PageStart#% to %PageEnd#% into %YourPDFTextVar%

 

Step 5. Use Parse text on %YourPDFTextVar% with a RegEx (YourRegExPattern) to find the text you want. Store the RegEx matches in a variable of your choice. My first RegEx match is the member name, so mine is %Names%.

 

Step 5a. For each item %CurrentNames% in %Names%, strip out any junk data or extra string content, trim the resulting string, and add it to %YourListVar%.
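Steps 5 and 5a can be sketched outside PAD like this. It is only an illustration in Python: the `NAME:` label, the pattern and the function name `collect_names` are assumptions, since the real patterns were never posted.

```python
import re

def collect_names(pdf_text, pattern, junk=("NAME:",)):
    """Find every match of `pattern`, strip junk tokens, trim, and collect."""
    names = []
    for match in re.findall(pattern, pdf_text):
        cleaned = match
        for token in junk:
            cleaned = cleaned.replace(token, "")  # drop label text around the value
        cleaned = cleaned.strip()                 # trim leading/trailing whitespace
        if cleaned:
            names.append(cleaned)
    return names

# Hypothetical extracted text; the real layout will differ.
text = "NAME:  Smitty McGee  \nsome noise\nNAME: Jane Doe\n"
names = collect_names(text, r"NAME:[^\n]*")
```

The list `names` plays the role of %YourListVar% after the For Each loop has finished.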

 

Step 6. Repeat steps 4 and 5 or 5-5a until your data is satisfactory.

 

Step 7. Use a Loop from index 0 to %YourListVar.count - 1% (this is always the upper-bound index of the list).

 

Step 8. Inside the Loop perform an operation. In my case I am writing these values to an Excel book one at a time.

Step 8a (write to Excel book). Use Write to Excel worksheet with the value %YourListVar[ListIndex]% (or whichever list variable you want to cycle through), in Column A and row %FirstFreeRow% of %ExcelInstance%.

 

For any other uses, in your PAD Command block where it asks for a value enter %YourListVar[ListIndex]%

Example of iterating your list with a Display message box: "Message to display: %YourListVar[ListIndex]%"

 

You will now see your Excel doc fill up with all the values from %YourListVar[ListIndex]%.

 

The purpose of the Loop is to get the index number of each item in turn. The index must always refer to an item that exists in the target list. A list of 2 items (indexes 0 and 1) cannot be asked for index 2, because there is no item at index 2.
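The same index rule, shown as a rough Python sketch. The function name `rows_for_column_a` is made up, and the (row, value) tuples only stand in for the Excel writes in Step 8a; nothing here touches a real workbook.

```python
def rows_for_column_a(values, first_free_row=1):
    """Pair each list item with the worksheet row it would be written to."""
    rows = []
    for list_index in range(len(values)):  # runs from 0 to count - 1
        rows.append((first_free_row + list_index, values[list_index]))
    return rows

names = ["Smitty McGee", "Jane Doe"]
rows = rows_for_column_a(names)

# Asking for index 2 of a two-item list fails, just as described above.
try:
    names[2]
    out_of_range = False
except IndexError:
    out_of_range = True
```

Looping to `count - 1` rather than `count` is what keeps the last iteration inside the list.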

 

%Item[RowNumber]%

 

Hope that helps anyone stuck with a similar issue.


9 REPLIES 9
Henrik_M
Super User

That sounds like a job for the AI Builder:

AI Builder— Intelligent Automation | Microsoft Power Automate

Anonymous
Not applicable

I'll give it a swing and see what I can make happen with that.

 

Thank you very much, @Henrik_M !

UK_Mike
Post Prodigy

Not sure how complex the pdfs are but...

 

 

Screenshot 2022-04-08 175209.png

Anonymous
Not applicable

I got some RegEx expressions built to capture the desired data.

 

In investigating @Henrik_M 's proposed solution, I discovered that AI Builder is a premium feature and that there's some setup required to get it off the ground and flying. I've already familiarized myself with PAD, so I'll continue to work on my solution based in PAD.

 

I've created a flow that targets a folder, grabs all the .pdfs, searches the content of each one within a page range supplied through a text input, then spits out all the found matches into two running lists.


I'm hoping you can aid me here on this one,

I need to now pair up the Match 1 (name) result with the Match 2 (date) result, I.E.: "Smitty McGee", "1/23/45"

 

I'm really new to manipulating strings and lists of text. Arrays have always baffled me, but I'm not completely unfamiliar with some of the concepts.
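For the pairing question, the underlying idea is that item i of the first list belongs with item i of the second. A minimal Python sketch follows; `pair_matches` is a hypothetical name, and it assumes both lists were built in the same order and have the same length.

```python
def pair_matches(names, dates):
    """Pair the i-th name with the i-th date."""
    if len(names) != len(dates):
        # A length mismatch means a match was missed somewhere upstream.
        raise ValueError("lists differ in length; matches cannot be paired by index")
    return list(zip(names, dates))

pairs = pair_matches(["Smitty McGee"], ["1/23/45"])
```

In PAD terms this is the same as reading both lists with one shared index inside a single Loop.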

 

@UK_Mike , thank you for the encouragement. And the jokes! I needed that laugh really bad.

"

-member ID#, NAME - new line

-Claim number, member ID# again - new line

-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...

"

 

Probably best to type out what these actually look like, such as...

 

Member ID: Mike

Claim type code: abc123

Date of claim: 9/4/22

Dollar amount requested: $100.98

ETC...

 

"

I'm hoping you can aid me here on this one,

I need to now pair up the Match 1 (name) result with the Match 2 (date) result, I.E.: "Smitty McGee", "1/23/45"

"

 

Each of these will be an individual variable for us to write to Excel inside a "For Each" loop targeting the PDF folder.

That is, each iteration pulls 5, 6, 7, etc. separate variables holding the values from the current item (PDF).

Basically they are already matched...

 

var 1 =  Mike

var 2  = abc123

var 3  = 9/4/22

var 4  = $100.98

 

Slightly different on my end.

After each loop I write to Excel rather than holding the values in a list for a later Excel write.

Let's say I'm after 10 values from each loop: if all values are populated into their respective variables, they get written at the end of the loop and the current PDF gets moved to a new folder, "Processed".

If just one value isn't found, the current loop is skipped and that particular PDF gets moved to a new folder, "Unprocessed".

At the end of the flow, if the Unprocessed folder's file count is >= 1, then an email is sent to me calling me really really bad names 🙄
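Roughly, that routing logic looks like the Python sketch below. It is an illustration under assumptions, not the actual flow: the function names `file_result` and `needs_attention` are invented, the Excel write and the email are left out, and the folder names are taken from the description above.

```python
import shutil
from pathlib import Path

def file_result(pdf_path, values, root):
    """Move the PDF to root/Processed if every value was found,
    otherwise to root/Unprocessed. Returns True when complete."""
    complete = all(v not in (None, "") for v in values.values())
    target_dir = Path(root) / ("Processed" if complete else "Unprocessed")
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf_path), str(target_dir / Path(pdf_path).name))
    return complete

def needs_attention(root):
    """True when at least one PDF ended up in the Unprocessed folder."""
    unprocessed = Path(root) / "Unprocessed"
    return unprocessed.is_dir() and any(unprocessed.glob("*.pdf"))
```

The caller would write the row to Excel only when `file_result` returns True, and send the notification at the end of the run when `needs_attention` is True.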

 

Nice write up though, well done 👏

Junzeng
New Member

We receive claims (invoices) from disputes with Walmart, at the million-dollar level, and they will not stop coming. Each claim's invoice PDF contains only the total invoice amount plus a long text listing of BOL shipment IDs, which might run to 180 rows across 5-6 pages. After extracting the BOL numbers, the next step is an HTTP call to the Walmart website to download item details, such as each item's refund amount, item ID, date, etc.

 

I'm almost there: an email triggers the flow, the PDF is copied into SharePoint, and the PDF is converted to an extracted, structured JSON file (an object, not an array).

The JSON parses successfully. Now I need to run a "FOR EACH" loop to get each BOL number.

 

I am not a Logic Apps expert and am still learning the f(x) functions, and the project is very urgent, so I came here to ask for help.

 

I need "text" and "path". The JSON file is structured as object > elements (array) > attributes. item() is not an array, so how do I supply the elements[] array as the previous-step output that the FOR-EACH loop iterates over? Or is my direction wrong?
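One way to think about it: the loop has to be pointed at the `elements` array inside the object, not at the root object itself. Below is a Python sketch of that idea. The sample JSON, the lowercase key names `text` and `path`, and the function name `text_and_path` are assumptions based only on the description above; the real file may use different casing or nesting.

```python
import json

# Invented sample in the described shape: object > elements (array) > attributes.
SAMPLE = (
    '{"elements": ['
    '{"path": "//Document/P", "text": "BOL 0001"},'
    '{"path": "//Document/Figure"},'
    '{"path": "//Document/P[2]", "text": "BOL 0002"}'
    ']}'
)

def text_and_path(raw_json):
    """Loop over the elements array and collect (text, path) pairs."""
    document = json.loads(raw_json)          # the root is an object, not an array
    pairs = []
    for element in document.get("elements", []):
        text = element.get("text")
        if text is None:
            continue                          # skip elements with no text
        pairs.append((text, element.get("path")))
    return pairs

pairs = text_and_path(SAMPLE)
```

In a cloud flow the equivalent would probably be to give the loop an expression such as `body('Parse_JSON')?['elements']`, where `Parse_JSON` is an assumed action name, and then read each attribute from the current item inside the loop.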

 

Thanks in advance if anyone can help.

takolota
Multi Super User

You can also now use this template to extract data from PDFs without any Regex using GPT: 

https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-G...
