Anonymous
Not applicable

Need Help Automating PDF Text Extraction

This may not be the correct board. If so, please let me know and I'll make adjustments as necessary.

 

I work in an office that receives monthly invoices for our company's billing. The invoices arrive as PDFs formatted as continuous tables. The data is laid out roughly like so:

 

-member ID#, NAME - new line

-Claim number, member ID# again - new line

-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...
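A three-line layout like the one above could be matched with a single multi-line pattern. Since the real invoices were not shared, the sample text, the ID formats (M1001, C-555001) and every group name below are made up for illustration; treat this as a rough sketch to adapt, not the pattern that was eventually used.

```python
import re

# Hypothetical sample text: the real invoice format is unknown.
SAMPLE = """\
Invoice summary - page 2
M1001 SMITTY MCGEE
C-555001 M1001
01/23/2022 01/23/2022 T10 T10 T10 778899 2 100.98 90.00 5.00 85.00
Subtotal for member
M1002 JANE DOE
C-555002 M1002
02/03/2022 02/03/2022 T20 T20 T20 778900 1 50.00 50.00 0.00 50.00
"""

# One record = member line, claim line, detail line (assumed formats).
RECORD = re.compile(
    r"^(?P<member_id>M\d+) (?P<name>[A-Z][A-Z ]+)\n"
    r"(?P<claim_no>C-\d+) (?P=member_id)\n"
    r"(?P<claim_date>\d{2}/\d{2}/\d{4}) \d{2}/\d{2}/\d{4} "
    r"(?P<type_code>\S+) \S+ \S+ (?P<ref_id>\d+) (?P<units>\d+) "
    r"(?P<requested>[\d.]+) (?P<authorized>[\d.]+) "
    r"(?P<deducted>[\d.]+) (?P<paid>[\d.]+)$",
    re.MULTILINE,
)


def extract_claims(text):
    """Return one dict per matched three-line record; other lines are ignored."""
    return [match.groupdict() for match in RECORD.finditer(text)]


claims = extract_claims(SAMPLE)
```

Lines that do not fit the pattern (headers, subtotals) are simply skipped, which is what filters out the "noise".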

 

Currently I'm reviewing these documents by hand with my little human eyeballs and fingers. It's not terribly slow, but I have literally hundreds of these and each can contain upwards of 50 claims.

 

Specifically, my goal is this: (somehow) scan the PDF document with RegEx (or something to similar effect) and extract the text from the three lines demonstrated above while filtering out the "noise" (everything else). The desired data range will always start on page 2 and end 3 pages before the last page of the PDF.
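The page window described above (start on page 2, stop 3 pages before the last page) is easy to get off by one, so here is a small sketch of the arithmetic. The function name and the `skip_front` / `skip_back` parameters are made up for this example.

```python
def data_page_range(total_pages, skip_front=1, skip_back=3):
    """Return (first_page, last_page), 1-based and inclusive.

    Assumes the data starts right after `skip_front` leading pages and
    ends `skip_back` pages before the final page.
    """
    first_page = 1 + skip_front
    last_page = total_pages - skip_back
    if last_page < first_page:
        raise ValueError("PDF is too short to contain a data range")
    return first_page, last_page


first_page, last_page = data_page_range(10)
```

For a 10-page PDF this gives pages 2 through 7, so pages 1, 8, 9 and 10 are left out.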

 

I'm using FileCenter to manage my documents. FileCenter Pro includes a neat tool for OCR text extraction that does something similar to what I'm after. I managed to set up a little "Demo" module to prove the concept and it worked on a small scale, but I couldn't get it to export the data into a file or into another program.

 

Ultimately these extracted data need to be placed into an Excel document so I can do some other data related operations on them.

 

The target PDFs contain sensitive data so I can't share them.

 

Any ideas how to go from selecting a PDF to dropping specific text from the PDF into an Excel doc while pruning all the unnecessary stuff? (or a list object in PAD?)

9 REPLIES
Henrik_M
Super User

That sounds like a job for the AI Builder:

AI Builder— Intelligent Automation | Microsoft Power Automate

Anonymous
Not applicable

I'll give it a swing and see what I can make happen with that.

 

Thank you very much, @Henrik_M !

UK_Mike
Post Prodigy

Not sure how complex the pdfs are but...

 

 

[Attached image: Screenshot 2022-04-08 175209.png]

Anonymous
Not applicable

I got some RegEx expressions built to capture the desired data.

 

In investigating @Henrik_M 's proposed solution I discovered that AI Builder is a premium feature and that there's some setup required to get it off the ground and flying. I've already familiarized myself with PAD, so I'll continue to work on my PAD-based solution.

 

I've created a flow that targets a folder, grabs all the .pdfs, searches the content of each one within a page range supplied through a text input, then spits out all the found matches into two running lists.


I'm hoping you can aid me here on this one,

I now need to pair up the Match 1 (name) result with the Match 2 (date) result, e.g.: "Smitty McGee", "1/23/45"
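Pairing two running lists like these comes down to walking both by the same index. The sketch below shows the idea; the second name and date are invented sample data, and it assumes both lists were filled in the same order with one date per name.

```python
def pair_matches(names, dates):
    """Pair the i-th name with the i-th date."""
    if len(names) != len(dates):
        # A mismatch means some record matched one pattern but not the other.
        raise ValueError("lists have different lengths; matches are misaligned")
    return list(zip(names, dates))


names = ["Smitty McGee", "Jane Doe"]  # "Jane Doe" is invented sample data
dates = ["1/23/45", "2/3/45"]         # "2/3/45" is invented sample data
pairs = pair_matches(names, dates)
```

The length check matters: if one pattern misses a record, every later pair would silently shift by one.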

 

I'm really new to manipulating strings and lists of text. Arrays have always baffled me, but I'm not completely unfamiliar with some of the concepts.

 

@UK_Mike , thank you for the encouragement. And the jokes! I needed that laugh really bad.

"

-member ID#, NAME - new line

-Claim number, member ID# again - new line

-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...

"

 

Probably best to type out what these actually look like, such as...

 

Member ID: Mike

Claim type code: abc123

Date of claim: 9/4/22

Dollar amount requested: $100.98

ETC...

 

"

I'm hoping you can aid me here on this one,

I need to now pair up the Match 1 (name) result with the Match 2 (date) result, I.E.: "Smitty McGee", "1/23/45"

"

 

Each of these will be an individual var for us to write to Excel in a "For Each" loop targeting the PDF folder.

That is, each loop produces 5, 6, 7, etc. separate vars holding the values from the current item (PDF).

Basically, they are already matched...

 

var 1 = Mike

var 2 = abc123

var 3 = 9/4/22

var 4 = $100.98

 

Anonymous
Not applicable

Update:

 

I found the solution using RegEx, with some highly specialized patterns a friend developed for this particular case. For anyone with a similar need to take apart the contents of a PDF and then build a new value from the items at the same index in each list of matches, please continue reading.

 

Step 1. Get all files in the designated target folder.

Step 1a. Make sure your page ranges are good. I had to set mine to start 1 page after page 1 (target page 2) and end 3 pages before the last page (target page n-3, not the last three pages).

Step 2. Create a list for the top-level unique value; mine is the member's name.

->Create new list: %YourListVar%

Step 3. Start a For Each loop: For %CurrentItem% in %Files%

Step 3a. Assign PageStart# and PageEnd# to your page range variables. This is now the entire range of your data.

Step 4. Inside the For Each loop, extract text from the PDF from page %PageStart#% to %PageEnd#% into %YourPDFTextVar%

Step 5. Use Parse text on %YourPDFTextVar% with your RegEx pattern (YourRegExPattern) to find the text you want. Store the RegEx matches in a var of your choice. My first RegEx match is the member name, so mine is %Names%

Step 5a. For Each item in %Names% (%CurrentNames%), strip out any junk data or extra string content, trim the resulting string, and add %CurrentNames% to %YourListVar%.

Step 6. Repeat steps 4 and 5, or 5 and 5a, until your data is satisfactory.

Step 7. Use a Loop from index 0 to %YourListVar.count - 1% (this will always be the highest valid index), with %ListIndex% as the Loop's index variable.

Step 8. Inside the Loop, perform an operation. In my case I am writing these values to an Excel book one at a time.

Step 8a (write to Excel book). Write to Excel worksheet: write %YourListVar[ListIndex]% (or whichever list var you want to cycle through) to column A, row %FirstFreeRow% of %ExcelInstance%

 

For any other use, enter %YourListVar[ListIndex]% wherever a PAD action asks for a value.

Example of iterating your list with a Display message box: "Message to display: %YourListVar[ListIndex]%"

 

You will now see your Excel doc fill up with all the values from %YourListVar[ListIndex]%.

 

The purpose of the Loop is to get the index number of the item in question. The index number must always point at an item that exists in the target list. A list of 2 items (indexes 0 and 1) cannot be asked for index 2, because there is no item at index 2.
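For readers who find it easier to see the index loop outside of PAD, here is a rough Python equivalent of steps 7 and 8. It writes to CSV text rather than a real workbook (an assumption made only to keep the example dependency-free), and the sample names are placeholders.

```python
import csv
import io


def write_column(values, out):
    """Write each list item to its own row, walking the list by index."""
    writer = csv.writer(out)
    for index in range(len(values)):  # 0 .. len(values) - 1, the last valid index
        writer.writerow([values[index]])


names = ["Smitty McGee", "Jane Doe"]  # placeholder data
buffer = io.StringIO()
write_column(names, buffer)
csv_text = buffer.getvalue()

# Asking for index 2 of a 2-item list fails: there is no such item.
try:
    names[len(names)]
    out_of_range = False
except IndexError:
    out_of_range = True
```

The loop bound plays the same role as %YourListVar.count - 1%: it stops one short of the item count because indexes start at 0.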

 

%Item[RowNumber]%

 

Hope that helps anyone stuck with a similar issue.

Slightly different on my end.

After each loop I write to Excel rather than holding the values in a list for a later Excel write.

Let's say I'm after 10 values from each loop. If ALL values are populated to their respective variables, they get written at the end of each loop and the current PDF gets moved to a new folder, "Processed".

If just one value isn't found, the current loop is skipped and that particular PDF gets moved to a new folder, "Unprocessed".

At the end of the flow, if the Unprocessed folder file count is >= 1, then an email is sent to me calling me really really bad names 🙄
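The all-or-nothing routing described above might look something like this in Python. It is only a sketch of the idea, not the actual PAD flow: the function name, the file names and the dictionary of extracted values are invented, and the email step is reduced to a boolean.

```python
import shutil
import tempfile
from pathlib import Path


def route_pdf(pdf_path, values, processed_dir, unprocessed_dir):
    """Move the PDF to processed_dir if every value was found, else to unprocessed_dir."""
    complete = all(value not in (None, "") for value in values.values())
    target_dir = Path(processed_dir if complete else unprocessed_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf_path), str(target_dir / Path(pdf_path).name))
    return complete


# Demo with empty placeholder files in a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    full = root / "full.pdf"
    partial = root / "partial.pdf"
    full.touch()
    partial.touch()

    processed = root / "Processed"
    unprocessed = root / "Unprocessed"
    route_pdf(full, {"name": "Mike", "amount": "$100.98"}, processed, unprocessed)
    route_pdf(partial, {"name": "Mike", "amount": ""}, processed, unprocessed)

    processed_names = sorted(p.name for p in processed.iterdir())
    unprocessed_names = sorted(p.name for p in unprocessed.iterdir())

# The flow sends an email at this point; here we only compute the trigger.
needs_alert = len(unprocessed_names) >= 1
```

Writing only when every value is present keeps half-filled rows out of the workbook, at the cost of reviewing the "Unprocessed" folder by hand.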

 

Nice write up though, well done 👏

Junzeng
New Member

We receive claims (invoices) in a dispute with Walmart. It is at the million-dollar level and they keep coming. Each claim's invoice PDF contains only the total invoice amount plus a long text list of BOL shipment IDs, which might run to 180 rows across 5-6 pages. After extracting the BOL numbers, the next step will be an HTTP call to the Walmart website to download item details, such as each item's refund amount, item ID, date, etc.

 

I'm almost there: an email triggers the flow, the PDF is copied into SharePoint, and the PDF is converted to an extracted, structured JSON object file (not an array).

Parsing the JSON succeeds. Now I need to run a "For each" to loop through and get each BOL number.

 

I am not a Logic Apps expert and am still learning the f(x) functions, and the project is very urgent, so I have come here to ask for help.

 

I need "text" and "path". The JSON file is structured as object > elements (array) > attributes. item() is not an array, so how do I pass the elements[] array from the previous step's output into the For each loop? Or is my direction wrong?
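The question above did not get a direct answer in the thread. As a general illustration of the shape of the problem: a loop has to be pointed at the elements array inside the root object, not at the root object itself. The Python sketch below shows that idea only. The JSON sample, the BOL values and the assumption that "text" and "path" sit directly on each element are all guesses, since the real file was not posted.

```python
import json

# Hypothetical document: root object > "elements" array > per-element fields.
RAW = """
{
  "elements": [
    {"path": "//Document/P", "text": "BOL 1234567890"},
    {"path": "//Document/Figure"},
    {"path": "//Document/P[2]", "text": "BOL 1234567891"}
  ]
}
"""


def text_and_path(document):
    """Collect (path, text) for every element that actually has text."""
    pairs = []
    for element in document.get("elements", []):  # loop over the array, not the root
        text = element.get("text")
        if text is None:
            continue  # e.g. figures or other elements without text
        pairs.append((element.get("path"), text))
    return pairs


document = json.loads(RAW)
pairs = text_and_path(document)
```

In a flow, the equivalent move would presumably be to give the For each the elements property of the parsed output as its input, so that each iteration's item is a single element.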

 

Thanks in advance if anyone can help.

takolota
Multi Super User

You can also now use this template to extract data from PDFs without any Regex using GPT: 

https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-G...
