Anonymous
Not applicable

Need Help Automating PDF Text Extraction

This may not be the correct board. If so, please let me know and I'll make adjustments as necessary.

 

I work in an office that receives monthly invoices for our company's billing. The invoices arrive as PDFs formatted as continuous tables. The data is laid out roughly like this:

 

-member ID#, NAME - new line

-Claim number, member ID# again - new line

-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...

 

Currently I'm reviewing these documents by hand with my little human eyeballs and fingers. It's not terribly slow, but I have literally hundreds of these, and each can contain 50 or more claims.

 

Specifically, my goal is this: (somehow) scan the PDF document with RegEx (or something to similar effect) and extract the text from the three lines demonstrated above while filtering out the "noise" (everything else). The desired data range always starts on page 2 and ends three pages before the last page of the PDF.
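To make the goal concrete, here is a minimal Python sketch of matching one three-line record with a single multi-line RegEx. The real invoice layout is not shown in this thread, so the sample text, the space-separated columns, the field names, and the pattern itself are all assumptions that would need adjusting to the actual documents.

```python
import re

# Hypothetical sample text; the real invoice layout is not shown in the thread.
SAMPLE = """\
Page header noise
1234567 DOE, JANE
C000123 1234567
01/05/2022 01/05/2022 T1019 T1019 T1019 998877 4 100.00 90.00 10.00 80.00
Subtotal noise line
"""

# One pattern spanning the three lines of a record (assumed layout).
RECORD = re.compile(
    r"^(?P<member_id>\d+) (?P<name>[A-Z ,.'-]+)\n"          # line 1: member ID, name
    r"(?P<claim_no>\w+) (?P=member_id)\n"                    # line 2: claim number, same member ID
    r"(?P<date>\d{2}/\d{2}/\d{4}) \d{2}/\d{2}/\d{4} "        # line 3: claim date, twice
    r"(?P<type_code>\w+) \w+ \w+ "                           # claim type code, three times
    r"(?P<other_id>\d+) (?P<units>\d+) "                     # ID number, units
    r"(?P<requested>[\d.]+) (?P<authorized>[\d.]+) "         # dollar amounts
    r"(?P<deducted>[\d.]+) (?P<paid>[\d.]+)$",
    re.MULTILINE,
)

# Every match becomes a dict of named fields; unmatched lines are the "noise".
records = [match.groupdict() for match in RECORD.finditer(SAMPLE)]

for record in records:
    print(record["name"], record["claim_no"], record["date"], record["paid"])
```

Anchoring the pattern on all three lines at once means header and subtotal lines are skipped without any extra filtering step.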

 

I'm using FileCenter to manage my documents. FileCenter Pro provides a neat tool for OCR text extraction similar to what I want. I set up a small "demo" module to prove the concept, and it worked in microcosm, but I couldn't get it to export the data to a file or to another program.

 

Ultimately these extracted data need to be placed into an Excel document so I can do some other data related operations on them.

 

The target PDFs contain sensitive data so I can't share them.

 

Any ideas how to go from selecting a PDF to dropping specific text from the PDF into an Excel doc while pruning all the unnecessary stuff? (or a list object in PAD?)

1 ACCEPTED SOLUTION

Accepted Solutions
Anonymous
Not applicable

Update:

 

I found the solution through RegEx, using some highly specialized RegEx patterns developed by a friend for this particular case. For anyone with a similar need to take apart the contents of a PDF and then build new values from the items stored at the same index in each list of matches, please continue reading.

 

Step 1. Get all files in the designated target folder.

 

Step 1a. Make sure your page ranges are good. I had to set mine to start 1 page after page 1 (target page 2) and to end 3 pages before the last page (target page n-3, not the last three pages).

 

Step 2. Create a list for the top-level value you want to collect; mine is the member's name.

->Create new List: %YourListVar%

 

Step 3. Start a For Each loop, For %CurrentItem% in %Files%

 

Step 3a. Assign the PageStart# and PageEnd# to your page range variables. This is now the entire range of your data.

 

Step 4. Extract text from PDF inside the For Each loop from %PageStart#% to %PageEnd#% into %YourPDFTextVar%

 

Step 5. Use Parse text on %YourPDFTextVar% with your RegEx pattern to find the text you want. Store the RegEx matches in a var of your choice; my first RegEx match is the member name, so mine is %Names%.

 

Step 5a. For Each item in %Names%, strip out any junk data and extra string content, trim the resulting string into %CurrentNames%, and add %CurrentNames% to %YourListVar%.

 

Step 6. Repeat steps 4 and 5 (or 5 and 5a) until you have captured all the data you need.

 

Step 7. Use a Loop from index 0 to %YourListVar.Count - 1% (this is always the highest valid index).

 

Step 8. Inside the Loop perform an operation. In my case I am writing these values to an Excel book one at a time.

Step 8a (write to Excel book). Write to Excel worksheet: write %YourListVar[ListIndex]% (or an item from whichever list you want to cycle through) to Column A, row %FirstFreeRow% of %ExcelInstance%.

 

For any other uses, in your PAD Command block where it asks for a value enter %YourListVar[ListIndex]%

Example of iterating your list with Display message: "Message to display: %YourListVar[ListIndex]%"

 

You will now see your Excel doc fill up with all the values from %YourListVar[ListIndex]%.
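For readers who think better in code, the steps above can be sketched in Python roughly as follows. This is not how PAD runs the flow, only the same logic written out. Assumptions: each PDF is already available as a list of page-text strings (the PDF extraction itself is left out), NAME_PATTERN is a hypothetical stand-in for the real RegEx, and a CSV writer stands in for the Excel worksheet.

```python
import csv
import io
import re

# Hypothetical pattern: a line starting with a numeric member ID, then the name.
NAME_PATTERN = re.compile(r"^\d+[ \t]+(.+)$", re.MULTILINE)

def data_pages(pages):
    """Step 1a: keep pages 2 through n-3 (1-based page numbers)."""
    return pages[1:len(pages) - 3]

def collect_names(pdfs):
    """Steps 2-5a. pdfs is a list of PDFs, each a list of page-text strings."""
    names = []                                    # Step 2: the running list
    for pages in pdfs:                            # Step 3: For Each file
        text = "\n".join(data_pages(pages))       # Step 4: text of the page range
        for match in NAME_PATTERN.findall(text):  # Step 5: RegEx matches
            names.append(match.strip())           # Step 5a: trim, then add
    return names

def write_column(names, out):
    """Steps 7-8a: write one value per row, addressed by index."""
    writer = csv.writer(out)
    for index in range(len(names)):               # index 0 .. count - 1
        writer.writerow([names[index]])

# Made-up six-page document: only pages 2 and 3 fall inside the data range.
pdf = ["cover", "1234567 DOE, JANE  \nnoise", "7654321 ROE, RICHARD",
       "summary", "totals", "back"]
names = collect_names([pdf])
buffer = io.StringIO()
write_column(names, buffer)
print(buffer.getvalue())
```

Writing by index rather than iterating the list directly mirrors Step 7, which is what lets several lists be read in lockstep later on.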

 

The purpose of the Loop is to get the index number of the item in question. The index must always refer to an item that exists in the target list. A list of 2 items (indexes 0 and 1) cannot be asked for index 2, because no item exists at that index.
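The same zero-based indexing rule can be shown in a few lines of Python. This is only an analogy: PAD reports the failure in its own way, and the names below are made up for the illustration.

```python
def last_valid_index(items):
    """Highest index that can be read from a non-empty list (count - 1)."""
    return len(items) - 1

names = ["Smitty McGee", "Jane Doe"]  # hypothetical values, indexes 0 and 1

# Looping from 0 to count - 1 touches every item exactly once.
visited = [names[i] for i in range(0, last_valid_index(names) + 1)]

# Asking for one index past the end fails instead of returning a value.
try:
    names[last_valid_index(names) + 1]
    out_of_range = False
except IndexError:
    out_of_range = True

print(visited, out_of_range)
```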

 

%Item[RowNumber]%

 

Hope that helps anyone stuck with a similar issue.


9 REPLIES 9
Henrik_M
Super User

That sounds like a job for the AI Builder:

AI Builder— Intelligent Automation | Microsoft Power Automate

Anonymous
Not applicable

I'll give it a swing and see what I can make happen with that.

 

Thank you very much, @Henrik_M !

UK_Mike
Post Prodigy

Not sure how complex the pdfs are but...

 

 

Screenshot 2022-04-08 175209.png

Anonymous
Not applicable

I got some RegEx expressions built to capture the desired data.

 

In investigating @Henrik_M 's proposed solution I discovered that AI Builder is a premium feature and that some setup is required to get it off the ground and flying. I've already familiarized myself with PAD, so I'll continue to work on my solution in PAD.

 

I've created a flow that targets a folder, grabs all the .pdfs, searches the content of each .pdf within a page range given through a text input, then spits out all the found matches into two running lists.


I'm hoping you can aid me here on this one,

I now need to pair up the Match 1 (name) result with the Match 2 (date) result, e.g.: "Smitty McGee", "1/23/45"
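In code terms, pairing two running lists by position looks something like the Python sketch below. It assumes both lists were filled in the same order with exactly one match per record; the function name and the second sample row are made up for the illustration.

```python
def pair_up(names, dates):
    """Pair the item at each index of names with the item at the same index of dates."""
    if len(names) != len(dates):
        # A length mismatch means some record lost a match, so positions no longer line up.
        raise ValueError(f"{len(names)} names but {len(dates)} dates")
    return list(zip(names, dates))

names = ["Smitty McGee", "Jane Doe"]   # Match 1 results (hypothetical)
dates = ["1/23/45", "2/3/45"]          # Match 2 results (hypothetical)

for name, date in pair_up(names, dates):
    print(name, date)
```

The length check matters because a single missed match shifts every later pair by one row without any visible error.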

 

I'm really new to manipulating strings and lists of text. Arrays have always baffled me, but I'm not completely unfamiliar with some of the concepts.

 

@UK_Mike , thank you for the encouragement. And the jokes! I needed that laugh really bad.

"

-member ID#, NAME - new line

-Claim number, member ID# again - new line

-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...

"

 

Probably best to type out what these actually look like, such as...

 

Member ID: Mike

Claim type code: abc123

Date of claim: 9/4/22

Dollar amount requested: $100.98

ETC...

 

"

I'm hoping you can aid me here on this one,

I now need to pair up the Match 1 (name) result with the Match 2 (date) result, e.g.: "Smitty McGee", "1/23/45"

"

 

Each of these will be an individual var for us to write to Excel in a "For Each" loop targeting the PDF folder.

That is, each loop pulls 5, 6, 7, etc. separate vars holding the values from the current item (PDF).

Basically they are already matched...

 

var 1 =  Mike

var 2  = abc123

var 3  = 9/4/22

var 4  = $100.98

 


Slightly different on my end.

After each loop I write to Excel rather than holding the values in a list for a later Excel write.

Let's say I'm after 10 values from each loop. If all values are populated in their respective variables, they get written at the end of the loop and the current PDF gets moved to a new folder, "Processed".

If just one value isn't found, the current loop is skipped and that particular PDF gets moved to a new folder, "Unprocessed".

At the end of the flow, if the Unprocessed folder's file count is >= 1, then an email is sent to me calling me really really bad names 🙄
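As a rough illustration of that routing, here is a small Python sketch. The required field names, the function name, and the folder paths are hypothetical, and the Excel write and the email step are left out.

```python
import shutil
from pathlib import Path

# Hypothetical set of values that must all be found before a PDF counts as processed.
REQUIRED = ("name", "claim_no", "date", "paid")

def route_pdf(pdf_path, fields, processed_dir, unprocessed_dir):
    """Move the PDF to processed_dir if every required field has a value,
    otherwise to unprocessed_dir. Returns True when the PDF was processed."""
    complete = all(fields.get(key) for key in REQUIRED)
    target = Path(processed_dir if complete else unprocessed_dir)
    target.mkdir(parents=True, exist_ok=True)   # create the folder on first use
    shutil.move(str(pdf_path), str(target / Path(pdf_path).name))
    return complete

def needs_attention(unprocessed_dir):
    """True when at least one PDF ended up in the unprocessed folder."""
    folder = Path(unprocessed_dir)
    return folder.exists() and any(folder.iterdir())
```

Moving the file in both cases keeps the source folder empty after a run, so rerunning the flow never processes the same PDF twice.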

 

Nice write up though, well done 👏

Junzeng
New Member

We receive claims (invoices) from Walmart as disputes, at the million-dollar level, and they keep coming. Each claim PDF contains only the total invoice amount plus a long text list of BOL shipment IDs, which can run to 180 rows across 5-6 pages. After extracting the BOL numbers, the next step is an HTTP call to the Walmart website to download the item details, such as the refund amount per item, item ID, date, etc.

 

I'm almost there: an email triggers the flow, the PDF is copied into SharePoint, and the PDF is converted to an extracted, structured JSON file (an object, not an array).

Parsing the JSON succeeds. Now I need to run a "For each" loop to get the BOL numbers.

 

I am not a Logic Apps expert and am still learning the f(x) functions, and the project is very urgent, so I came here to ask for help.

 

I need "text" and "path". The JSON file is structured as object > elements (array) > attributes. Item() is not an array, so how do I pass the elements[] array from the previous step's output as the input of the For each loop? Or am I going in the wrong direction?
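The shape of the problem can be sketched in Python: the loop has to run over the elements array inside the object, not over the object itself. The sample JSON below is made up, and the lowercase "text" and "path" property names are taken from the description above, so their exact casing is an assumption. In a cloud flow, the analogous move would be to give the loop an expression along the lines of body('Parse_JSON')?['elements'], where Parse_JSON is an assumed action name.

```python
import json

# Hypothetical extract output: an object whose "elements" property holds the array.
RAW = """
{
  "elements": [
    {"path": "//Document/P", "text": "Invoice total 1,234.00"},
    {"path": "//Document/Table/TR/TD/P", "text": "BOL 0001234567"},
    {"path": "//Document/Figure"}
  ]
}
"""

def text_and_path(document):
    """Loop over document["elements"] and keep (path, text) for elements that carry text."""
    rows = []
    for element in document.get("elements", []):
        text = element.get("text")
        if text is None:
            continue                      # e.g. figures have no text attribute
        rows.append((element.get("path"), text))
    return rows

rows = text_and_path(json.loads(RAW))
for path, text in rows:
    print(path, "->", text)
```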

 

Thanks in advance if anyone can help.

takolota
Super User

You can also now use this template to extract data from PDFs without any Regex using GPT: 

https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-G...
