naelaiman
Frequent Visitor

Extract Text From Structured PDF

Hi RPA Community,

I have a PDF file that I want to extract text from. The PDF is in a structured form, and the output text file should follow that structure. Can someone advise which approach I should use to get the correct output?

 

Here are the sample PDF and the desired text format once extracted:

https://pktgroup-my.sharepoint.com/:u:/p/nael_rashid/EfGkG-1n0KRAuHfFlbuV0q8BgPINNKLcVgNOfzcUqXk2lw?... 

 

Appreciate your time and assistance,

Thanks and Regards,

Nael

 

21 REPLIES 21
yoko2020
Responsive Resident

Hi @yoko2020 ,

Thank you for your suggestion. Does this mean I need to rely on AI Builder or another third-party service to get the extracted text for my situation?

I was hoping there is a way to get my desired output using the actions available in PAD.

 

Thanks and Regards,

Nael

I never use the parse/regex action or the Extract text from PDF action in PAD when dealing with invoice, sales order, or custom form document (PDF/image) extraction. I always use third-party software specialized for this purpose.

 

Things to consider when dealing with this kind of work:

1. Does the document always come as a text PDF?

2. What happens if the document comes as an image PDF?

3. Are we dealing with >=1000 documents per month or just 10 documents per month?

4. What if one document contains multiple invoices that need to be separated?

    See this video for what I mean by document separation/invoice splitting:

     https://www.youtube.com/watch?v=9fFjQn_E8dI

5. Sometimes an invoice spans multiple pages, so we face a variable number of invoice pages that need to be processed.

 

 

Most of this software can handle invoice splitting, except Power Automate AI Builder.

If you only process a small quantity, you can try the built-in PAD actions, but take note of those 5 points or else your project will get stuck in the future.

Ahammad_Riyaz
Super User

Hi @yoko2020 

If the PDF has a constant header, you can use regex directly.

First, use the Extract text from PDF action.

Then use Parse text with a regex based on the required data.

 

Regards

Ahammad Riyaz

--------------------------------------------------------------------------------
If this post helps answer your question, please click on “Accept as Solution” to help other members find it more quickly. If you thought this post was helpful, please give it a Thumbs Up.

UK_Mike
Post Prodigy

"The PDF will be in a structured form and the output text file should follow the structure accordingly"

This makes no sense; surely you're just pulling particular values rather than the whole PDF text?

 

If particular values, yes it can be done wholly in PAD...

 

Screenshot 2022-03-29 140513.png

Hi @yoko2020 ,

I previously used regex and string manipulation to extract data from this PDF format. However, that was in Automation Anywhere (AA), which can extract text in a structured format, so it was easy for me to extract the data line by line with string conditions. I have now had to migrate to PAD, and I find that its text extraction does not behave the same as AA's, so the result is different from what I expected. Here is the output I got from PAD's Extract PDF to Text action:

https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?...

Let me answer your questions:

1- Yes, this file will always come in readable PDF format, as it is generated by a system.

2- If an image PDF comes through, the extraction will return nothing, so error handling should be able to cover that case.

3- Yes, we are dealing with 1000+ documents per month.

4- No, there won't be any invoices combined together, as the system generates 1 invoice per order.

5- If there are multiple pages, I should still be able to extract all the necessary information, provided the text output is in a structured format.
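The error handling mentioned in point 2 could look roughly like the sketch below. It is written in Python rather than as PAD actions, and the function name `needs_ocr_fallback` and the `min_chars` threshold are hypothetical choices for illustration, not anything PAD provides.

```python
def needs_ocr_fallback(extracted_text, min_chars=20):
    """Return True when extraction produced too little text to be a text PDF.

    An image-only PDF typically yields an empty or whitespace-only string,
    so the text is stripped before its length is compared to the threshold.
    The threshold of 20 characters is an arbitrary assumption.
    """
    return len(extracted_text.strip()) < min_chars


# Hypothetical extraction results for two documents.
image_pdf_text = "   \n  "
text_pdf_text = "Invoice No: INV-0001 Invoice Date: 29/03/2022"

for name, text in [("scan.pdf", image_pdf_text), ("invoice.pdf", text_pdf_text)]:
    if needs_ocr_fallback(text):
        print(f"{name}: no usable text, route to manual review or OCR")
    else:
        print(f"{name}: text found, continue with parsing")
```

In a flow, the same check would sit right after the extraction step, so that empty results are logged or diverted instead of failing later in the parsing stage.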

 

Thanks and Regards,

Nael

Hi @Ahammad_Riyaz ,

Yes, the PDF has a constant header, which repeats if the invoice has multiple pages. I tried this approach, but the result I get from "Extract PDF to Text" makes it hard to apply regex or string operations. This is probably due to the format of the PDF, which is why the text result is cluttered and the labels and values do not come in pairs. Here is the output I got from PAD's Extract PDF to Text: https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?...

Is there any way for me to extract the text without using third-party applications or AI Builder?

 

Thanks and Regards,

Nael

Hi @UK_Mike ,

Yes, I want to pull particular values, but the result I get from Extract PDF to Text is not organized, and applying regex or string manipulation is difficult because the values don't seem to come in pairs for this PDF file. If the extracted text followed the same layout as the PDF, I could extract the invoice details as well as the item details. Here is the text output I get from PAD. As you can see, the values are hard to tell apart. https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?...

There are plenty of fields that I want to extract and evaluate, and if the extracted text comes in this format it will be quite troublesome to extract the details. I was hoping to find a solution that does not rely on third-party applications or AI Builder, as they would incur additional cost and there are 1000+ PDF invoices to process per month.

 

Thanks and Regards,

Nael

@Ahammad_Riyaz 

 

Yes, I know that technique very well.

But I never use that method; it wastes time and means double work in the future when dealing with >2000 documents and 100 vendors (document layouts).

@naelaiman 

 

What PAD version do you use? It looks like the Extract text from PDF action has a bug: it does not keep indentation.

In that case you have to go for AI Builder; you can train it and use it for different vendors.

 

 

Yes, I already finished that a long time ago, but not using AI Builder.

Well, I extract the PDFs to a variable; not sure what you're extracting to?

Line numbers cannot be relied on: one PDF could have the invoice date on line 20, the next PDF on line 21. Too unreliable.

Once I have the PDF text in a variable, I parse that variable with regex.

One regex per required value: Date, Amount, Customer, etc.

The resulting variables from the parse then get written to Excel.
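The one-regex-per-value approach described above can be sketched outside PAD as follows. This is a minimal Python illustration, not PAD's Parse text action; the sample text, the field labels, and the patterns are all hypothetical, since the real extracted text is not reproduced in this thread.

```python
import re

# Hypothetical extracted text; the real PAD output for this PDF will differ.
SAMPLE_TEXT = """\
Invoice No: INV-0001
Invoice Date: 29/03/2022
Customer: Example Sdn Bhd
Total Amount: 1,234.50
"""

# One pattern per required value; each captures the value in group 1.
# The labels and value formats are assumptions for illustration only.
FIELD_PATTERNS = {
    "date": re.compile(r"Invoice Date:\s*(\d{2}/\d{2}/\d{4})"),
    "amount": re.compile(r"Total Amount:\s*([\d,]+\.\d{2})"),
    "customer": re.compile(r"Customer:\s*(.+)"),
}


def extract_fields(text):
    """Return a dict of field name -> matched value, or None if not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


fields = extract_fields(SAMPLE_TEXT)
print(fields)
```

Because each pattern searches the whole text for its label, the result does not depend on which line a value lands on, which is the point made above about line numbers being unreliable.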

I'm sure others have their own way of doing it, but I'm trying to avoid AI Builder or any third parties.

I see that the latest PAD update caters for PDF tables. I have not tried it yet; scared of updating 😂

I have a look now and again at the AI Builder forum; it is not the Holy Grail of PDF data extraction.

I see @Ahammad_Riyaz referred to this method; it works for me.

 

@yoko2020, care to elaborate on what software you use? More than one? Costs?

 

Mike


yoko2020
Responsive Resident

@UK_Mike 

 

Chronoscan Advanced version + Nuance Plug-Ins https://www.chronoscan.org/

ABBYY® FlexiCapture® https://www.abbyy.com/flexicapture/

Artsyl’s docAlpha https://www.artsyltech.com/products/docAlpha


Sorry @yoko2020, as soon as I posted I saw your post further up mentioning the software you use.

Thanks x 2 😂

yoko2020
Responsive Resident

np.

 

You can also test these 2 services if you want to get a headache every minute. 😂

 

Azure Form Recognizer
Amazon Textract

Ermmmmmmmmmmmm...................... no thanks 😂

Hi @yoko2020 ,

I'm using PAD version 2.18.146.22083, the free version (not licensed). If this is a bug, then I can get support for this issue. Can you share the Extract PDF to Text result that you get from the file I shared?

 

Thanks and Regards,

Nael

 

Hi @UK_Mike ,

I'm still in the research phase of this project, trying out the PDF extraction action. Previously I used Automation Anywhere (AA), where the extract PDF to text command does not store the extracted data in a variable but writes it to a text file. I could then extract the details line by line and create conditions based on a counter variable.

From what I can see, the Extract PDF to Text action in PAD extracts all the details into a variable. I can convert that variable to a list and then run my regex and string operations to find the necessary details in each line. Please correct me if I'm wrong.

The issue is that when I use Extract PDF to Text in PAD, the indentation of the PDF is removed and the text is placed on a new line instead. This makes the data inconsistent even though all the PDFs are in the same format.
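The list-based approach, combined with values being pushed onto a new line, can be sketched as below. This is a Python illustration under stated assumptions: the sample text and labels are hypothetical, and it assumes a displaced value lands on the line immediately after its label, which may not hold for the actual PAD output.

```python
def split_lines(text):
    """Split extracted text into a list of non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def value_for_label(lines, label):
    """Find `label` and return its value from the same line or the next line."""
    for index, line in enumerate(lines):
        if not line.startswith(label):
            continue
        # Drop the label, then any separating colon and spaces.
        remainder = line[len(label):].lstrip(" :").strip()
        if remainder:
            return remainder         # value shares the label's line
        if index + 1 < len(lines):
            return lines[index + 1]  # value was pushed onto the next line
    return None


# Hypothetical extracted text showing both layouts.
SAMPLE_TEXT = """\
Invoice No:
INV-0001

Invoice Date: 29/03/2022
"""

lines = split_lines(SAMPLE_TEXT)
print(value_for_label(lines, "Invoice No"))
print(value_for_label(lines, "Invoice Date"))
```

Searching by label rather than by a fixed line counter keeps the lookup working when a value moves between the label's line and the following one.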

Does this issue only happen to me? Can you share your Extract PDF to Text result from PAD so that I can confirm my situation?

 

Thanks and Regards,

Nael
