Gunsmoke125
Frequent Visitor

Extracting PDF metadata and document info

Is there a way to extract metadata from PDF files? Specifically, I need to extract the PDF document's Keywords property.

 

To be clear, I don’t need to extract text from the body of the PDF file; I need to extract text from the document properties.

 

Here’s a method that uses Python, but I need to do this in a flow.

https://stackoverflow.com/questions/59909520/extracting-the-keywords-from-pdf-metadata-in-python

 

1 ACCEPTED SOLUTION

Gunsmoke125
Frequent Visitor

For those of you looking for a way to extract keywords from PDF metadata, here’s a workaround until something more elegant comes along.

PDF files (at least the newer versions) store the keywords, along with other metadata, in plain text within the file. If you open a PDF in a text editor such as Notepad, you’ll find both an embedded XML section (close to the end of the file) and a section in PDF’s own syntax that holds the various metadata attributes (entries such as /Keywords and /Title).
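Before building the flow, it can help to confirm that a given file really does expose its keywords as plain text, since the approach below depends on it. Here is a minimal Python sketch of that check. The sample bytes and the function name `has_plaintext_keywords` are made up for illustration, and the check assumes the same "/Keywords (" layout (term, space, opening parenthesis) that the steps below rely on.

```python
# Hypothetical sample: a metadata section stored as plain, uncompressed text.
SAMPLE = b"<< /Author (Jane Doe) /Keywords (invoice; 2024; draft) /Title (Q1 Report) >>"


def has_plaintext_keywords(data: bytes) -> bool:
    """Return True if a literal '/Keywords (' entry is visible in the raw bytes."""
    return b"/Keywords (" in data


print(has_plaintext_keywords(SAMPLE))
```

For a real file, the bytes could come from something like `pathlib.Path("example.pdf").read_bytes()` (the file name is a placeholder). If the check returns False, the string-search approach in this post is unlikely to find anything in that file.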

To extract the keywords (or any other metadata you might be after), I put the following solution together. It works well.

I’m working from a directory on a file server, but this will work from SharePoint as well.

  1. Get the content of your file. The expressions below assume this action is named “Get file content using path”.
  2. Next, get the location within the file where the keywords start. Later expressions reference this step as outputs('Get_IndexOf_Keywords').
    add(int(indexof(string(body('Get_file_content_using_path')),'/Keywords')),11)

    You can see I’ve added 11 characters to the index (the location of the term “/Keywords”). The term itself is 9 characters long, and the extra 2 skip the space and opening parenthesis that come before the keywords.
  3. Next, locate the end of the keywords section. In my file, this is marked by the term “/Title” for the title metadata. Later expressions reference this step as outputs('Get_IndexOf_Title_').
    int(indexof(string(body('Get_file_content_using_path')),'/Title'))
  4. Next, get the length of the keywords, so they can be extracted in the following step using the substring function. The next step references this value as outputs('Get_KeywordLength').
    sub(sub(outputs('Get_IndexOf_Title_'),2),outputs('Get_IndexOf_Keywords'))
    You’ll note that I am subtracting 2 from the index of the term “/Title”. This removes the closing parenthesis and space that follow the keywords.
  5. Now extract the keywords and split them into an array (be sure to initialize the variable before setting it).
    split(substring(string(body('Get_file_content_using_path')),outputs('Get_IndexOf_Keywords'),outputs('Get_KeywordLength')),';')

To access the keywords, reference them with variables('Keywords')[0], where [0] is the position in the array of the keyword you want.
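For anyone who wants to check the index arithmetic outside a flow, the steps above can be mirrored in a few lines of Python. This is only a sketch of the same logic, not part of the flow itself. The function name `extract_keywords` and the sample string are invented for illustration, and the sketch assumes the same layout as my file: "/Keywords (", then the keywords, then ") /Title".

```python
def extract_keywords(content: str) -> list[str]:
    """Mirror steps 2 to 5 of the flow on a string holding the file content."""
    # Step 2: index of "/Keywords" plus 11
    # (9 characters for the term, 1 space, 1 opening parenthesis).
    start = content.index("/Keywords") + 11
    # Step 3: index of "/Title", which marks the end of the keywords in this layout.
    title_index = content.index("/Title")
    # Step 4: subtract 2 to drop the closing parenthesis and the space,
    # then subtract the start index to get the length.
    length = (title_index - 2) - start
    # Step 5: take the substring and split it on ";".
    return content[start:start + length].split(";")


# Hypothetical metadata fragment in the assumed layout.
sample = "<< /Keywords (alpha;beta;gamma) /Title (Demo) >>"
keywords = extract_keywords(sample)
print(keywords)     # ['alpha', 'beta', 'gamma']
print(keywords[0])  # alpha
```

One difference from the flow: Python's `str.index` raises a ValueError when the term is missing, whereas indexOf in a flow expression returns -1, so the flow would carry on with a wrong index unless you add a condition to check for that.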



9 REPLIES
v-alzhan-msft
Community Support

Hi @Gunsmoke125 ,

 

Sorry, I can't currently find any Microsoft function that extracts PDF metadata directly.

Where is the PDF file saved?

Could you please try saving the PDF in a SharePoint library, and then use the "Get file metadata" action of the SharePoint connector to see whether the Keywords you mentioned can be accessed?

 

Best Regards,

Alice

 

Community Support Team _ Alice Zhang

Jay-Encodian
Community Champion

Hey @Gunsmoke125 

Encodian provides an action, but I don't think it will cover the keywords: https://support.encodian.com/hc/en-gb/articles/360002949358-Get-PDF-Document-Information

Could you send a sample document to support@encodian.com? The team will review it and give feedback.

HTH

Jay

Gunsmoke125
Frequent Visitor

Thank you Alice,

I am using a file server; however, I tested the approach you suggested using SharePoint, but the PDF keyword metadata wasn't available through this route either.

Thanks Jay,

 

Yes, I was looking at Encodian as a potential solution and created a trial account, but you are correct: while most of the metadata elements can be extracted using Encodian, the Keywords metadata wasn't one of them.

Hey @Gunsmoke125 

Keywords is now included within the 'Get PDF Document Information' action... it's with Microsoft for deployment and will be available in all regions within 2 weeks

HTH

Jay

andreashaffter
New Member

Hey there

 

@Gunsmoke125   thanks for this workflow!

I'm having a hard time writing this content back to my document properties.

 

The aim of my workflow is to write the title and author back to the property columns called "Title" and "Autor". But there is always an error message saying that it can't find them.

 

Could you help me to implement the workflow to get the requested results?

 

Best regards 

Hi @andreashaffter ,

 

Could you share any error messages you might be getting when you run the flow?

 

Kindest Regards

 

DJ

 

 

mersancanonigo
Frequent Visitor

Sadly, this doesn't work with PDFs that have both horizontal and vertical pages.
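The thread doesn't say why the approach fails on such files. One possible cause (an assumption, not confirmed here) is that “/Title” does not directly follow the keywords in those PDFs, which would break the length calculation in step 4. A variant that stops at the closing parenthesis instead of at “/Title” would not depend on the order of the entries. Below is a minimal Python sketch of that idea; the function name `extract_keywords_until_paren` and the sample string are hypothetical.

```python
def extract_keywords_until_paren(content: str) -> list[str]:
    """Read the keywords up to the first ')' after '/Keywords ('."""
    marker = "/Keywords ("
    start = content.find(marker)
    if start == -1:
        return []  # no plain-text keywords entry in this layout
    start += len(marker)
    end = content.find(")", start)
    if end == -1:
        return []  # no closing parenthesis found
    # Split on ";" and drop surrounding spaces and empty entries.
    return [k.strip() for k in content[start:end].split(";") if k.strip()]


# Hypothetical fragment where /Title comes before /Keywords.
sample = "<< /Title (Demo) /Keywords (alpha; beta) /Producer (X) >>"
print(extract_keywords_until_paren(sample))
```

This sketch stops at the first closing parenthesis, so it would cut off keywords that themselves contain parentheses. In a flow, the same idea would mean searching for ')' in the text that follows the keywords index rather than searching for '/Title'.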
