Gunsmoke125
Frequent Visitor

Extracting PDF metadata and document info

Is there a way to extract metadata from PDF files? Specifically, I need to extract the PDF document's Keywords property.

 

To be clear, I don’t need to extract text from the PDF file, rather I need to extract text from the document properties.

 

Here’s a method that uses Python, but I need to do this in a flow.

https://stackoverflow.com/questions/59909520/extracting-the-keywords-from-pdf-metadata-in-python

 

1 ACCEPTED SOLUTION

Gunsmoke125
Frequent Visitor

For those of you looking for a way to extract keywords from PDF metadata, here’s a workaround until something more elegant comes along.

PDF files (at least the newer versions) store the keywords, amongst other metadata, in plain text within the file. If you open a PDF in a text editor such as Notepad, you’ll be able to find both an embedded XML section (close to the end of the file) and a proprietary section that holds the various metadata attributes.

To extract the keywords (or any other metadata you might be after), I put the following solution together. It works well on my files.

I’m working from a directory on a file server, but this will work from SharePoint as well.

  1. Get the content of your file (the expressions below refer to this action as 'Get_file_content_using_path').
  2. Next, get the location within the file where the keywords start. The later steps refer to this output as 'Get_IndexOf_Keywords'.

    add(int(indexof(string(body('Get_file_content_using_path')),'/Keywords')),11)

    You can see I’ve added 11 characters to the index (the location of the term “/Keywords”). This skips the term itself (9 characters) plus the space and opening parenthesis that precede the keywords.
  3. Next, locate the end of the keywords section. In my file, this is denoted by the term “/Title” for the title metadata. The later steps refer to this output as 'Get_IndexOf_Title_'.

    int(indexof(string(body('Get_file_content_using_path')),'/Title'))
  4. Next, get the length of the keywords so they can be extracted in the following step with the substring function. The next step refers to this output as 'Get_KeywordLength'.

    sub(sub(outputs('Get_IndexOf_Title_'),2),outputs('Get_IndexOf_Keywords'))

    You’ll note that I am subtracting 2 from the index location of the term “/Title”. This removes the closing parenthesis and space that follow the keywords.
  5. Now extract the keywords and split them into an array (be sure to initialize the variable before setting it).

    split(substring(string(body('Get_file_content_using_path')),outputs('Get_IndexOf_Keywords'),outputs('Get_KeywordLength')),';')

To access the keywords, reference them using variables('Keywords')[0], where [0] is the position in the array of the keyword that you want to reference.
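
For anyone who wants to check the index arithmetic outside of a flow, here is a minimal Python sketch of the same five steps. It is only an illustration: the function name extract_keywords and the sample string are made up for this example, and it assumes the same layout as in my file (uncompressed metadata, keywords wrapped in parentheses, separated by ";", and immediately followed by “/Title”).

```python
def extract_keywords(pdf_text: str) -> list[str]:
    """Mirror the flow's index arithmetic on the raw text of a PDF.

    Assumes the layout described in the post: '/Keywords (a;b;c) /Title ...'.
    Returns an empty list when that layout is not found.
    """
    key_index = pdf_text.find("/Keywords")
    title_index = pdf_text.find("/Title")
    if key_index == -1 or title_index == -1:
        return []  # one of the markers is missing

    # Step 2: skip "/Keywords" (9 characters) plus the space and "(" after it.
    start = key_index + 11
    # Steps 3-4: stop 2 characters before "/Title" to drop the ")" and space.
    length = (title_index - 2) - start
    if length < 0:
        return []  # "/Title" does not follow the keywords in this file

    # Step 5: take the substring and split it on ";".
    return pdf_text[start:start + length].split(";")


# Hypothetical sample resembling the metadata section of a PDF.
sample = "<< /Author (Jane) /Keywords (alpha;beta;gamma) /Title (Demo) >>"
print(extract_keywords(sample))
```

If the metadata entries in your file appear in a different order, the “/Title” marker would need to be swapped for whichever entry follows the keywords.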



9 REPLIES 9
v-alzhan-msft
Community Support

Hi @Gunsmoke125 ,

 

Sorry, I can't find any Microsoft function that extracts PDF metadata directly at present.

Where is the PDF file saved?

Could you please try saving the PDF in a SharePoint library, and then use the "Get file metadata" action of the SharePoint connector to see if the Keywords that you mentioned can be accessed?

 

Best Regards,

Alice

 

Community Support Team _ Alice Zhang
If this post helps, please consider accepting it as the solution to help other members find it more quickly.

Jay-Encodian
Community Champion

Hey @Gunsmoke125 

Encodian provides an action, but I don't think it will cover the keywords: https://support.encodian.com/hc/en-gb/articles/360002949358-Get-PDF-Document-Information

Could you send a sample document to support@encodian.com? The team will review it and provide feedback.

HTH

Jay

Gunsmoke125
Frequent Visitor

Thank you Alice,

I am using a file server; however, I tested the approach you suggested and used SharePoint, but the PDF keyword metadata wasn't available through this route either.

Thanks Jay,

 

Yes, I was looking at Encodian as a potential solution and created a trial account, but you are correct: while most of the metadata elements can be extracted using Encodian, the Keywords metadata wasn't one of them.

Hey @Gunsmoke125 

Keywords is now included within the 'Get PDF Document Information' action... it's with Microsoft for deployment and will be available in all regions within 2 weeks

HTH

Jay

andreashaffter
New Member

Hey there

 

@Gunsmoke125   thanks for this workflow!

I'm having a hard time writing this content back to my document properties.

 

The aim of my workflow is to write the title and author back to the property columns called "Title" and "Autor". But there is always an error message saying that it can't find them.

 

Could you help me to implement the workflow to get the requested results?

 

Best regards 

Hi @andreashaffter ,

 

Could you share any error messages you get when you run the flow?

 

Kindest Regards

 

DJ

 

 

mersancanonigo
Frequent Visitor

Sadly, this doesn't work with PDFs that have both horizontal and vertical pages.
