Gunsmoke125
Frequent Visitor

Extracting PDF meta data and document info

Is there a way to extract meta data from PDF files? Specifically I need to extract the PDF document keywords property.

 

To be clear, I don’t need to extract text from the PDF file, rather I need to extract text from the document properties.

 

Here’s a method that uses Python, but I need to do this in a flow.

https://stackoverflow.com/questions/59909520/extracting-the-keywords-from-pdf-metadata-in-python

 

Accepted Solutions
Gunsmoke125
Frequent Visitor

For those of you looking for a way to extract keywords from PDF metadata, here’s a workaround until something more elegant comes along.

PDF files (at least the newer versions) store the keywords, along with other metadata, in plain text within the file. If you open a PDF in a text editor such as Notepad, you’ll find both an embedded XML section (close to the end of the file) and a PDF-specific section (the document information dictionary, with entries such as /Keywords and /Title) that holds the various metadata attributes.
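
As a rough illustration of the point above, this sketch searches raw PDF bytes for the two markers: the /Keywords entry and the start of the embedded XML (XMP) packet. The sample bytes and the function name find_metadata_markers are made up for illustration and are not from the original post; Python is used only because the question links to a Python approach. Files that compress or encrypt their metadata may not show these markers in plain text.

```python
# Synthetic stand-in for raw PDF bytes (an assumption for illustration).
# A real file would be read with open(path, "rb").read().
RAW = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Keywords (alpha; beta) /Title (Demo) >>\nendobj\n"
    b"2 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
    b"<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
    b"<pdf:Keywords>alpha; beta</pdf:Keywords></x:xmpmeta>\n"
    b"endstream\nendobj\n%%EOF\n"
)


def find_metadata_markers(raw: bytes) -> dict:
    """Return the byte offset of each marker, or -1 if it is absent."""
    return {
        # Entry in the document information dictionary.
        "info_keywords": raw.find(b"/Keywords"),
        # Opening tag of the embedded XMP (XML) packet.
        "xmp_packet": raw.find(b"<x:xmpmeta"),
    }


markers = find_metadata_markers(RAW)
print(markers)
```

If either offset comes back as -1, the plain-text approach in the steps below is unlikely to work for that file.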

To extract the keywords (or any other metadata you might be after), I put the following solution together. It works well.

I’m working from a directory on a file server, but this will work from SharePoint as well.

  1. Get the content of your file (the expressions below reference this action as Get_file_content_using_path)
  2. Next, find the location within the file where the keywords start (later steps reference this action as Get_IndexOf_Keywords)
    add(int(indexof(string(body('Get_file_content_using_path')),'/Keywords')),11)

    I’ve added 11 characters to the index of the term “/Keywords”: 9 for the term itself, plus 2 to skip the space and opening parenthesis that precede the keywords.
  3. Next, locate the end of the keywords section. In my file, this is marked by the term “/Title”, which starts the title metadata (the next step references this action as Get_IndexOf_Title_)
    int(indexof(string(body('Get_file_content_using_path')),'/Title'))
  4. Next, get the length of the keywords so they can be extracted with the substring function in the following step (which references this action as Get_KeywordLength)
    sub(sub(outputs('Get_IndexOf_Title_'),2),outputs('Get_IndexOf_Keywords'))
    Note that I subtract 2 from the index of the term “/Title”. This drops the closing parenthesis and space at the end of the keywords.
  5. Now extract the keywords and split them into an array (be sure to initialize the variable before setting it)
    split(substring(string(body('Get_file_content_using_path')),outputs('Get_IndexOf_Keywords'),outputs('Get_KeywordLength')),';')

To access the keywords, reference them using variables('Keywords')[0], where [0] is the position in the array of the keyword you want.
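
For readers who want to check the index arithmetic outside a flow, here is a minimal Python sketch that mirrors steps 2 to 5. The sample string and the function name extract_keywords are assumptions for illustration; the sketch also assumes, as in my file, that “/Title” directly follows the keywords.

```python
# Synthetic stand-in for the text of a PDF file (an assumption for
# illustration); real files vary in layout and entry order.
SAMPLE = "<< /Author (Jane Doe) /Keywords (invoice; 2021; finance) /Title (Q3 report) >>"


def extract_keywords(content: str, sep: str = ";") -> list:
    # Step 2: index of "/Keywords" plus 11, which skips the 9-character
    # term, the space and the opening parenthesis.
    start = content.index("/Keywords") + 11
    # Step 3: index of "/Title", which follows the keywords in this sample.
    title_at = content.index("/Title")
    # Step 4: subtract 2 for the ") " before "/Title", then subtract start.
    length = (title_at - 2) - start
    # Step 5: take the substring and split it on the separator.
    return content[start:start + length].split(sep)


keywords = extract_keywords(SAMPLE)
print(keywords)
```

Note that splitting on “;” alone leaves the space after each separator attached to the following keyword, so you may want to trim each item before using it.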



8 REPLIES
v-alzhan-msft
Community Support

Hi @Gunsmoke125 ,

 

Sorry, I can't find any Microsoft function that extracts PDF metadata directly at the moment.

Where is the PDF file saved?

Could you please try saving the PDF in a SharePoint library, and then use the "Get file metadata" action of the SharePoint connector to see whether the keywords you mentioned can be accessed?

 

Best Regards,

Alice

 

Community Support Team _ Alice Zhang
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Jay-Encodian
Community Champion

Hey @Gunsmoke125 

Encodian provide an action but I don't think it will cover the keywords: https://support.encodian.com/hc/en-gb/articles/360002949358-Get-PDF-Document-Information

Could you send a sample document to support@encodian.com? The team will review it and provide feedback.

HTH

Jay

Gunsmoke125
Frequent Visitor

Thank you Alice,

I am using a file server; however, I tested the approach you suggested using SharePoint, but the PDF keyword metadata wasn't available through this route either.

Thanks Jay,

 

Yes, I was looking at Encodian as a potential solution and created a trial account, but you are correct: while most of the metadata elements can be extracted using Encodian, the Keywords metadata wasn't one of them.

Hey @Gunsmoke125 

Keywords is now included within the 'Get PDF Document Information' action... it's with Microsoft for deployment and will be available in all regions within 2 weeks

HTH

Jay

andreashaffter
New Member

Hey there

 

@Gunsmoke125   thanks for this workflow!

I'm having a hard time writing this content back to my document properties.

 

The aim of my workflow is to write the title and author back to the property columns called "Title" and "Autor". But I always get an error message saying it can't find them.

 

Could you help me to implement the workflow to get the requested results?

 

Best regards 

Hi @andreashaffter ,

 

could you share any error messages you might be getting when you run the flow?

 

Kindest Regards

 

DJ

 

 
