Solved: Extracting PDF metadata and document info

Gunsmoke125 · ‎05-12-2020

Is there a way to extract metadata from PDF files? Specifically, I need to extract the PDF document's Keywords property.

To be clear, I don’t need to extract text from the body of the PDF file; I need to extract text from the document properties.

Here’s a method that uses Python, but I need to do this in a flow.

https://stackoverflow.com/questions/59909520/extracting-the-keywords-from-pdf-metadata-in-python

Gunsmoke125 · ‎05-14-2020

For those of you looking for a way to extract keywords from PDF metadata, here’s a workaround until something more elegant comes along.

PDF files (at least the newer versions) store the keywords, along with other metadata, in plain text within the file. If you open a PDF in a text editor such as Notepad, you’ll find both an embedded XML section (close to the end of the file) and a proprietary section that holds the various metadata attributes.
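As a rough illustration of that observation, the Python sketch below scans a file's raw bytes for the two places described above. The function name `find_metadata_markers` and the sample bytes are made up for this example, and it assumes the embedded XML section is an XMP packet that starts with `<x:xmpmeta`. PDFs that compress or encrypt their objects will not expose these markers in plain text, so treat a missing marker as inconclusive.

```python
def find_metadata_markers(pdf_bytes: bytes) -> dict:
    """Return the byte offset of each marker, or -1 if it is not found.

    Assumption: the keywords sit in an uncompressed, unencrypted part of
    the file, as in the PDFs described in this thread.
    """
    markers = {
        "info_keywords": b"/Keywords",   # entry in the proprietary metadata section
        "xmp_packet": b"<x:xmpmeta",     # assumed start of the embedded XML section
    }
    return {name: pdf_bytes.find(marker) for name, marker in markers.items()}


# Hypothetical, heavily simplified stand-in for a real PDF file.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Keywords (alpha;beta) /Title (Demo) >>\nendobj\n"
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>\n'
    b"%%EOF"
)
offsets = find_metadata_markers(sample)
```

With a real file you would pass in `open(path, "rb").read()` instead of `sample`.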

To extract the keywords (or any other metadata you might be after), I put the following solution together. It works well.

I’m working from a directory on a file server, but this will work from SharePoint as well.

First, get the content of your file.
Next, get the location within the file where the keywords reside:

add(int(indexof(string(body('Get_file_content_using_path')),'/Keywords')),11)

You can see I’ve added 11 characters to the index (the location of the term “/Keywords”). This skips the 9 characters of “/Keywords” itself plus the space and opening parenthesis that precede the keywords.
Next, locate the end of the Keywords section. In my file, this is denoted by the term “/Title”, which starts the title metadata:

int(indexof(string(body('Get_file_content_using_path')),'/Title'))
Next, get the length of the keywords so they can be extracted in the following steps using the substring function:

sub(sub(outputs('Get_IndexOf_Title_'),2),outputs('Get_IndexOf_Keywords'))
You’ll note that I am subtracting 2 from the index of the term “/Title”. This removes the closing parenthesis and space that follow the keywords.
Now extract the keywords and split them into an array (be sure to initialize the variable before setting it):

split(substring(string(body('Get_file_content_using_path')),outputs('Get_IndexOf_Keywords'),outputs('Get_KeywordLength')),';')

To access the keywords, reference them using variables('Keywords')[0], where [0] is the position in the array of the keyword that you want.
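For anyone who wants to check the index arithmetic outside a flow, here is a minimal Python sketch of the same steps. It is not part of the flow itself; the function name `extract_keywords` and the sample string are hypothetical, and it assumes, as in the file described above, that the keywords are written as `/Keywords (a;b;c)` immediately followed by ` /Title`.

```python
def extract_keywords(content: str,
                     start_marker: str = "/Keywords",
                     end_marker: str = "/Title") -> list:
    """Mirror the flow: indexOf, add 11, subtract 2, substring, split."""
    marker_index = content.find(start_marker)
    end_index = content.find(end_marker)
    if marker_index == -1 or end_index == -1:
        return []  # one of the markers is missing

    # Skip "/Keywords" (9 characters) plus the space and "(" after it.
    start = marker_index + 11
    # Step back over the ")" and space that sit in front of "/Title".
    length = (end_index - 2) - start
    if length <= 0:
        return []  # e.g. "/Title" appears before "/Keywords" in this file

    return content[start:start + length].split(";")


# Hypothetical fragment laid out like the metadata section described above.
sample = "<< /Keywords (invoice;2020;approved) /Title (Demo) >>"
keywords = extract_keywords(sample)
print(keywords[0])  # first keyword, like variables('Keywords')[0] in the flow
```

The fixed offsets of 11 and 2 only hold for this exact layout. If a file puts a different entry after the keywords, or puts the title first, the end marker has to change accordingly.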

v-alzhan-msft · ‎05-13-2020

Hi @Gunsmoke125 ,

Sorry, I couldn't find any built-in function that extracts PDF metadata directly at the moment.

Where is the PDF file saved?

Could you please try saving the PDF in a SharePoint library and then use the "Get file metadata" action of the SharePoint connector to see whether the Keywords you mentioned can be accessed?

Best Regards,

Alice

Community Support Team _ Alice Zhang

Jay-Encodian · ‎05-13-2020

Hey @Gunsmoke125

Encodian provide an action but I don't think it will cover the keywords: https://support.encodian.com/hc/en-gb/articles/360002949358-Get-PDF-Document-Information

Could you send a sample document to support@encodian.com? The team will review it and give feedback.

HTH

Jay

Gunsmoke125 · ‎05-14-2020

Thank you Alice,

I am using a file server. However, I tested the approach you suggested using SharePoint, and the PDF keyword metadata wasn't available through this route either.

Gunsmoke125 · ‎05-14-2020

Thanks Jay,

Yes, I was looking at Encodian as a potential solution and created a trial account, but you are correct: while most of the metadata elements can be extracted using Encodian, the Keywords metadata wasn't one of them.

Jay-Encodian · ‎05-18-2020

Hey @Gunsmoke125

Keywords is now included in the 'Get PDF Document Information' action. It's with Microsoft for deployment and will be available in all regions within 2 weeks.

HTH

Jay

andreashaffter · ‎11-08-2020

Hey there

@Gunsmoke125 thanks for this workflow!

I'm having a hard time writing this content back to my document properties.

The aim of my workflow is to write the title and author back to the property columns called "Title" and "Autor", but I always get an error message saying that it can't find them.

Could you help me implement the workflow to get the desired result?

Best regards