    How to extract text out of an Image-based PDF?

    04-26-2023 09:17 PM - last edited 04-26-2023 09:24 PM

    Multi Super User VJR

    As of this writing, desktop flows have no single action that extracts text directly from an image-based PDF, so a different approach is needed.

    This article explains how to do it step by step.

     

    i) Consider a sample input PDF with the two pages below.

    Both pages are image-based, so the text cannot be selected or copied.

     

    VJR_0-1682565839825.png

     

     

    ii) Add the 'Extract images from PDF' action.

     

    VJR_1-1682565998845.png

    Follow the numbering in the screenshot above.

    1. Select the image-based source PDF file.

    2. Choose 'All' if you want to extract every page of the PDF.

    3. Enter a prefix for the extracted images. For example, with the prefix 'Img' shown above, the extracted images are named Img_0, Img_1 and so on.

    4. Choose a folder where the extracted images will be saved.
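
    The naming rule in item 3 can be sketched in Python, outside the flow itself. The helper name `expected_image_names` and the `.png` extension are assumptions made for illustration; the article only shows the Img_0, Img_1 pattern.

```python
from typing import List


def expected_image_names(prefix: str, count: int, extension: str = ".png") -> List[str]:
    """Return the file names the prefix rule would give for `count` images.

    The extension is an assumption; the article only shows the Img_0, Img_1 stems.
    """
    return [f"{prefix}_{index}{extension}" for index in range(count)]


print(expected_image_names("Img", 2))  # ['Img_0.png', 'Img_1.png']
```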

     

    iii) Get all the files from the folder chosen in item 4 above.

    (You can store the path in a variable and use it in both steps.)

    VJR_2-1682566262180.png

     

    iv) The output variable produced by the previous step is named Files.

    Use a 'For each' loop to iterate through the files in this folder.
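
    For readers who think in code, steps iii and iv correspond roughly to the Python sketch below. It is an illustration, not the flow's implementation: the function name `list_extracted_images` is hypothetical, and the numeric sort is an addition of the sketch so that Img_10 would come after Img_9 rather than after Img_1.

```python
import re
from pathlib import Path
from typing import List


def list_extracted_images(folder: Path, prefix: str = "Img") -> List[Path]:
    """Return files named <prefix>_<number>.<ext> from `folder` in page order."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    numbered = []
    for path in folder.iterdir():
        match = pattern.match(path.stem)
        if path.is_file() and match:
            numbered.append((int(match.group(1)), path))
    # Sort on the numeric suffix, not on the file name as text.
    numbered.sort(key=lambda item: item[0])
    return [path for _, path in numbered]
```

    Looping over the returned list plays the role of the 'For each' loop, with the loop variable standing in for %CurrentItem%.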

     

    VJR_3-1682566410502.png

     

     

    v) Inside the loop, use the 'Extract text with OCR' action on each extracted image.

    Make sure to select 'Image on disk' and set the image path to %CurrentItem%, which holds the path of the current image on each iteration.

     

    VJR_4-1682566488692.png

     

     

    vi) As you can see above, the extracted text is saved in an output variable called 'OcrText'.

    For demonstration purposes we will be writing this variable to a text file.

     

    VJR_5-1682566691476.png

     

    Follow the numbering above:

    1. The full path to an output text file.

    2. The output text variable from step v. I have also added a dotted line to act as a separator between the outputs of the two pages of the PDF.

    3. Make sure to append the content rather than overwrite it.
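
    The logic of steps v and vi (OCR each image, then append the text and a separator to one file) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `append_ocr_results` and `SEPARATOR` are hypothetical names, and the OCR engine is passed in as a plain function because the article does not say which engine the flow uses.

```python
from pathlib import Path
from typing import Callable, Iterable

SEPARATOR = "-" * 40  # stands in for the dotted separator line


def append_ocr_results(
    images: Iterable[Path],
    output_file: Path,
    ocr: Callable[[Path], str],
) -> None:
    """Run `ocr` on each image and append its text plus a separator."""
    for current_item in images:
        ocr_text = ocr(current_item)  # plays the role of the OcrText variable
        # Mode "a" appends, so text from earlier pages is not overwritten.
        with output_file.open("a", encoding="utf-8") as handle:
            handle.write(ocr_text + "\n" + SEPARATOR + "\n")
```

    Opening the file in append mode mirrors item 3 above: with overwrite, only the last page's text would survive the loop.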

     

    vii) If you followed along, the entire flow and its variables look like this.

     

    VJR_6-1682567000571.png

     

     

    viii) On running the flow, three files will be created in the given output path.

    Notice the Img prefix for the two extracted images and one single Output text file.

     

    VJR_7-1682567178866.png

    ix) The output file shows the text of both pages of the PDF with a separator between them.

    Both pages are now in text format.

     

    Note that this article does not explain how to retrieve a specific part of the extracted text, for example the word 'England'.
    That has to be done with regex or text functions.

    The good part is that content that was previously locked in images is now available as text.
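
    As a rough idea of what such a regex step could look like, here is a small Python sketch. The sample string and the 'Country:' label are made up for illustration and are not the OCR output of the attached PDF.

```python
import re

# Made-up sample text, not the article's actual OCR output.
ocr_text = "Country: England\nCity: London"

# Whole-word search, case-insensitive to tolerate OCR capitalisation quirks.
match = re.search(r"\bEngland\b", ocr_text, flags=re.IGNORECASE)
print(match.group(0) if match else "not found")  # England

# Capture whatever follows a known label on the same line.
label_match = re.search(r"Country:\s*(.+)", ocr_text)
country = label_match.group(1).strip() if label_match else None
```

    OCR output often contains stray spaces or misread characters, so patterns usually need to be more forgiving than they would be for clean text.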

     

    VJR_8-1682567370310.png

     

    x) A sample desktop flow and image-based PDF are attached.

    Change the paths and try it out.

    This flow was built using version 2.31.

     

    xi) How to copy-paste any desktop flow from its raw form:

     

    Copy Paste Desktop Flows.gif

     

     

        

     

    Extract Text from Image based PDF.zip