pjwilson87
Frequent Visitor

Web Scraping - Capture URL

Hey PAD folks,

 

I am new to PAD and trying to learn with a web scraping project.

What I am trying to do is extract the job posting title, job number, and job URL.

I am unable to capture the URL for each job posting for a few sites that use the same vendor.

 

Example:  https://blackknight.wd1.myworkdayjobs.com/BKC

 

I only see 2 options:

  1. Right click on each job posting title, copy URL, and paste to Excel spreadsheet (in same row as the job posting name and number).
  2. Click on each job posting title, which would open a new tab, and extract page URL of job description (as well as job posting title and job number).

 

I can't seem to figure out how to left or right click each job posting - tried using a for loop unsuccessfully.

Are either of these options possible? Or is there a better solution?

 

I appreciate any assistance/ideas.

 

Thanks,

PJ

1 ACCEPTED SOLUTION

Accepted Solutions
PAG
Helper I

Here is a solution if you don't mind using JavaScript
https://power.automate.gallery/s7KioqNuknHe

The challenge was getting the URL, since it is only added to the DOM after the item is right-clicked. This solution uses JavaScript to get the job done.


12 REPLIES 12
MichaelAnnis
Resident Rockstar

I am not at my computer, but isn’t there a “Get URL” command that’s relevant to either a web instance or a window?

 

I feel like I have done this before, but can’t remember exactly how.  

 

Another option: after running your flow that has the active browser instance variable, find the instance in the variables list, right-click it (or click the three dots) and choose View. There you should see the properties of the browser instance. Does any of these properties hold the URL? If so, you can access it with the format %BrowserInstance.URL%, if .URL is the property name.

Best of Luck!

I am unable to use the "Get URL" command as the URL does not appear in the HTML code.

It only appears in the HTML code after I right-click a job posting link and choose 'Copy URL'.

 

Is there a way to do that in Power Automate?

Hi pjwilson87

 

I've drawn up a simple example, which works for me. Give it a try:

extractURL.png

You'll have to add the extraction of the Job ID and Title yourself, but that should be pretty simple with the loop in place, just adding some "Get details of a UI element" and tweaking the selectors.

 

I don't know if this works, but here's a copy of the flow you can try and insert into PAD: https://codeshare.io/6pmol0

 

The flow extracts the total number of posts at the top, sends a few "End" key presses to load the entire list, then loops through the list, right-clicking each element and clicking "Copy URL".

You should include the additional extractions in the loop, as well as writing the values to your table.

 

Give it a try and let me know if you have any issues or questions.

Daniel_Pa: Pay attention if you copy-paste flow actions to codeshare.io, it will reveal all flow images (could be a screenshot of your Windows desktop etc.).

We have created a service https://power.automate.gallery for this purpose. Our service automatically wipes out all your images when you copy-paste a flow. Maybe you could try it?

It seems to give me an error each time on Line 8 (Focus text field in window).

 

Not sure if this is possible, but would it be easier to extract from JSON?

 

I see the URLs for each job posting in 'commandLink':

 

JSON - commandLink.png

 

Is it possible to scrape the 'commandLink' for each job posting? If so, would that be simpler?

Maybe this is what @MichaelAnnis was referring to, but not sure how to do that.


shindomo
Continued Contributor

Hello @pjwilson87 

 

You can use the action "Invoke web service" with GET method to get the query results as JSON on this particular website.
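For readers who want to see the same request outside PAD, here is a minimal Python sketch of a GET that asks for JSON through the Accept header. It assumes the site still answers with JSON when asked this way, as described in this post; the helper names `build_json_request` and `fetch_json` are made up for illustration.

```python
import json
import urllib.request

# Assumption: the endpoint returns JSON when the Accept header asks for it,
# as described in the post above.
JOBS_URL = "https://blackknight.wd1.myworkdayjobs.com/BKC"


def build_json_request(url: str) -> urllib.request.Request:
    """Build a GET request that asks the server for JSON instead of HTML."""
    return urllib.request.Request(
        url,
        headers={"Accept": "application/json"},
        method="GET",
    )


def fetch_json(url: str, timeout: float = 30.0):
    """Send the request and decode the JSON body (this performs a network call)."""
    with urllib.request.urlopen(build_json_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_json(JOBS_URL)` would return the parsed response, which plays the role of the custom object used in the rest of this post.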

 

shindomo_2-1633307465416.png

 

After you convert the JSON to a custom object in PAD, you can retrieve the commandLink element from the custom object as below:

%JsonAsCustomObject['body']['children'][0]['children'][0]['listItems'][0]['title']['commandLink']%

 

Please replace the index number after "listItems" (the [0] immediately following it) with your LoopIndex.
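To make the path easier to follow, here is a small Python sketch of the same lookup. The `SAMPLE` data is hypothetical and only mirrors the nesting of the path above (a real response has many more fields), and `command_link_at` is an illustrative name, not a PAD action.

```python
# Hypothetical sample shaped like the path in the post; the link values are made up.
SAMPLE = {
    "body": {
        "children": [
            {
                "children": [
                    {
                        "listItems": [
                            {"title": {"commandLink": "/BKC/job/example-1"}},
                            {"title": {"commandLink": "/BKC/job/example-2"}},
                        ]
                    }
                ]
            }
        ]
    }
}


def command_link_at(payload: dict, index: int) -> str:
    """Return the commandLink of the listItems entry at the given position."""
    items = payload["body"]["children"][0]["children"][0]["listItems"]
    return items[index]["title"]["commandLink"]
```

Here `index` is the only part that changes between job postings, which is why it is the one to swap for the loop variable.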

 

I have created the following sample flow, and it works fine:

Web.InvokeWebService Url: $'''https://blackknight.wd1.myworkdayjobs.com/BKC''' Method: Web.Method.Get Accept: $'''application/json''' ContentType: $'''application/json''' ConnectionTimeout: 30 FollowRedirection: True ClearCookies: False FailOnErrorStatus: False EncodeRequestBody: True UserAgent: $'''Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.21) Gecko/20100312 Firefox/3.6''' Encoding: Web.Encoding.AutoDetect AcceptUntrustedCertificates: False ResponseHeaders=> WebServiceResponseHeaders Response=> WebServiceResponse StatusCode=> StatusCode
Variables.ConvertJsonToCustomObject Json: WebServiceResponse CustomObject=> JsonAsCustomObject
SET BaseUrl TO $'''https://blackknight.wd1.myworkdayjobs.com'''
Variables.CreateNewList List=> commandLinkUrl
LOOP LoopIndex FROM 0 TO 49 STEP 1
    Variables.AddItemToList Item: BaseUrl + JsonAsCustomObject['body']['children'][0]['children'][0]['listItems'][LoopIndex]['title']['commandLink'] List: commandLinkUrl NewList=> commandLinkUrl
    DISABLE Display.ShowMessage Message: JsonAsCustomObject['body']['children'][0]['children'][0]['listItems'][LoopIndex]['title']['commandLink'] Icon: Display.Icon.None Buttons: Display.Buttons.OK DefaultButton: Display.DefaultButton.Button1 IsTopMost: False ButtonPressed=> ButtonPressed
END
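The loop above runs a fixed 0 to 49, which assumes the response holds exactly 50 items. As a rough sketch of the same logic in Python (not PAD script), the version below walks however many items are present and prefixes each link with the base URL. The sample data and the name `collect_job_urls` are hypothetical.

```python
# Hypothetical sample with the same nesting as the response; the link values are made up.
SAMPLE = {
    "body": {
        "children": [
            {
                "children": [
                    {
                        "listItems": [
                            {"title": {"commandLink": "/BKC/job/example-1"}},
                            {"title": {"commandLink": "/BKC/job/example-2"}},
                            {"title": {"commandLink": "/BKC/job/example-3"}},
                        ]
                    }
                ]
            }
        ]
    }
}

BASE_URL = "https://blackknight.wd1.myworkdayjobs.com"


def collect_job_urls(payload: dict, base_url: str) -> list:
    """Build the full URL of every job posting found in listItems."""
    items = payload["body"]["children"][0]["children"][0]["listItems"]
    # Iterating over the list itself avoids an index error on short result sets.
    return [base_url + item["title"]["commandLink"] for item in items]
```

In PAD the equivalent would be to loop up to the item count minus one instead of a hardcoded 49.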

 

shindomo_0-1633307293628.png

 

As a result, you get a list of URLs, as in the screenshot below. 🙂

 

shindomo_1-1633307380664.png

 

Please try it.

Thank you.

@shindomo - I think I am missing something simple in my 'Add item to list' because it seems to pull in the whole JSON and does not look like your screen shot (which would be great).

 

I noticed that in my loop, the 'Add item to list' action has an apostrophe before the brackets after 'JsonAsCustomObject'. Your screenshot does not have these apostrophes (including before/after LoopIndex within the brackets).

What am I missing?

 

Add item to list.png

 

Thanks @PAG & @shindomo for the replies! These are helpful!

pjwilson87
Frequent Visitor

Hey @PAG

 

Is it possible to search 'remote' in the search box of the website, click search button, and then scrape everything?

In addition, adding another column with the fixed value 'remote' for each row?

shindomo
Continued Contributor

Hello @pjwilson87 

 

Percent signs (%) cannot be nested. Keep only the outermost pair of percent signs (%).

 

In the following syntax, variable names are written in blue text, but it is not necessary to enclose each variable in percent signs (%).

 

%BaseUrl + JsonAsCustomObject['body']['children'][0]['children'][0]['listItems'][LoopIndex]['title']['commandLink']%

 

shindomo_0-1634423952000.png

 

See the following document (although it doesn't go into much detail):

Variable manipulation and the % notation - Power Automate | Microsoft Docs

 

Thank you.

@shindomo Is it possible to scrape all the results this way (the page has infinite scrolling) or would this method not work for that?

 

Thanks for the previous fix! That worked perfect. I came across that document but still could not figure it out.

shindomo
Continued Contributor

Hello @pjwilson87 

 

There is a paging mechanism implemented in this site to retrieve 50 items at a time, but I don't know anything more about it because the specification for generating GET requests is not clear.

 

As far as I can tell from the F12 DevTools, when you scroll down and reach the bottom of the page, it fires a GET request to fetch additional data from the web server, and it keeps doing so until it reaches the count of search query results.

 

[0] https://blackknight.wd1.myworkdayjobs.com/BKC/?clientRequestID=d018b6384bd84ecebcdb3c49ccdc96bd
[1] https://blackknight.wd1.myworkdayjobs.com/BKC/fs/searchPagination/318c8bb6f553100021d223d9780d30be/50?clientRequestID=4fa7a76c3ccf47ed8e5dfe7e3f7a08a0
[2] https://blackknight.wd1.myworkdayjobs.com/BKC/20/searchPagination/318c8bb6f553100021d223d9780d30be/100?clientRequestID=a2129477633e4502a44ef43339d9125f
[3] https://blackknight.wd1.myworkdayjobs.com/BKC/21/searchPagination/318c8bb6f553100021d223d9780d30be/150?clientRequestID=afa4a6476a2648aa81c328e46bc20497

 

I can tell it calls a "searchPagination" method; however, I cannot tell how to generate the parts of the URLs shown in red above. Also, "clientRequestID" changes every time it fires a GET request.
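One part that does look predictable is the trailing number in the searchPagination URLs (50, 100, 150), which appears to be an offset growing by 50 per request. As a hedged sketch in Python, the function below only computes those offsets from a total result count. It does not reconstruct the other URL segments or the clientRequestID, which remain unknown; `page_offsets` and its parameters are illustrative names.

```python
def page_offsets(total_results: int, page_size: int = 50) -> list:
    """Offsets at which each page of results starts: 0, 50, 100, ...

    Assumption: the site pages in fixed blocks of `page_size` items,
    as the 50/100/150 suffixes in the captured URLs suggest.
    """
    if total_results < 0:
        raise ValueError("total_results must not be negative")
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total_results, page_size))
```

For example, a search with 175 results would need requests starting at offsets 0, 50, 100 and 150.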

 

Since this issue is not related to PAD itself, perhaps you should find another place to discuss this particular topic. 🙂

 

Thank you.
