Hey PAD folks,
I am new to PAD and trying to learn with a web scraping project.
What I am trying to do is extract the job posting title, job number, and job URL.
I am unable to capture the URL for each job posting for a few sites that use the same vendor.
I only see 2 options:
I can't figure out how to left- or right-click each job posting. I tried using a for loop, without success.
Are either of these options possible? Or is there a better solution?
I appreciate any assistance/ideas.
I am not at my computer, but isn’t there a “Get URL” command that’s relevant to either a web instance or a window?
I feel like I have done this before, but can’t remember exactly how.
Another option: after running the part of your flow that creates the browser instance variable, find the instance in the variable list, right-click the three dots, and click View. There you should see the properties of the browser instance. Does any of them hold the URL? If so, you can access it with the format %BrowserInstance.URL%, assuming .URL is the property name.
Best of Luck!
I am unable to use the "Get URL" command because the URL does not appear in the HTML code.
It only appears after I right-click a job posting link and choose 'Copy URL'; only then does it show up in the HTML code.
Is there a way to do that in Power Automate?
I've drawn up a simple example, which works for me. Give it a try:
You'll have to add the extraction of the Job ID and Title yourself, but that should be pretty simple with the loop in place, just adding some "Get details of a UI element" and tweaking the selectors.
I don't know if this works, but here's a copy of the flow you can try and insert into PAD: https://codeshare.io/6pmol0
The flow extracts the total number of posts at the top, sends a few "End"-clicks, to load the entire list, then loops through the list, right clicking on each element and clicking "Copy URL".
You should include the supplementary extracts in the loop, as well as writing the values to your table.
Give it a try and let me know if you have any issues or questions.
Daniel_Pa: Pay attention if you copy-paste flow actions to codeshare.io, it will reveal all flow images (could be a screenshot of your Windows desktop etc.).
We have created a service https://power.automate.gallery for this purpose. Our service automatically wipes out all your images when you copy-paste a flow. Maybe you could try it?
It seems to give me an error each time on Line 8 (Focus text field in window).
Not sure if this is possible, but would it be easier to extract from JSON?
I see the URLs for each job posting in 'commandLink':
Is it possible to scrape the 'commandLink' for each job posting? If so, would that be more simplified?
Maybe this is what @MichaelAnnis was referring to, but not sure how to do that.
You can use the action "Invoke web service" with GET method to get the query results as JSON on this particular website.
After you convert the JSON to a custom object in PAD, you can retrieve the commandLink element from the custom object as below:
Please replace the index number after "listItems" with your LoopIndex (the position highlighted in blue).
I have created the following sample flow, and it works fine:
Web.InvokeWebService Url: $'''https://blackknight.wd1.myworkdayjobs.com/BKC''' Method: Web.Method.Get Accept: $'''application/json''' ContentType: $'''application/json''' ConnectionTimeout: 30 FollowRedirection: True ClearCookies: False FailOnErrorStatus: False EncodeRequestBody: True UserAgent: $'''Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:188.8.131.52) Gecko/20100312 Firefox/3.6''' Encoding: Web.Encoding.AutoDetect AcceptUntrustedCertificates: False ResponseHeaders=> WebServiceResponseHeaders Response=> WebServiceResponse StatusCode=> StatusCode
Variables.ConvertJsonToCustomObject Json: WebServiceResponse CustomObject=> JsonAsCustomObject
SET BaseUrl TO $'''https://blackknight.wd1.myworkdayjobs.com'''
Variables.CreateNewList List=> commandLinkUrl
LOOP LoopIndex FROM 0 TO 49 STEP 1
    Variables.AddItemToList Item: BaseUrl + JsonAsCustomObject['body']['children']['children']['listItems'][LoopIndex]['title']['commandLink'] List: commandLinkUrl NewList=> commandLinkUrl
    DISABLE Display.ShowMessage Message: JsonAsCustomObject['body']['children']['children']['listItems'][LoopIndex]['title']['commandLink'] Icon: Display.Icon.None Buttons: Display.Buttons.OK DefaultButton: Display.DefaultButton.Button1 IsTopMost: False ButtonPressed=> ButtonPressed
END
As a result, you get a list of URLs, as in the screenshot below. 🙂
Please try it.
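For anyone who wants to check the logic outside PAD, here is a rough Python sketch of the same idea: walk the parsed JSON, collect every 'commandLink', and prepend the base URL. It is not a PAD action and not necessarily how the flow above works internally. The sample response, its exact nesting (lists under 'children'), and the example paths are assumptions made up for illustration; only the key names and the base URL come from the flow above. Searching recursively avoids hardcoding the loop bound of 50.

```python
from typing import Any, Iterator

# Base URL taken from the flow above.
BASE_URL = "https://blackknight.wd1.myworkdayjobs.com"


def find_command_links(node: Any) -> Iterator[str]:
    """Yield every string stored under a 'commandLink' key, depth-first."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "commandLink" and isinstance(value, str):
                yield value
            else:
                yield from find_command_links(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_command_links(item)


# Hypothetical sample: the real response is larger and its exact
# nesting is an assumption here; the job paths are invented.
sample_response = {
    "body": {
        "children": [
            {
                "children": [
                    {
                        "listItems": [
                            {"title": {"commandLink": "/en-US/BKC/job/example-1"}},
                            {"title": {"commandLink": "/en-US/BKC/job/example-2"}},
                        ]
                    }
                ]
            }
        ]
    }
}

# Build absolute URLs for however many postings the response contains.
job_urls = [BASE_URL + link for link in find_command_links(sample_response)]
```

With real data you would replace sample_response with the parsed web service response; the rest stays the same.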
@shindomo - I think I am missing something simple in my 'Add item to list' because it seems to pull in the whole JSON and does not look like your screen shot (which would be great).
I noticed that in my loop, the 'Add item to list' action has an apostrophe before the brackets that follow 'JsonAsCustomObject'. In your screenshot there are no such apostrophes (including before and after LoopIndex within the brackets).
What am I missing?
Is it possible to search 'remote' in the search box of the website, click search button, and then scrape everything?
In addition, adding another column with the fixed value 'remote' for each row?
Percent signs (%) cannot be nested. Leave only the outermost pair of percent signs (%).
In the following syntax, variable names are shown in blue text, and it is not necessary to enclose each variable in its own percent signs (%).
%BaseUrl + JsonAsCustomObject['body']['children']['children']['listItems'][LoopIndex]['title']['commandLink']%
See the following document (although it doesn't go into much detail):
@shindomo Is it possible to scrape all the results this way (the page has infinite scrolling) or would this method not work for that?
Thanks for the previous fix! That worked perfectly. I came across that document but still could not figure it out.
There is a paging mechanism implemented in this site to retrieve 50 items at a time, but I don't know anything more about it because the specification for generating GET requests is not clear.
As far as I can tell from the F12 DevTools, when you scroll down and reach the bottom of the page, it fires a GET request to fetch additional data from the web server, and keeps doing so until it reaches the count of search query results.
I can tell it calls the "searchPagination" method; however, I cannot tell how to generate the parts shown in red characters above. Also, "clientRequestID" changes every time it fires a GET request.
Since this issue is not related to PAD itself, perhaps you should find another place to discuss this particular topic. 🙂
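For reference, the paging behaviour described above (50 items per request, repeated until the result count is reached) follows a common loop pattern, sketched here in Python. This only illustrates the loop. The request format for this site is unknown, so fetch_page and the fake data below are hypothetical stand-ins and not the site's real API.

```python
from typing import Callable, List

PAGE_SIZE = 50  # the thread reports 50 items per request


def fetch_all(fetch_page: Callable[[int, int], List[str]], total: int,
              page_size: int = PAGE_SIZE) -> List[str]:
    """Call fetch_page(offset, limit) until `total` items are collected."""
    items: List[str] = []
    while len(items) < total:
        page = fetch_page(len(items), page_size)
        if not page:  # nothing returned: stop instead of looping forever
            break
        items.extend(page)
    return items[:total]


# Hypothetical stand-in for the real request, whose format is not known.
FAKE_POSTINGS = [f"/job/{n}" for n in range(120)]


def fake_fetch_page(offset: int, limit: int) -> List[str]:
    """Return one slice of the fake postings, like one page of results."""
    return FAKE_POSTINGS[offset:offset + limit]


all_links = fetch_all(fake_fetch_page, total=len(FAKE_POSTINGS))
```

In a real flow, fake_fetch_page would be replaced by whatever request the site actually expects, which is the part that remains unclear.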