MisterH
Helper I

REGEX Capture Groups via Script - Javascript or Python

Hi Everyone,

 

I need to run a REGEX through a large text file (very badly formatted CSV with oodles of issues in parsing) and return the results to PAD. The only way I can see to achieve this is via either JS or Python script. I am currently working with a JS implementation using the following:

 

Set Variable called %REGEX% = ^(?<SKU>[\\d\\w]+),\\d*,\"?(?<Description>.+)\"?,\\$(?<Price>[\\d\\.]+)

Read Text from File into %FileContents%

Run JavaScript:

var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);

 

This does not work 😞

 

I receive the error (truncated) --> Microsoft JScript compilation error: Expected ';'

 

Can anyone point me in the right direction to solving this? I'm guessing that it is something syntactic with the REGEX itself but I am not sure how to debug this particular scenario.

 

Any help greatly appreciated.

 

Cheers

 

MisterH

 

ACCEPTED SOLUTION

Hi Everyone,

 

I have found a way to get this really dirty data from a CSV into PAD in a clean way that takes only seconds, using the scripting engine capabilities.

 

The way to achieve a result is as follows:

  1. Select the filename (with its path) that you want to process
  2. Set a variable with the REGEX pattern you want to use to extract the data from the file
  3. Using the Python Script step run the REGEX on the file and return the result
  4. Get a temporary filename
  5. Export / Save the extracted data, as text, to the temporary file
  6. Load the data from the temporary file with the CSV step to get a 'clean' read of the information
  7. Delete the temporary file since it's no longer needed

In my case the specific Python script I am using is the following (this is some really dirty data with loads of special characters in it, extra commas and quote marks, you name it - the REGEX pattern does the work of grabbing the correct chunks of text from each line):

import re  # Import the REGEX engine

r = r'''%RegEx%'''    # Get PAD variable RegEx (raw text)
f = r'''%CSVFile%'''  # Get PAD variable CSVFile (the data file); raw so backslashes in the path are not treated as escapes

# The REGEX pattern must be compiled. Regex being used is:
#   ^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$
p = re.compile(r, re.MULTILINE)  # Compile the RegEx as a MULTILINE pattern

with open(f, 'r') as file:  # Open the CSVFile for reading
    txt = file.read()       # And read the entire contents into 'txt'

m = p.findall(txt)  # Match the MULTILINE RegEx to capture all groups

OUT = '"SKU"|||"DES"|||"BUY"\n'  # Set the header row
for idx, tup in enumerate(m):
    SKU = tup[0]  # Get the SKU
    DES = re.sub('[\'\"\+\&\\-\.\(\)\*/\,#=]', ' ', tup[1])  # Extract a clean description
    DES = re.sub(' +', ' ', DES)  # Remove multiple whitespaces from cleanup
    BUY = tup[2]  # Get the price
    OUT = OUT + '"' + SKU + '"|||"' + DES + '"|||' + BUY + '\n'  # Add a new line of data

OUT = OUT.strip('\n')  # Remove trailing newline character from last line
print(OUT)  # Return the result to PAD
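
As a possible variation on the output-building loop, the rows can be collected in a list and joined once at the end, which avoids repeated string concatenation on large files. This is only a sketch: the function name build_output is hypothetical, and it assumes the matches are (SKU, description, price) tuples as returned by findall with the three-group pattern.

```python
import re


def build_output(matches):
    """Format (SKU, description, price) tuples as '|||'-delimited text.

    Hypothetical helper: 'matches' is assumed to be the list of 3-tuples
    produced by findall() with a three-group pattern.
    """
    lines = ['"SKU"|||"DES"|||"BUY"']  # header row
    for sku, description, price in matches:
        # Replace the same set of special characters with spaces
        cleaned = re.sub(r'''['"+&\-.()*/,#=]''', ' ', description)
        # Collapse the runs of spaces left behind by the cleanup
        cleaned = re.sub(' +', ' ', cleaned)
        lines.append('"%s"|||"%s"|||%s' % (sku, cleaned, price))
    return '\n'.join(lines)  # a single join, so no trailing newline to strip


print(build_output([('X1', 'A, B', '1.00')]))
```

The '|||' delimiter is kept as in the script above; whatever delimiter is chosen has to match the one configured in the PAD CSV step that reads the temporary file.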

 

The entire PAD process looks like this:

 

[Screenshot: 20220713_111812_PAD.Designer_v5OLFRDh08.png]

 

I hope that this helps anyone trying to deal with large quantities of poorly formatted CSV or TXT data. The process is relatively quick too - about 40 seconds for 16k+ lines.

 

Cheers

 

MisterH


17 REPLIES
MisterH
Helper I

PS - The REGEX pattern works perfectly for its intended task (created and tested in REGEX Buddy)

In javascript every line should end with a semicolon ;

So correct that and if still facing issues then post the original script here to check for errors.

VJR
Super User

I think after re-reading now I understood your above code better.

The first three lines are your PAD actions and the rest is your Javascript code? Is it?

Also check out "Parse Text" which has an option to turn on regex and generate the output.

Hi VJR,

 

Yes, first three lines are the steps in PAD.

1/ Set a variable to hold the REGEX pattern

2/ Read the text 'raw' from the file

3/ Run the JavaScript step below

 

The remaining lines are the JS itself, with %FileContents% and %RegEx% as the PAD variables holding their respective data, and being passed from PAD to the JS step.

var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);

 

 Sorry for the lack of clarity previously. I hope this explains things a little better.

 

MisterH

No issues. Have you checked out the Parse Text action I mentioned above? It has a built-in option where you can pass the regex and get the output.

PS (again) - the Parse Text step is going to be very inefficient for processing the data. Each one of these files has more than 16k lines, and there is going to be a continuous stream of them, all poorly formatted due to the process that produces them - beyond my control I'm afraid. The number of data points I need to extract per line is 3 (as you can see from the capture groups in the REGEX). Handling this in a single step would be a vast improvement, and in theory should be possible with either JS or Python from what I can tell. I'm just not sure how to correctly pass in the data for it to work.

I read this message in your original post but could not correlate it with the code written in JS.

I couldn't correlate it the second time either. It's not about your explanation but about me :).

 

I think the below is the equivalent of what you are trying to achieve via the JavaScript code.

When you disable the "First occurrence only", the matches variable will return multiple matches.

Post back if it is much more complicated.

 

[Screenshot: VJR_0-1657003634781.png]

 

Here is a sample data set with some of the 'cleaner' data that is available in one of the files:

 

ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
ACEFOBKITSTART,458922514,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, PROG KEYPAD HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,
CS4828,,CS iKEY 2 DOOR KIT 4828 with 10 iKEYS,$123.00,2 & 4 Door Kits,CST,

 

and yes, there is a comma(,) at the end of every line.

As a side note to my above screenshot on Parse Text: to access PAD variables in a script, enclose them in quotes.

 

var csv = "%FileContents%";

Parse Text can do multi-match, but it doesn't seem to handle multiple capture groups and simply returns the entire string between where the match starts and finishes. I need to specifically extract the capture groups, which would mean running Parse Text 3 times per line of text. This means on a 16k line file there are 48k steps inside of a for each loop, giving 80k steps in total. Horrifically inefficient.

Getting a strange error with this --> JScript compilation error: Unterminated string constant

 

I wonder if this is due to the raw text containing quotation marks?
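
A minimal sketch of why embedded quotation marks (or line breaks) could produce that kind of error, assuming PAD pastes the variable's text into the script source before running it. The sketch simulates that substitution in Python rather than JScript; the TEMPLATE string and the helper names substitute and compiles are hypothetical, and the sample values are shortened from the data in this thread.

```python
TEMPLATE = 'csv = "%FileContents%"'  # script text as typed into the script step


def substitute(template, value):
    """Mimic a plain text substitution of the variable into the script."""
    return template.replace('%FileContents%', value)


def compiles(source):
    """Return True if the generated script is syntactically valid."""
    try:
        compile(source, '<script>', 'exec')
        return True
    except SyntaxError:
        return False


clean = 'CS4836,,CS iKEY 1 DOOR KIT'
dirty = 'ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT",$123.00'
multiline = 'line one\nline two'

print(compiles(substitute(TEMPLATE, clean)))      # text without quotes or newlines
print(compiles(substitute(TEMPLATE, dirty)))      # embedded quotes end the literal early
print(compiles(substitute(TEMPLATE, multiline)))  # a raw newline inside "..." is not allowed
```

If the substitution does work this way, passing only the file name into the script and reading the file inside the script sidesteps the problem, because the file contents never become part of the script source.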

I did not get any error after running the JavaScript, nor did it return any results.

 

Moreover, on testing the regex with the text you shared it is not returning anything 

 

https://regex101.com/

 

[Screenshot: VJR_0-1657006474379.png]

It does, you need to unescape the double-escaped slashes.
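
To illustrate the difference, here is a small sketch using only the price part of the pattern (the sample line is shortened from the data posted above, and Python is used purely for demonstration). In the double-escaped form, \\ means a literal backslash, so the pattern no longer means "a dollar sign followed by digits".

```python
import re

line = 'CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,'

double_escaped = r'\\$([\\d\\.]+)'  # as pasted in the first post
single_escaped = r'\$([\d\.]+)'     # as intended

# The double-escaped pattern looks for a literal backslash, which the data lacks
print(re.search(double_escaped, line))

# The intended pattern captures the price after the dollar sign
print(re.search(single_escaped, line).group(1))

# Collapsing each pair of backslashes into one recovers the intended pattern
unescaped = double_escaped.replace('\\\\', '\\')
print(unescaped == single_escaped)
```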

 @MisterH, please refer to @mscheetham's suggestion regarding the regular expression that you found to be working in REGEX Buddy.

MisterH
Helper I

Ok Everyone,

 

Thanks for having a crack at this. Here's an update with a little more research and experimentation done:

  • The JavaScript step is not going to be able to run the necessary code due to how badly munted the CSV data actually is. The contents of the file are poor and no amount of cleaning will guarantee a viable result.
  • The JavaScript step will bomb-out at the point of taking the PAD variable ('FileContents') due to how malformed the data is - this is the cause of most of the errors

Switching to the Python step instead:

  • Updated the REGEX to a Python 2.7 standard (no named capture groups)
  • Had to Parse/Replace any and all double quote marks in the 'FileContents' with 'nothing'
  • Updated Python code runs without error, however I am now struggling to output the results (see below)

import re
txt = r"""%FileContents%"""
reg = """%RegEx%"""
p = re.compile(reg)
m=p.match(txt)
print(m)

 

I am guessing that I am going to need to do something with the matches to output them in another data type to get the results back into PAD. When I try to use the regex string as a test output I am also not getting anything back - this is a simple print statement for the variable 'reg' in the code.
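
For reference, a short sketch of how match and findall behave on multi-line text in Python's re module. The two-line input and the simplified two-group pattern are hypothetical stand-ins for the real file and RegEx. match only tries the start of the text and returns a single match object (or None), while findall with re.MULTILINE returns one tuple of captured groups per matching line, which can then be formatted as text for printing.

```python
import re

text = 'A1,$1.50\nB2,$2.75\n'    # hypothetical two-line input
pattern = r'^(\w+),\$([\d.]+)$'  # simplified stand-in for the real RegEx

# Without MULTILINE, $ cannot match at the end of the first line, so match() fails
no_flag = re.compile(pattern).match(text)
print(no_flag)

# With MULTILINE, match() succeeds but still only looks at the start of the text
multi = re.compile(pattern, re.MULTILINE)
print(multi.match(text).groups())

# findall() walks the whole text and returns the captured groups for every line
rows = multi.findall(text)
print(rows)

# Turn the tuples into plain text before printing them back to the caller
print('\n'.join('|'.join(row) for row in rows))
```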

 

Any ideas?

 

MisterH

Hi VJR,

 

I am sorry that the copy / paste from Windows into this forum 'escaped' the slash characters. I did not see that when I made the original post. My apologies.

 

In the running process the regex is not 'escaped' and is as it should be. It has now had a few minor updates to accommodate some more of the vagaries that this file keeps coming up with - an endless set of pitfalls and traps it seems.

 

^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$
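
As a quick check of what this pattern captures, here is a sketch that runs it over two of the sample rows posted earlier in the thread, using Python's re module with the MULTILINE flag. The variable names are illustrative only.

```python
import re

PATTERN = r'^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$'

# Two rows from the sample data: one with a quoted description, one without
SAMPLE = (
    'ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, '
    'HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,\n'
    'CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,\n'
)

# MULTILINE lets ^ and $ anchor on every line rather than on the whole text
rows = re.compile(PATTERN, re.MULTILINE).findall(SAMPLE)

for sku, description, price in rows:
    print(sku, '|', description, '|', price)
```

Each row comes back as a (SKU, description, price) tuple, with the surrounding quote marks excluded from the description and the commas inside it preserved.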

 

There is no way for JS to handle the raw text of the file, so I have switched to Python to try and handle it. I am thinking it is going to have to be built solely in Python with an output being provided back to PAD.

1/ Get the filename to process

2/ Pass the filename into the Python Script step

3/ Have Python load and process the file using REGEX

4/ Have Python construct a suitable output format / variable to return to PAD

5/ Give the result to PAD and continue with the next steps

 

I'll update as I go. I hope that this might be helpful to others when I get it finished.

 

MisterH

