Hi Everyone,
I need to run a REGEX through a large text file (a very badly formatted CSV with oodles of parsing issues) and return the results to PAD. The only way I can see to achieve this is via either a JS or a Python script. I am currently working with a JS implementation using the following:
Set Variable called %REGEX% = ^(?<SKU>[\\d\\w]+),\\d*,\"?(?<Description>.+)\"?,\\$(?<Price>[\\d\\.]+)
Read Text from File into %FileContents%
Run Javascript:
var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);
This does not work 😞
I receive the error (truncated) --> Microsoft JScript compilation error: Expected ';'
Can anyone point me in the right direction to solving this? I'm guessing that it is something syntactic with the REGEX itself but I am not sure how to debug this particular scenario.
Any help greatly appreciated.
Cheers
MisterH
PS - The REGEX pattern works perfectly for its intended task (created and tested in REGEX Buddy)
In JavaScript, every statement should end with a semicolon (;).
So correct that, and if you are still facing issues then post the original script here so it can be checked for errors.
After re-reading, I think I now understand your code above better.
The first three lines are your PAD actions and the rest is your JavaScript code? Is that right?
Also check out "Parse Text", which has an option to turn on regex and generate the output.
Hi VJR,
Yes, first three lines are the steps in PAD.
1/ Set a variable to hold the REGEX pattern
2/ Read the text 'raw' from the file
3/ Run the JavaScript step below
The remaining lines are the JS itself, with %FileContents% and %RegEx% as the PAD variables holding their respective data, and being passed from PAD to the JS step.
var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);
Sorry for the lack of clarity previously. I hope this explains things a little better.
MisterH
No issues. Have you checked out the Parse Text action I mentioned above? It has a built-in option where you can pass the regex and get the output.
PS (again) - the Parse Text step is going to be very inefficient for processing the data. Each one of these files has more than 16k lines, and there is going to be a continuous stream of them, all poorly formatted due to the process that produces them - beyond my control I'm afraid. The number of data points I need to extract per line is 3 (as you can see from the capture groups in the REGEX), however handling this in a single step would be a vast improvement and in theory should be possible with either JS or Python from what I can tell. I'm just not sure how to correctly pass in the data for it to work.
I read this message in your original post but could not correlate it with the code written in JS.
I couldn't correlate it the second time either. It's not about your explanation but about me :).
I think the below is the equivalent of what you are trying to achieve via the JavaScript code.
When you disable "First occurrence only", the matches variable will return multiple matches.
Post back if it is much more complicated.
Here is a sample data set with some of the 'cleaner' data that is available in one of the files:
ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
ACEFOBKITSTART,458922514,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, PROG KEYPAD HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,
CS4828,,CS iKEY 2 DOOR KIT 4828 with 10 iKEYS,$123.00,2 & 4 Door Kits,CST,
and yes, there is a comma(,) at the end of every line.
As a side note to my above screenshot on Parse Text, to access a PAD variable in a script, enclose it in quotes:
var csv = "%FileContents%";
Parse Text can do multi-match, but it doesn't seem to handle multiple capture groups; it simply returns the whole string between where the match starts and finishes. I need to extract the capture groups specifically, which would mean running Parse Text 3 times per line of text. On a 16k-line file that is 48k steps inside a For Each loop, giving 80k steps in total. Horrifically inefficient.
Getting a strange error with this --> JScript compilation error: Unterminated string constant
I wonder if this is due to the raw text containing quotation marks?
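To illustrate why embedded quotes (and line breaks) could produce that kind of error, here is a rough Python analogue. It assumes PAD pastes the variable's text into the script verbatim before the script runs; the helper names substitute and is_valid_python are made up for this sketch, and it checks Python source rather than JScript, so treat it as an illustration of the idea only.

```python
def substitute(template: str, name: str, value: str) -> str:
    """Mimic a verbatim %Name% text substitution (assumed behaviour)."""
    return template.replace("%" + name + "%", value)


def is_valid_python(source: str) -> bool:
    """Return True if the generated source compiles, False on a syntax error."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


# The script line as written in the editor, before substitution.
template = 'csv = "%FileContents%"'

# Sample lines taken from this thread.
clean_line = 'CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,'
quoted_line = ('ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, '
               'HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,')
two_lines = clean_line + "\n" + clean_line

# One line without quotes survives the substitution.
print(is_valid_python(substitute(template, "FileContents", clean_line)))
# An embedded double quote ends the string literal early.
print(is_valid_python(substitute(template, "FileContents", quoted_line)))
# A line break inside a single-line string literal is also rejected.
print(is_valid_python(substitute(template, "FileContents", two_lines)))
```

A triple-quoted raw string in Python tolerates line breaks and most embedded quotes, which is one reason passing a file name and reading the file inside the script is less fragile than pasting the contents into the source.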
I did not get any error after running the JavaScript, nor did it return any results.
Moreover, on testing the regex with the text you shared, it is not returning anything.
It does; you need to unescape the double-escaped slashes.
@MisterH, please refer to @mscheetham's suggestion regarding the regular expression that you found to be working in REGEX Buddy.
Ok Everyone,
Thanks for having a crack at this. Here's an update with a little more research and experimentation done:
Switching to the Python step instead:
import re
txt = r"""%FileContents%"""
reg = """%RegEx%"""
p = re.compile(reg)
m=p.match(txt)
print(m)
I am guessing that I am going to need to do something with the matches to output them in another data type to get the results back into PAD. When I try to use the regex string as a test output I am also not getting anything back - this is a simple print statement for the variable 'reg' in the code.
Any ideas?
MisterH
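A note on the Python attempt above: re.match only tries the pattern at the very start of the text, and without re.MULTILINE the `$` anchor only matches at the end of the whole text, so on a multi-line file it tends to return None. The sketch below contrasts that with findall plus re.MULTILINE. It uses two of the sample lines from this thread and the refined pattern posted in the thread; the variable names are only for illustration.

```python
import re

# Refined pattern from this thread: SKU, description and price as capture groups.
PATTERN = r'^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$'

# Two sample lines from this thread, joined the way they would appear in a file.
SAMPLE = "\n".join([
    'ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, '
    'HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,',
    'CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,',
])

# Without re.MULTILINE, `$` cannot match at the end of the first line,
# so match() finds nothing on multi-line text.
first_only = re.compile(PATTERN).match(SAMPLE)
print(first_only)

# With re.MULTILINE, `^` and `$` anchor at every line, and findall()
# returns one tuple of capture groups per matching line.
all_rows = re.compile(PATTERN, re.MULTILINE).findall(SAMPLE)
for sku, description, price in all_rows:
    print(sku, "|", description, "|", price)
```

Note also that Python spells named groups as (?P<name>...), so a pattern written with (?<SKU>...) would need adjusting before re.compile accepts it.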
Hi VJR,
I am sorry that the copy / paste from Windows into this forum 'escaped' the slash characters. I did not see that when I made the original post. My apologies.
In the running process the regex is not 'escaped' and is as suggested. It has since had a few minor updates to accommodate some more of the vagaries that this file keeps coming up with; it seems to be an endless set of pitfalls and traps.
^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$
There is no way for JS to handle the raw text of the file, so I have switched to Python to try and handle it. I am thinking it is going to have to be built solely in Python with an output being provided back to PAD.
1/ Get the filename to process
2/ Pass the filename into the Python Script step
3/ Have Python load and process the file using REGEX
4/ Have Python construct a suitable output format / variable to return to PAD
5/ Give the result to PAD and continue with the next steps
I'll update as I go. I hope that this might be helpful to others when I get it finished.
MisterH
Hi Everyone,
I have found a way to get this really dirty data from a CSV into PAD in a clean way that takes only seconds, using the scripting engine capabilities.
The way to achieve a result is as follows:
In my case the specific Python script I am using is the following (this is some really dirty data with loads of special characters in it, extra commas and quote marks, you name it - the REGEX pattern does the work of grabbing the correct chunks of text from each line):
import re #Import the REGEX Engine
r = r'''%RegEx%''' #Get PAD variable RegEx (raw text)
f = r'''%CSVFile%''' #Get PAD variable CSVFile (the data file); raw so backslashes in a Windows path stay literal
#The REGEX pattern must be compiled. Regex being used is ^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$
p = re.compile(r, re.MULTILINE) #Compile the RegEx as a MULTILINE pattern
with open(f, 'r') as file: #Open the CSVFile for reading
    txt = file.read() #And read the entire contents into 'txt'
m = p.findall(txt) #Match the MULTILINE RegEx to capture all groups
OUT = '"SKU"|||"DES"|||"BUY"\n' #Set the header row
for tup in m: #Each match is a tuple of the three capture groups
    SKU = tup[0] #Get the SKU
    DES = re.sub(r'[\'\"+&\-.()*/,#=]',' ',tup[1]) #Extract a clean description
    DES = re.sub(' +', ' ', DES) #Remove multiple whitespaces from cleanup
    BUY = tup[2] #Get the price
    OUT = OUT + '"' + SKU + '"|||"' + DES + '"|||' + BUY + '\n' #Add a new line of data
OUT = OUT.strip('\n') #Remove trailing newline character from last line
print(OUT) #return the result to PAD
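For anyone who wants to test the logic outside PAD, the same steps can be wrapped in a function and run against a string. This is only a sketch of an equivalent restructuring, not the script I run in PAD: the name build_output is made up, the pattern is hard-coded rather than read from %RegEx%, and the rows are collected in a list and joined once instead of being concatenated one at a time.

```python
import re

# Pattern from this thread, compiled once as a MULTILINE pattern.
ROW_PATTERN = re.compile(
    r'^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$',
    re.MULTILINE,
)
# Same set of characters the script replaces with a space.
PUNCTUATION = re.compile(r'[\'"+&\-.()*/,#=]')


def build_output(text: str) -> str:
    """Return the |||-delimited table for every line the pattern matches."""
    rows = ['"SKU"|||"DES"|||"BUY"']  # header row
    for sku, description, price in ROW_PATTERN.findall(text):
        cleaned = PUNCTUATION.sub(' ', description)  # strip special characters
        cleaned = re.sub(' +', ' ', cleaned)         # collapse runs of spaces
        rows.append(f'"{sku}"|||"{cleaned}"|||{price}')
    return '\n'.join(rows)  # no trailing newline to strip afterwards


# Two of the sample lines posted earlier in this thread.
sample = "\n".join([
    'ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, '
    'HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,',
    'CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,',
])
print(build_output(sample))
```

Keeping the parsing in a function like this makes it easy to check new pattern tweaks against a handful of awkward lines before pointing the flow at a full file.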
The entire PAD process looks like this:
I hope that this helps anyone trying to deal with large quantities of poorly formatted CSV or TXT data. The process is relatively quick too - about 40 seconds for 16k+ lines.
Cheers
MisterH