MisterH
Helper I

REGEX Capture Groups via Script - Javascript or Python

Hi Everyone,

 

I need to run a REGEX through a large text file (very badly formatted CSV with oodles of issues in parsing) and return the results to PAD. The only way I can see to achieve this is via either JS or Python script. I am currently working with a JS implementation using the following:

 

Set Variable called %REGEX% = ^(?<SKU>[\\d\\w]+),\\d*,\"?(?<Description>.+)\"?,\\$(?<Price>[\\d\\.]+)

Read Text from File into %FileContents%

Run JavaScript:

var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);

 

This does not work 😞

 

I receive the error (truncated) --> Microsoft JScript compilation error: Expected ';'

 

Can anyone point me in the right direction to solving this? I'm guessing that it is something syntactic with the REGEX itself but I am not sure how to debug this particular scenario.

 

Any help greatly appreciated.

 

Cheers

 

MisterH

 

1 ACCEPTED SOLUTION

Hi Everyone,

 

I have found a way to get this really dirty data from a CSV into PAD in a clean way that takes only seconds, using the scripting engine capabilities.

 

The way to achieve a result is as follows:

  1. Select the filename (with its path) that you want to process
  2. Set a variable with the REGEX pattern you want to use to extract the data from the file
  3. Using the Python Script step run the REGEX on the file and return the result
  4. Get a temporary filename
  5. Export / Save the extracted data, as text, to the temporary file
  6. Load the data from the temporary file with the CSV step to get a 'clean' read of the information
  7. Delete the temporary file since it's no longer needed

In my case the specific Python script I am using is the following (this is some really dirty data with loads of special characters in it, extra commas and quote marks, you name it - the REGEX pattern does the work of grabbing the correct chunks of text from each line):

import re                                        # Import the regex engine

r = r'''%RegEx%'''                               # PAD variable RegEx, as a raw string
# PAD variable CSVFile (path to the data file); raw so backslashes in the path are kept
f = r'''%CSVFile%'''

# Regex being used: ^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$
p = re.compile(r, re.MULTILINE)                  # MULTILINE so ^ and $ match at every line

with open(f, 'r') as file:                       # Open the CSV file for reading
    txt = file.read()                            # Read the entire contents into 'txt'

m = p.findall(txt)                               # One (SKU, description, price) tuple per matching line

rows = ['"SKU"|||"DES"|||"BUY"']                 # Header row
for SKU, DES, BUY in m:
    DES = re.sub(r'[\'"+&\-.()*/,#=]', ' ', DES) # Replace special characters with spaces
    DES = re.sub(' +', ' ', DES)                 # Collapse the runs of spaces left by the cleanup
    rows.append('"' + SKU + '"|||"' + DES + '"|||' + BUY)  # Add a new line of data

print('\n'.join(rows))                           # Return the result to PAD
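
Steps 4 to 7 in the list above are PAD actions in the original flow. Purely as an illustration, here is a minimal Python sketch of the same round trip: write the '|||'-delimited text to a temporary file, read it back as rows, then delete the file. The function name `round_trip` and the sample row are hypothetical and not part of the original flow.

```python
import os
import tempfile


def round_trip(text, sep='|||'):
    """Write text to a temporary file, read it back as rows, then delete the file."""
    fd, path = tempfile.mkstemp(suffix='.txt')      # Step 4: get a temporary filename
    try:
        with os.fdopen(fd, 'w') as handle:          # Step 5: save the extracted data as text
            handle.write(text)
        with open(path, 'r') as handle:             # Step 6: load it back, splitting on the separator
            rows = [line.rstrip('\n').split(sep) for line in handle]
    finally:
        os.remove(path)                             # Step 7: delete the temporary file
    # Strip the surrounding double quotes from each cell
    return [[cell.strip('"') for cell in row] for row in rows]


# Hypothetical output in the same format the script above prints
out = '"SKU"|||"DES"|||"BUY"\n"CS4836"|||"CS iKEY 1 DOOR KIT 4836 with 10 iKEYS"|||123.00'
rows = round_trip(out)
print(rows[0])
print(rows[1])
```

A three-character separator such as '|||' is unlikely to appear inside a description, which is presumably why it is a safer choice than a comma for this data.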

 

The entire PAD process looks like this:

 

20220713_111812_PAD.Designer_v5OLFRDh08.png

 

I hope that this helps anyone trying to deal with large quantities of poorly formatted CSV or TXT data. The process is relatively quick too - about 40 seconds for 16k+ lines.

 

Cheers

 

MisterH


17 REPLIES
MisterH
Helper I

PS - The REGEX pattern works perfectly for its intended task (created and tested in REGEX Buddy)

VJR
Multi Super User

In JavaScript, every statement should end with a semicolon (;).

Correct that, and if you are still facing issues, post the original script here so we can check it for errors.

VJR
Multi Super User

After re-reading, I think I understand your code above better.

The first three lines are your PAD actions and the rest is your JavaScript code, is that right?

Also check out "Parse Text", which has an option to turn on regex and generate the output.

Hi VJR,

 

Yes, first three lines are the steps in PAD.

1/ Set a variable to hold the REGEX pattern

2/ Read the text 'raw' from the file

3/ Run the JavaScript step below

 

The remaining lines are the JS itself, with %FileContents% and %RegEx% as the PAD variables holding their respective data, and being passed from PAD to the JS step.

var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);

 

 Sorry for the lack of clarity previously. I hope this explains things a little better.

 

MisterH

VJR
Multi Super User

No problem. Have you tried the Parse Text action I mentioned above? It has a built-in option where you can pass the regex and get the output.

PS (again) - the Parse Text step is going to be very inefficient for processing the data. Each of these files has more than 16k lines, and there is going to be a continuous stream of them, all poorly formatted because of the process that produces them (beyond my control, I'm afraid). I need to extract 3 data points per line (as you can see from the capture groups in the REGEX). Handling this in a single step would be a vast improvement, and in theory it should be possible with either JS or Python from what I can tell. I'm just not sure how to pass the data in correctly for it to work.

VJR
Multi Super User

I read this message in your original post but could not correlate it with the code written in JS.

I could not correlate it the second time either. It's not about your explanation but about me :).

I think the screenshot below is the equivalent of what you are trying to achieve with the JavaScript code.

When you disable "First occurrence only", the matches variable will return multiple matches.

Post back if it is more complicated than that.

 

VJR_0-1657003634781.png

 

Here is a sample data set with some of the 'cleaner' data that is available in one of the files:

 

ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
ACEFOBKITSTART,458922514,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, PROG KEYPAD HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,
CS4828,,CS iKEY 2 DOOR KIT 4828 with 10 iKEYS,$123.00,2 & 4 Door Kits,CST,

 

and yes, there is a comma(,) at the end of every line.
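
For reference, here is a small sketch of how the pattern quoted in the accepted solution behaves on these four sample lines. It assumes the pattern is compiled with re.MULTILINE and applied with findall, as in that solution; the names PATTERN, SAMPLE and matches are only illustrative.

```python
import re

# Pattern quoted in the accepted solution; MULTILINE makes ^ and $ apply to each line.
PATTERN = re.compile(r'^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$', re.MULTILINE)

# The four sample lines posted above, trailing commas included.
SAMPLE = '''ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
ACEFOBKITSTART,458922514,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, PROG KEYPAD HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,
CS4828,,CS iKEY 2 DOOR KIT 4828 with 10 iKEYS,$123.00,2 & 4 Door Kits,CST,
'''

# findall returns one (SKU, description, price) tuple per matching line.
matches = PATTERN.findall(SAMPLE)
for sku, description, price in matches:
    print(sku, '|', description, '|', price)
```

All four lines match, and the commas inside the quoted descriptions stay within the description group because the only ",$" on each line is the one in front of the price.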

VJR
Multi Super User

As a side note to my Parse Text screenshot above: to access a PAD variable in a script, enclose it in quotes.

 

var csv = "%FileContents%";

Parse Text can do multi-match, but it doesn't seem to handle multiple capture groups; it simply returns the whole string between where the match starts and where it finishes. I need to extract the capture groups specifically, which would mean running Parse Text 3 times per line of text. On a 16k-line file that is 48k steps inside a For Each loop, for about 80k steps in total. Horrifically inefficient.
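
To make the distinction concrete, here is a short Python sketch of "whole match" versus "capture groups". It uses a simplified, hypothetical pattern rather than the production one, applied to one of the sample lines.

```python
import re

line = 'CS4828,,CS iKEY 2 DOOR KIT 4828 with 10 iKEYS,$123.00,2 & 4 Door Kits,CST,'

# Simplified illustrative pattern: SKU, an ignored second field, description, price.
m = re.match(r'^([\w-]+),[^,]*,(.+),\$([\d.]+),', line)

print(m.group(0))   # The whole matched span, from the SKU through the price
print(m.groups())   # Only the three captured fields, as a tuple
```

A script step can return the three captured fields from one pass, which is what avoids running a separate parse per field.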

Getting a strange error with this --> JScript compilation error: Unterminated string constant

 

I wonder if this is due to the raw text containing quotation marks?
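
That guess is plausible if PAD pastes the variable's text straight into the script before running it, which is an assumption here rather than something the thread confirms. The sketch below imitates that substitution in Python and uses Python's own compile() as a stand-in for the JScript compiler; substitute and is_valid_python are hypothetical helper names.

```python
def substitute(template, variables):
    """Textually replace %Name% placeholders (how PAD is assumed to inject variables)."""
    for name, value in variables.items():
        template = template.replace('%' + name + '%', value)
    return template


def is_valid_python(source):
    """Return True if the source compiles, False on a syntax error."""
    try:
        compile(source, '<script>', 'exec')
        return True
    except SyntaxError:
        return False


template = 'csv = "%FileContents%"'

clean = 'CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00'
quoted = 'ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT",$123.00'   # Contains double quotes
multiline = 'line one\nline two'                            # Contains a real line break

print(is_valid_python(substitute(template, {'FileContents': clean})))      # True
print(is_valid_python(substitute(template, {'FileContents': quoted})))     # False
print(is_valid_python(substitute(template, {'FileContents': multiline})))  # False
```

Both the embedded quotes and the line breaks end the string literal early, so a whole file pasted between quotes is very unlikely to survive as a single string.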

VJR
Multi Super User

I did not get any error after running the JavaScript, nor did it return any results.

Moreover, on testing the regex with the text you shared, it is not returning anything.

 

https://regex101.com/

 

VJR_0-1657006474379.png

It does, you need to unescape the double-escaped slashes.
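
A small Python sketch of why the double-escaped form finds nothing, using only the first group of the pattern for brevity: inside a character class, `\\d` means "a backslash or the letter d", not "a digit".

```python
import re

line = 'CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,'

# First group as it appeared in the original post, with doubled backslashes.
double_escaped = r'^([\\d\\w]+),'

# Collapse each doubled backslash into a single one.
unescaped = double_escaped.replace('\\\\', '\\')

print(re.match(double_escaped, line))       # None: the class only allows '\', 'd' and 'w'
print(re.match(unescaped, line).group(1))   # CS4836
```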

 @MisterH, please refer to @mscheetham's suggestion regarding the regular expression that you found to be working in REGEX Buddy.

MisterH
Helper I

Ok Everyone,

 

Thanks for having a crack at this. Here's an update with a little more research and experimentation done:

  • The JavaScript step is not going to be able to run the necessary code due to how badly munted the CSV data actually is. The contents of the file are poor and no amount of cleaning will guarantee a viable result.
  • The JavaScript step will bomb-out at the point of taking the PAD variable ('FileContents') due to how malformed the data is - this is the cause of most of the errors

Switching to the Python step instead:

  • Updated the REGEX to a Python 2.7 standard (no named capture groups)
  • Had to Parse/Replace any and all double quote marks in the 'FileContents' with 'nothing'
  • The updated Python code runs without error, but I am now struggling to output the results (see below):

import re
txt = r"""%FileContents%"""
reg = """%RegEx%"""
p = re.compile(reg)
m=p.match(txt)
print(m)

 

I am guessing that I am going to need to do something with the matches to output them in another data type to get the results back into PAD. When I try to use the regex string as a test output I am also not getting anything back - this is a simple print statement for the variable 'reg' in the code.
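
One likely reason for the empty-looking output: match() only tries the pattern at the very start of the string and returns a single match object (or None), so print(m) shows the object rather than the captured text. Below is a minimal sketch of one way to turn every match into printable text. The two-line input and the simplified pattern are hypothetical. Note that Python does have named groups, but spells them (?P<name>...) rather than (?<name>...).

```python
import re

# Hypothetical two-line input standing in for the real file contents.
text = 'A1,,first item,$1.50,x,\nB2,,second item,$2.75,y,\n'

# Simplified pattern with Python-style named groups; MULTILINE lets ^ match on each line.
pattern = re.compile(
    r'^(?P<SKU>[\w-]+),[^,]*,(?P<Description>.+),\$(?P<Price>[\d.]+),',
    re.MULTILINE)

lines = []
for m in pattern.finditer(text):        # One match object per matching line
    fields = m.groupdict()              # {'SKU': ..., 'Description': ..., 'Price': ...}
    lines.append(fields['SKU'] + '|' + fields['Description'] + '|' + fields['Price'])

result = '\n'.join(lines)
print(result)                           # Plain text that a calling flow can read back
```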

 

Any ideas?

 

MisterH

Hi VJR,

 

I am sorry that the copy / paste from Windows into this forum 'escaped' the slash characters. I did not see that when I made the original post. My apologies.

 

In the running process the regex is not 'escaped'; it is as you suggested it should be. It has since had a few minor updates to accommodate more of the vagaries that this file keeps coming up with (an endless set of pitfalls and traps, it seems).

 

^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$

 

There is no way for JS to handle the raw text of the file, so I have switched to Python to try and handle it. I am thinking it is going to have to be built solely in Python with an output being provided back to PAD.

1/ Get the filename to process

2/ Pass the filename into the Python Script step

3/ Have Python load and process the file using REGEX

4/ Have Python construct a suitable output format / variable to return to PAD

5/ Give the result to PAD and continue with the next steps

 

I'll update as I go. I hope that this might be helpful to others when I get it finished.

 

MisterH

