Matching a Repeating Sub Series using a Regular Expression with PowerShell
- by Hinch
I have a text file that lists the names of a large number of Excel spreadsheets, and the names of the files that are linked to from the spreadsheets.
In simplified form it looks like this:
"Parent
File1.xls"
Link: ChildFileA.xls
Link: ChildFileB.xls
"ParentFile2.xls"
"ParentFile3.xls"
Blah
Link: ChildFileC.xls
Link: ChildFileD.xls
More Junk
Link: ChildFileE.xls
"Parent
File4.xls"
Link: ChildFileF.xls
In this example, ParentFile1.xls has embedded links to ChildFileA.xls and ChildFileB.xls, ParentFile2.xls has no embedded links, and ParentFile3.xls has 3 embedded links.
I am trying to write a regular expression in PowerShell that will parse the text file producing output in the following form:
ParentFile1.xls:ChildFileA.xls,ChildFileB.xls
ParentFile3.xls:ChildFileC.xls,ChildFileD.xls,ChildFileE.xls
etc
The task is complicated by the fact that the text file contains a lot of junk between each of the lines, and a parent may not always have a child. Furthermore, a single file name may pass over multiple lines. However, it's not as bad as it sounds, as the parent and child file names are always clearly demarcated (the parent with quotes and the child with a prefix of Link: ).
The PowerShell code I've been using is as follows:
$content = [string]::Join([environment]::NewLine, (Get-Content C:\Temp\text.txt))
$regex = [regex]'(?im)\s*\"(.*)\r?\n?\s*(.*)\"[\s\S]*?Link: (.*)\r?\n?'
$regex.Matches($content) | %{$_.Groups[1].Value + $_.Groups[2].Value + ":" + $_.Groups[3].Value}
Using the example above, it outputs:
ParentFile1.xls:ChildFileA.xls
ParentFile2.xls""ParentFile3.xls:ChildFileC.xls
ParentFile4.xls:ChildFileF.xls
There are two issues. Firstly, the inclusion of the "" instead of a newline whenever a Parent without a Child is processed. And the second issue, which is the most important, is that only a single child is ever shown for each parent. I'm guessing I need to somehow recursively capture and display the multiple child links that exist for each parent, but I'm totally stumped as to how to do this with a regular expression.
Amy help would be greatly appreciated. The file contains 100's of thousands of lines, and manual processing is not an option :)