Matching a Repeating Sub Series using a Regular Expression with PowerShell

Posted by Hinch on Stack Overflow See other posts from Stack Overflow or by Hinch
Published on 2010-05-18T10:18:15Z Indexed on 2010/05/18 11:21 UTC
Read the original article Hit count: 382

Filed under:
|

I have a text file that lists the names of a large number of Excel spreadsheets, and the names of the files that are linked to from the spreadsheets.

In simplified form it looks like this:

"Parent
 File1.xls"

  Link: ChildFileA.xls
  Link: ChildFileB.xls
"ParentFile2.xls"
"ParentFile3.xls"
  Blah
  Link: ChildFileC.xls
  Link: ChildFileD.xls
  More Junk
  Link: ChildFileE.xls
"Parent
 File4.xls"
  Link: ChildFileF.xls

In this example, ParentFile1.xls has embedded links to ChildFileA.xls and ChildFileB.xls, ParentFile2.xls has no embedded links, and ParentFile3.xls has 3 embedded links.

I am trying to write a regular expression in PowerShell that will parse the text file producing output in the following form:

ParentFile1.xls:ChildFileA.xls,ChildFileB.xls
ParentFile3.xls:ChildFileC.xls,ChildFileD.xls,ChildFileE.xls
etc

The task is complicated by the fact that the text file contains a lot of junk between each of the lines, and a parent may not always have a child. Furthermore, a single file name may pass over multiple lines. However, it's not as bad as it sounds, as the parent and child file names are always clearly demarcated (the parent with quotes and the child with a prefix of Link: ).

The PowerShell code I've been using is as follows:

$content = [string]::Join([environment]::NewLine, (Get-Content C:\Temp\text.txt))
$regex = [regex]'(?im)\s*\"(.*)\r?\n?\s*(.*)\"[\s\S]*?Link: (.*)\r?\n?'
$regex.Matches($content) | %{$_.Groups[1].Value + $_.Groups[2].Value + ":" + $_.Groups[3].Value}

Using the example above, it outputs:

ParentFile1.xls:ChildFileA.xls
ParentFile2.xls""ParentFile3.xls:ChildFileC.xls
ParentFile4.xls:ChildFileF.xls

There are two issues. Firstly, the inclusion of the "" instead of a newline whenever a Parent without a Child is processed. And the second issue, which is the most important, is that only a single child is ever shown for each parent. I'm guessing I need to somehow recursively capture and display the multiple child links that exist for each parent, but I'm totally stumped as to how to do this with a regular expression.

Amy help would be greatly appreciated. The file contains 100's of thousands of lines, and manual processing is not an option :)

© Stack Overflow or respective owner

Related posts about regex

Related posts about powershell