Use the power of regular expressions to cleanse your data right there inside the Data Flow. This transformation includes a full user interface for simple configuration, as well as advanced features such as error output configuration.
Two regular expressions are used: a match expression and a replace expression. The transformation is designed around named capture groups (match groups), and even supports multiple expressions. This allows rich and complex expressions to be built, all through an easy-to-reuse transformation where a bespoke Script Component was previously the only alternative.
Some simple properties are available for each column selected:
Behaviour
The two behaviour modes offer similar functionality, with one difference: Replace replaces the matched tokens in the input, whereas Emit overwrites the whole string.
Cascade
Cascade allows you to define multiple expressions, each on a new line. The match expression will be processed into one operation per line, which are then processed in order at run-time. Multiple replace expressions can also be specified, again each on a new line. If there is no corresponding replace expression for a match expression line, then the last replace expression will be used instead. It is common to have multiple match expressions, but only a single replace expression.
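The pairing rule described above can be sketched in a few lines. This is only an illustration in Python, not the component's own code: the helper names pair_expressions and apply_cascade, the date patterns and the sample values are all hypothetical, and feeding each operation the result of the previous one is an assumption about what "processed in order" means. Python also writes named groups as (?P<name>...) and references them as \g<name>, where the transformation's .NET syntax uses (?<name>...) and ${name}.

```python
import re

def pair_expressions(match_text, replace_text):
    """Pair each match line with a replace line.

    When there are fewer replace lines than match lines, the last
    replace line is reused, as described for the Cascade option.
    """
    matches = [line for line in match_text.splitlines() if line.strip()]
    replaces = [line for line in replace_text.splitlines() if line.strip()]
    if not matches or not replaces:
        raise ValueError("at least one match and one replace expression are required")
    pairs = []
    for i, pattern in enumerate(matches):
        replacement = replaces[i] if i < len(replaces) else replaces[-1]
        pairs.append((pattern, replacement))
    return pairs

def apply_cascade(value, pairs):
    """Run each operation in order (assumption: each sees the previous result)."""
    for pattern, replacement in pairs:
        value = re.sub(pattern, replacement, value)
    return value

# Hypothetical expressions: two match lines, one shared replace line.
MATCH_LINES = r"""^(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})$
^(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})$"""
REPLACE_LINES = r"\g<year>-\g<month>-\g<day>"

pairs = pair_expressions(MATCH_LINES, REPLACE_LINES)
print(apply_cascade("28/01/2008", pairs))  # 2008-01-28
print(apply_cascade("20080128", pairs))    # 2008-01-28
```

Both input formats end up in the same shape because the second match line falls back to the only replace line, which is the common case of several match expressions sharing a single replace expression.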
Match Expression
The expression used to define the named capture groups. This is where you can analyse the data, and tag or name elements within it as found by the match expression.
Replace Expression
The replace expression determines the final output. It references the named groups from the match expression and assembles them into the final output.
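As a rough illustration of how the two expressions work together, the sketch below tags parts of a value with named groups and then assembles them in a different order. It is written in Python rather than the .NET syntax used by the transformation, so (?<last>...) becomes (?P<last>...) and ${last} becomes \g<last>. The pattern, the sample value and the function name tag_and_assemble are hypothetical, and returning None for a non-matching value is an assumption, not documented behaviour.

```python
import re

# Hypothetical match expression: names the two parts of "Last, First".
MATCH = re.compile(r"^(?P<last>[^,]+),\s*(?P<first>.+)$")

# Hypothetical replace expression: assembles the named groups in a new order.
REPLACE = r"\g<first> \g<last>"

def tag_and_assemble(value):
    """Return the named groups found in value and the assembled output."""
    m = MATCH.match(value)
    if m is None:
        return None, None  # assumption: no match produces no output
    return m.groupdict(), m.expand(REPLACE)

groups, output = tag_and_assemble("Smith, John")
print(groups)  # {'last': 'Smith', 'first': 'John'}
print(output)  # John Smith
```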
If you want to use regular expressions to validate data then try the Regular Expression Transformation.
Quick Start Guide
Select a column. A new output column is created for each selected column; there is no option for in-place replacement of column values. One input column can be used to populate multiple output columns: just select the column again in the lower grid, using the Input Columns drop-down selector.
Amend the output column name and size as required. They default to the same as the input column selected.
Amend the behaviour as required; the default is Replace.
Amend the cascade option as required; the default is true.
Finally, enter your match and replace regular expressions.
Quick Sample #1
Parse an email address and extract the user and domain portions. Format as a web address passing the user portion as a URL parameter. This uses two match groups, user and host, which correspond to the text before the @ and after it respectively.
Behaviour is Emit and Cascade is false, as we only have a single match expression.
Match Expression: ^(?<user>[^@]+)@(?<host>.+)$
Replace Expression: http://www.${host}?user=${user}
Results
Sample Input: zheng0@adventure-works.com
Sample Output: http://www.adventure-works.com?user=zheng0
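The sample can be reproduced outside SSIS as a quick check of the expressions. The sketch below is a Python translation, not the transformation itself: Python needs (?P<user>...) in place of (?<user>...) and \g<host> in place of ${host}. The input address zheng0@adventure-works.com is the one implied by the sample output, the function name email_to_url is hypothetical, and returning None when nothing matches is an assumption about Emit behaviour.

```python
import re

# Match expression from the sample, using Python's named-group syntax.
MATCH = r"^(?P<user>[^@]+)@(?P<host>.+)$"

# Replace expression from the sample, using Python's group-reference syntax.
REPLACE = r"http://www.\g<host>?user=\g<user>"

def email_to_url(email):
    """Emit-style output: the result is built only from the replace expression."""
    m = re.match(MATCH, email)
    return m.expand(REPLACE) if m else None

print(email_to_url("zheng0@adventure-works.com"))
# http://www.adventure-works.com?user=zheng0
```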
The component is provided as an MSI file; however, to complete the installation you will have to add the transformation to the Visual Studio toolbox manually. Right-click the toolbox, and select Choose Items.... Select the SSIS Data Flow Items tab, and then check the RegexClean Transformation from the list.
Downloads
The RegexClean Transformation is available for both SQL Server 2005 and SQL Server 2008. Please choose the version to match your SQL Server version, or you can install both versions and use them side by side if you have both SQL Server 2005 and SQL Server 2008 installed.
RegexClean Transformation for SQL Server 2005
RegexClean Transformation for SQL Server 2008
Version History
SQL Server 2005
Version 1.0.0.105 - Public Release
(28 Jan 2008)