SSIS - XML Source Script
- by simonsabin
The XML Source in SSIS is great if you have a 1 to 1
mapping between entity and table. You can do more complex mapping but it becomes
very messy and won't perform. What other options do you have?
The challenge with XML processing is to not need a huge amount of memory. I
remember using the early versions of Biztalk with loaded the whole document into
memory to map from one document type to another. This was fine for small
documents but was an absolute killer for large documents.
You therefore need a streaming approach.
For flexibility however you want to be able to generate your rows easily, and
if you've ever used the XmlReader you will know its ugly code to write.
That brings me on to LINQ. The is an implementation of LINQ over XML which is
really nice. You can write nice LINQ queries instead of the XMLReader stuff. The
downside is that by default LINQ to XML requires a whole XML document to work
with. No streaming.
Your code would look like this. We create an XDocument and then enumerate
over a set of annoymous types we generate from our LINQ statement
XDocument x =
XDocument.Load("C:\\TEMP\\CustomerOrders-Attribute.xml");
foreach (var xdata in (from customer in
x.Elements("OrderInterface").Elements("Customer")
from order in customer.Elements("Orders").Elements("Order")
select new { Account = customer.Attribute("AccountNumber").Value
,
OrderDate = order.Attribute("OrderDate").Value }
))
{
Output0Buffer.AddRow();
Output0Buffer.AccountNumber
= xdata.Account;
Output0Buffer.OrderDate =
Convert.ToDateTime(xdata.OrderDate);
}
As I said the downside to this is that you are loading the whole document
into memory.
I did some googling and came across some helpful videos from a nice UK DPE
Mike Taulty http://www.microsoft.com/uk/msdn/screencasts/screencast/289/LINQ-to-XML-Streaming-In-Large-Documents.aspx.
Which show you how you can combine LINQ and the XmlReader to get a semi
streaming approach. I took what he did and implemented it in SSIS. What I found
odd was that when I ran it I got different numbers between using the
streamed and non streamed versions. I found the cause was a little bug in Mikes
code that causes the pointer in the XmlReader to progress past the start of the
element and thus
foreach (var xdata in (from customer in
StreamReader("C:\\TEMP\\CustomerOrders-Attribute.xml","Customer")
from order in customer.Elements("Orders").Elements("Order")
select new { Account = customer.Attribute("AccountNumber").Value
, OrderDate = order.Attribute("OrderDate").Value }
))
{
Output0Buffer.AddRow();
Output0Buffer.AccountNumber = xdata.Account;
Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);
}
These look very similiar and they are the key element is the method we are
calling, StreamReader. This method is what gives us streaming, what it does is
return a enumerable list of elements, because of the way that LINQ works this
results in the data being streamed in.
static IEnumerable<XElement> StreamReader(String filename, string elementName)
{
using (XmlReader
xr = XmlReader.Create(filename))
{
xr.MoveToContent();
while (xr.Read()) //Reads the first
element
{
while (xr.NodeType == XmlNodeType.Element && xr.Name ==
elementName)
{
XElement node = (XElement)XElement.ReadFrom(xr);
yield return node;
}
}
xr.Close();
}
}
This code is specifically designed to return a list of the elements with a
specific name. The first Read reads the root element and then the inner while
loop checks to see if the current element is the type we want. If not we do the
xr.Read() again until we find the element type we want. We then use the neat
function XElement.ReadFrom to read an element and all its sub elements into an
XElement. This is what is returned and can be consumed by the LINQ statement.
Essentially once one element has been read we need to check if we are still on
the same element type and name (the inner loop) This was Mikes mistake, if we
called .Read again we would advance the XmlReader beyond the start of the
Element and so the ReadFrom method wouldn't work.
So with the code above you can use what ever LINQ statement you like to
flatten your XML into the rowsets you want. You could even have multiple outputs
and generate your own surrogate keys.