SSIS - XML Source Script
- by simonsabin
The XML Source in SSIS is great if you have a 1 to 1 
mapping between entity and table. You can do more complex mapping but it becomes 
very messy and won't perform. What other options do you have?
The challenge with XML processing is to not need a huge amount of memory. I 
remember using the early versions of Biztalk with loaded the whole document into 
memory to map from one document type to another. This was fine for small 
documents but was an absolute killer for large documents.
You therefore need a streaming approach.
For flexibility however you want to be able to generate your rows easily, and 
if you've ever used the XmlReader you will know its ugly code to write.
That brings me on to LINQ. The is an implementation of LINQ over XML which is 
really nice. You can write nice LINQ queries instead of the XMLReader stuff. The 
downside is that by default LINQ to XML requires a whole XML document to work 
with. No streaming.
Your code would look like this. We create an XDocument and then enumerate 
over a set of annoymous types we generate from our LINQ statement
XDocument x = 
XDocument.Load("C:\\TEMP\\CustomerOrders-Attribute.xml");
 
foreach (var xdata in (from customer in 
x.Elements("OrderInterface").Elements("Customer")
                       
from order in customer.Elements("Orders").Elements("Order")
                       
select new { Account = customer.Attribute("AccountNumber").Value
                                  , 
OrderDate = order.Attribute("OrderDate").Value }
                       
))
{
    
Output0Buffer.AddRow();
    Output0Buffer.AccountNumber 
= xdata.Account;
    Output0Buffer.OrderDate = 
Convert.ToDateTime(xdata.OrderDate);
}
As I said the downside to this is that you are loading the whole document 
into memory.
I did some googling and came across some helpful videos from a nice UK DPE 
Mike Taulty http://www.microsoft.com/uk/msdn/screencasts/screencast/289/LINQ-to-XML-Streaming-In-Large-Documents.aspx. 
Which show you how you can combine LINQ and the XmlReader to get a semi 
streaming approach. I took what he did and implemented it in SSIS. What I found 
odd was that when I ran it I got different numbers between using the 
streamed and non streamed versions. I found the cause was a little bug in Mikes 
code that causes the pointer in the XmlReader to progress past the start of the 
element and thus
foreach (var xdata in (from customer in 
StreamReader("C:\\TEMP\\CustomerOrders-Attribute.xml","Customer")
                               
from order in customer.Elements("Orders").Elements("Order")
                               
select new { Account = customer.Attribute("AccountNumber").Value
                                          
, OrderDate = order.Attribute("OrderDate").Value }
                               
))
        
{
            
Output0Buffer.AddRow();
            
Output0Buffer.AccountNumber = xdata.Account;
            
Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);
        
}
These look very similiar and they are the key element is the method we are 
calling, StreamReader. This method is what gives us streaming, what it does is 
return a enumerable list of elements, because of the way that LINQ works this 
results in the data being streamed in.
static IEnumerable<XElement> StreamReader(String filename, string elementName)
{
    using (XmlReader 
xr = XmlReader.Create(filename))
    {
        
xr.MoveToContent();
        
while (xr.Read()) //Reads the first 
element
        
{
            
while (xr.NodeType == XmlNodeType.Element && xr.Name == 
elementName)
            
{
                
XElement node = (XElement)XElement.ReadFrom(xr);
 
                
yield return node;
            
}
        
}
        
xr.Close();
    }
}
This code is specifically designed to return a list of the elements with a 
specific name. The first Read reads the root element and then the inner while 
loop checks to see if the current element is the type we want. If not we do the 
xr.Read() again until we find the element type we want. We then use the neat 
function XElement.ReadFrom to read an element and all its sub elements into an 
XElement. This is what is returned and can be consumed by the LINQ statement. 
Essentially once one element has been read we need to check if we are still on 
the same element type and name (the inner loop) This was Mikes mistake, if we 
called .Read again we would advance the XmlReader beyond the start of the 
Element and so the ReadFrom method wouldn't work.
So with the code above you can use what ever LINQ statement you like to 
flatten your XML into the rowsets you want. You could even have multiple outputs 
and generate your own surrogate keys.