Separating text strings into a table of individual words in SQL via XML.
- by Phil Factor
p.MsoNormal
{margin-top:0cm;
margin-right:0cm;
margin-bottom:10.0pt;
margin-left:0cm;
line-height:115%;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
}
Nearly nine years ago, Mike Rorke of the SQL Server 2005 XML team blogged ‘Querying Over Constructed XML
Using Sub-queries’. I remember reading it at the time without being able to think of a use for what he was
demonstrating. Just a few weeks ago, whilst preparing my article on searching strings, I got out my trusty function for
splitting strings into words and something reminded me of the old blog.
I’d been trying to think of a way of using XML to split strings reliably into words. The routine I
devised turned out to be slightly slower than the iterative word chop I’ve always used in the past, so I didn’t publish
it. It was then I suddenly remembered the old routine. Here is my version of it.
I’ve unwrapped it from its obvious home in a function or procedure just so it is easy to
appreciate. What it does is to chop a text string into individual words using XQuery and the good old nodes() method.
I’ve benchmarked it and it is quicker than any of the SQL ways of doing it that I know about. Obviously, you can’t use
the trick I described here to do it, because it is awkward to use REPLACE() on 1…n characters of whitespace. I’ll
carry on using my iterative function since it is able to tell me the location of each word as a character-offset from
the start, and also because this method leaves punctuation in (removing it takes time!). However, I can see other uses
for this in passing lists as input or output parameters, or as return values.
if
exists (Select
* from sys.xml_schema_collections
where name like 'WordList')
drop
XML SCHEMA COLLECTION
WordList
go
create
xml schema collection
WordList as '
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="words">
<xs:simpleType>
<xs:list itemType="xs:string"
/>
</xs:simpleType>
</xs:element>
</xs:schema>'
go
DECLARE
@string VARCHAR(MAX)
–we'll get some sample data from the great Ogden Nash
Select
@String='This is a song to celebrate banks,
Because they are full of money and you go into them and all
you hear is clinks and clanks,
Or maybe a sound like the wind in the trees on the hills,
Which is the rustling of the thousand dollar bills.
Most bankers dwell in marble halls,
Which they get to dwell in because they encourage deposits
and discourage withdrawals,
And particularly because they all observe one rule which woe
betides the banker who fails to heed it,
Which is you must never lend any money to anybody unless
they don''t need it.
I know you, you cautious conservative banks!
If people are worried about their rent it is your duty to deny
them the loan of one nickel, yes, even one copper engraving
of the martyred son of the late Nancy Hanks;
Yes, if they request fifty dollars to pay for a baby you must
look at them like Tarzan looking at an uppity ape in the
jungle,
And tell them what do they think a bank is, anyhow, they had
better go get the money from their wife''s aunt or ungle.
But suppose people come in and they have a million and they
want another million to pile on top of it,
Why, you brim with the milk of human kindness and you
urge them to accept every drop of it,
And you lend them the million so then they have two million
and this gives them the idea that they would be better off
with four,
So they already have two million as security so you have no
hesitation in lending them two more,
And all the vice-presidents nod their heads in rhythm,
And the only question asked is do the borrowers want the
money sent or do they want to take it withm.
Because I think they deserve our appreciation and thanks,
the jackasses who go around saying that health and happi-
ness are everything and money isn''t essential,
Because as soon as they have to borrow some unimportant
money to maintain their health and happiness they starve
to death so they can''t go around any more sneering at good
old money, which is nothing short of providential. '
–we now turn it into XML
declare
@xml_data xml(WordList)
set
@xml_data='<words>'+
replace(@string,'&',
'&')+'</words>'
select
T.ref.value('.',
'nvarchar(100)')
from
(Select @xml_data.query('
for $i in data(/words) return
element li { $i }
')) A(list)
cross
apply A.List.nodes('/li')
T(ref)
…which gives (truncated, of course)…