Parsing a UTF-16 encoded xml file in ruby with REXML
- by Matthew Toohey
Hello, I'm trying to parse the following UTF-16 encoded xml file in REXML:
http://www.abc.net.au/triplej/feeds/playout/triplejsydneyplayout.xml?_523525
REXML encounters an error after the following:
>> require 'rexml/document'
=> true
>> include REXML
=> Object
>> require 'net/http'
=> true
>> triplejString = Net::HTTP.get('www.abc.net.au', '/triplej/feeds/playout/triplejsydneyplayout.xml?_523525')
=> "\377\376<\000?\000x\000m\000l\000 \000v\000e\000r\000s\000i\000o\000n\000=\000\"\0001\000.\0000\000\"\000 \000e\000n\000c\000o\000d\000i\000n\000g\000=\000\"\000u\000t\000f\000-\0001\0006\000\"\000?\000>\000<\000a\000b\000c\000m\000u\000s\000i\000c\000_\000p\000l\000a\000y\000o\000u\000t\000>\000<\000c\000h\000a\000n\000n\000e\000l\000>\000J\000J\000J\000<\000/\000c\000h\000a\000n\000n\000e\000l\000>\000<\000p\000u\000b\000l\000i\000s\000h\000t\000i\000m\000e\000>\000F\000r\000i\000,\000 \0003\0000\000 \000A\000p\000r\000 \0002\0000\0001\0000\000 \0001\0001\000:\0005\0007\000:\0001\0007\000 \000G\000M\000T\000<\000/\000p\000u\000b\000l\000i\000s\000h\000t\000i\000m\000e\000>\000<\000i\000t\000e\000m\000s\000>\000<\000i\000t\000e\000m\000>\000<\000p\000l\000a\000y\000i\000n\000g\000>\000n\000o\000w\000<\000/\000p\000l\000a\000y\000i\000n\000g\000>\000<\000t\000i\000t\000l\000e\000>\000D\000o\000c\000t\000o\000r\000,\000 \000D\000o\000c\000t\000o\000r\000<\000/\000t\000i\000t\000l\000e\000>\000<\000t\000r\000a\000c\000k\000i\000d\000>\000<\000/\000t\000r\000a\000c\000k\000i\000d\000>\000<\000p\000l\000a\000y\000e\000d\000t\000i\000m\000e\000>\000F\000r\000i\000,\000 \0003\0000\000 \000A\000p\000r\000 \0002\0000\0001\0000\000 \0001\0001\000:\0005\0007\000:\0001\0007\000 \000G\000M\000T\000<\000/\000p\000l\000a\000y\000e\000d\000t\000i\000m\000e\000>\000<\000p\000u\000b\000l\000i\000s\000h\000e\000r\000>\000<\000/\000p\000u\000b\000l\000i\000s\000h\000e\000r\000>\000<\000d\000a\000t\000e\000c\000o\000p\000y\000r\000i\000g\000h\000t\000e\000d\000>\0002\0000\0000\0003\000<\000/\000d\000a\000t\000e\000c\000o\000p\000y\000r\000i\000g\000h\000t\000e\000d\000>\000<\000d\000u\000r\000a\000t\000i\000o\000n\000>\0001\0006\0003\000<\000/\000d\000u\000r\000a\000t\000i\000o\000n\000>\000<\000a\000u\000s\000t\000>\000N\000o\000<\000/\000a\000u\000s\000t\000>\000<\000t\000r\000a\000c\000k\000n\000o\000t\000e\000>\000<\000/\000t\000r\000a\000c\000k\000n\000o\000t\000e\000>\000<\000t\000r\000a\000c\000k\000l\000i\000n\000k\000>\000<\000/\000t\000r\000a\000c\000k\000l\000i\000n\000k\000>\000<\000s\000h\000o\000w\000>\000<\000/\000s\000h\000o\000w\000>\000<\000t\000a\000l\000e\000n\000t\000>\000<\000/\000t\000a\000l\000e\000n\000t\000>\000<\000a\000l\000b\000u\000m\000>\000<\000a\000l\000b\000u\000m\000n\000a\000m\000e\000>\000D\000r\000i\000v\000i\000n\000g\000 \000F\000o\000r\000 \000T\000h\000e\000 \000S\000t\000o\000r\000m\000/\000D\000o\000c\000t\000o\000r\000 \000D\000o\000c\000t\000o\000r\000<\000/\000a\000l\000b\000u\000m\000n\000a\000m\000e\000>\000<\000a\000l\000b\000u\000m\000i\000d\000>\0008\0003\000-\0004\0002\0002\0006\0009\000<\000/\000a\000l\000b\000u\000m\000i\000d\000>\000<\000a\000l\000b\000u\000m\000i\000m\000a\000g\000e\000>\000h\000t\000t\000p\000:\000/\000/\000w\000w\000w\000.\000a\000b\000c\000.\000n\000e\000t\000.\000a\000u\000/\000t\000r\000i\000p\000l\000e\000j\000/\000c\000o\000v\000e\000r\000s\000/\000G\000y\000r\000o\000s\000c\000o\000p\000e\000 \000-\000 \000D\000r\000i\000v\000i\000n\000g\000 \000F\000o\000r\000 \000T\000h\000e\000 \000S\000t\000o\000r\000m\000/\000D\000o\000c\000t\000o\000r\000 \000D\000o\000c\000t\000o\000r\000 \000(\0002\0000\0000\0003\000)\000.\000j\000p\000g\000<\000/\000a\000l\000b\000u\000m\000i\000m\000a\000g\000e\000>\000<\000/\000a\000l\000b\000u\000m\000>\000<\000a\000r\000t\000i\000s\000t\000>\000<\000a\000r\000t\000i\000s\000t\000n\000a\000m\000e\000>\000G\000y\000r\000o\000s\000c\000o\000p\000e\000<\000/\000a\000r\000t\000i\000s\000t\000n\000a\000m\000e\000>\000<\000a\000r\000t\000i\000s\000t\000i\000d\000>\000<\000/\000a\000r\000t\000i\000s\000t\000i\000d\000>\000<\000a\000r\000t\000i\000s\000t\000n\000o\000t\000e\000>\000<\000/\000a\000r\000t\000i\000s\000t\000n\000o\000t\000e\000>\000<\000a\000r\000t\000i\000s\000t\000l\000i\000n\000k\000>\000<\000/\000a\000r\000t\000i\000s\000t\000l\000i\000n\000k\000>\000<\000/\000a\000r\000t\000i\000s\000t\000>\000<\000/\000i\000t\000e\000m\000>\000<\000/\000i\000t\000e\000m\000s\000>\000<\000/\000a\000b\000c\000m\000u\000s\000i\000c\000_\000p\000l\000a\000y\000o\000u\000t\000>\000"
>> xmlDoc = REXML::Document.new(triplejString)
REXML::ParseException: #<REXML::ParseException: malformed XML: missing tag start
Line:
Position:
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16"?><a>
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/baseparser.rb:356:in `pull'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:22:in `parse'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:227:in `build'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
(irb):19:in `new'
(irb):19:in `irb_binding'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/irb/workspace.rb:52:in `irb_binding'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/irb/workspace.rb:52
...
malformed XML: missing tag start
Line:
Position:
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16"?><a
Line:
Position:
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16"?><a
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:227:in `build'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
from (irb):19:in `new'
from (irb):19
Any ideas?