Hi, trying to parse an XML with Beautifulsoup, but hit a brick wall when trying to use the "recursive" attribute with findall()
I have a pretty odd xml format shown below:
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
<catalog>true</catalog>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
<catalog>false</catalog>
</book>
</catalog>
As you can see, the catalog tag repeats inside the book tag, which causes an error when I try to to something like:
from BeautifulSoup import BeautifulStoneSoup as BSS
catalog = "catalog.xml"
def open_rss():
f = open(catalog, 'r')
return f.read()
def rss_parser():
rss_contents = open_rss()
soup = BSS(rss_contents)
items = soup.findAll('catalog', recursive=False)
for item in items:
print item.title.string
rss_parser()
As you will see, on my soup.findAll I've added recursive=false, which in theory would make it no recurse through the item found, but skip to the next one.
This doesn't seem to work, as I always get the following error:
File "catalog.py", line 17, in rss_parser
print item.title.string
AttributeError: 'NoneType' object has no attribute 'string'
I'm sure I'm doing something stupid here, and would appreciate if someone could give me some help on how to solve this problem.
Changing the HTML structure is not an option, this this code needs to perform well as it will potentially parse a large XML file.
Thanks in advance,
Marcos