BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

Posted by AP257 on Stack Overflow
Published on 2010-06-02T10:58:49Z Indexed on 2010/06/02 11:53 UTC

I'm trying to scrape the inner text of all the <p> elements in a web page using BeautifulSoup. Some of them contain nested tags, but I don't care about those; I just want the text inside.

For example, for:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

How can I extract:

Red
Blue
Yellow
Light green

Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.

Is there a 'just get the visible text' type of method in BeautifulSoup?

----UPDATE------

On advice, trying:

 p_tags = page.findAll('p', text=True)
 for p_tag in p_tags:
     print str(p_tag)

But that doesn't help - it just prints out:

Red
<i>Blue</i>
Yellow
Light <b>green</b>
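
For comparison, here is a minimal sketch of one way to get the desired output. It assumes the newer bs4 package (BeautifulSoup 4) rather than the version used in the snippet above, and it relies on get_text(), which joins all the text nodes under a tag regardless of nesting. The helper name paragraph_texts and the html sample string are only illustrative.

```python
from bs4 import BeautifulSoup  # assumption: the bs4 package (BeautifulSoup 4) is installed

# Sample markup taken from the question.
html = """
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""


def paragraph_texts(markup):
    """Return the text of every <p>, ignoring any tags nested inside it."""
    soup = BeautifulSoup(markup, "html.parser")
    # get_text() concatenates every descendant text node, so no inner
    # tag names have to be listed in advance.
    return [p.get_text() for p in soup.find_all("p")]


texts = paragraph_texts(html)
for text in texts:
    print(text)
```

With the older BeautifulSoup 3 API used in the question, joining the text nodes of each tag, e.g. ''.join(p_tag.findAll(text=True)) for each p_tag in page.findAll('p'), should give a similar result.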
