Extra characters extracted with XPath and Python (HTML)
Posted by Nacari on Stack Overflow
Published on 2010-05-25T22:47:14Z
Indexed on 2010/05/25 22:51 UTC
I have been using XPath with Scrapy to extract text from HTML tags online, but when I do, I get extra characters attached. For example, trying to extract a number like "204" from a <td> tag gives me [u'204']. In some cases it's much worse: trying to extract "1 - Mathoverflow" instead gives [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of the string? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick up that stuff?
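To make the two outputs above concrete: [u'204'] is how Python prints a list holding one unicode string, so the brackets, the u prefix, and the quotes are not part of the extracted text. In the second case the \r\n\t characters are whitespace that is really present in the page source, and \u2013 is an en dash. Below is a minimal sketch of trimming such values in plain Python, assuming the extraction step hands back a list of strings as shown in the question. The helper name first_clean and the two sample variables are hypothetical, introduced only for illustration; they are not part of Scrapy.

```python
# Hypothetical sample values, copied from the outputs quoted in the question.
# They stand in for the lists returned by the XPath extraction step.
extracted_number = [u'204']
extracted_title = [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']


def first_clean(values, default=u''):
    """Return the first non-empty string in `values`, stripped of
    leading and trailing whitespace, or `default` if there is none."""
    for value in values:
        cleaned = value.strip()  # removes spaces, tabs, \r and \n at both ends
        if cleaned:
            return cleaned
    return default


number = first_clean(extracted_number)
title = first_clean(extracted_title)

print(number)
print(title)
```

The cleaned strings could then be assigned to the item fields instead of the raw lists. As an alternative on the XPath side, XPath 1.0 defines a normalize-space() function that trims and collapses whitespace; whether it suits a given selector expression depends on the page structure, so treat that as something to try rather than a guaranteed fix.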
© Stack Overflow or respective owner