Problem with Eastern European characters when scraping data from the European Parliament's website

Posted by Thomas Jensen on Stack Overflow
Published on 2010-06-10T09:58:54Z

Dear Experts

I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians. However, because many Eastern European names contain accented characters, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name):

ANDRIKIENĖ, Laima Liucija
Group of the European People's Party (Christian Democrats)

So far I have been using pyparsing and the following code:

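    from pyparsing import Suppress, Word, ZeroOrMore, alphanums, alphas8bit

    # parser_names
    name = Word(alphanums + alphas8bit)
    begin, end = map(Suppress, "><")
    names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

    # page holds the downloaded HTML
    for name in names.searchString(page):
        print(name)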

However, this does not catch the name from the HTML above. Any advice on how to proceed?

Best, Thomas
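
A possible explanation, offered as a sketch rather than a confirmed diagnosis: pyparsing's alphas8bit only covers accented letters in the 8-bit Latin-1 range (U+00C0 to U+00FF), while a letter such as "Ė" is U+0116 and falls outside it, so Word(alphanums + alphas8bit) stops before that character. The example below uses only the standard re module and assumes Python 3, with the page already decoded to str. The sample markup, the WORD and NAME_RE patterns, and the extract_names helper are hypothetical illustrations, not the real page structure or a pyparsing API.

```python
import re

# Hypothetical sample standing in for the scraped page; the real markup may differ.
page = '<a href="#">ANDRIKIENĖ, Laima Liucija</a>'

# "Ė" is U+0116, beyond the Latin-1 range (U+00C0-U+00FF) covered by alphas8bit.
print(hex(ord("Ė")))

# [^\W\d_] matches any Unicode letter in a Python 3 str pattern.
# A word is a run of letters, optionally joined by hyphens or apostrophes.
WORD = r"[^\W\d_]+(?:[-'][^\W\d_]+)*"

# "FAMILY NAME, Given Names" enclosed between '>' and '<'.
NAME_RE = re.compile(
    rf">\s*({WORD}(?:\s+{WORD})*)\s*,\s*({WORD}(?:\s+{WORD})*)\s*<"
)


def extract_names(html):
    """Return (family_name, given_names) tuples found between '>' and '<'."""
    return NAME_RE.findall(html)


print(extract_names(page))
```

If the page arrives as bytes, it would need decoding first (for example with the encoding declared by the server), otherwise the accented letters will not match as letters.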

© Stack Overflow or respective owner
