Problem with Eastern European characters when scraping data from the European Parliament's website
Posted
by Thomas Jensen
on Stack Overflow
Published on 2010-06-10T09:58:54Z
Indexed on
2010/06/10
10:02 UTC
Dear Experts
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians. However, because many Eastern European names contain accented characters, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name):
ANDRIKIENĖ, Laima Liucija
Group of the European People's Party (Christian Democrats)
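So far I have been using pyparsing and the following code:

    # parser_names
    from pyparsing import Word, Suppress, ZeroOrMore, alphanums, alphas8bit

    # A name is a run of ASCII alphanumerics or Latin-1 (8-bit) letters
    name = Word(alphanums + alphas8bit)
    begin, end = map(Suppress, "><")
    names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

    for name in names.searchString(page):
        print(name)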
However, this does not catch the name from the HTML above. Any advice on how to proceed?
Best, Thomas
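
A possible direction, offered as a sketch under assumptions rather than a tested solution for the live site: the final letter of the family name in the example, Ė (U+0116), lies outside the Latin-1 range that alphas8bit covers, which would explain why the grammar skips it. The example below swaps pyparsing for the standard-library re module, whose \w class is Unicode-aware for Python 3 strings. The sample markup, the UTF-8 encoding, and the names SAMPLE_BYTES, LETTER, NAME_PATTERN and extract_names are all hypothetical and not taken from the original page.

```python
import re

# Hypothetical fragment standing in for the downloaded page; the real markup
# and its encoding (assumed UTF-8 here) are assumptions.
SAMPLE_BYTES = '<a href="#">ANDRIKIENĖ, Laima Liucija</a>'.encode("utf-8")

# Any Unicode letter: a word character that is not a digit or underscore.
LETTER = r"[^\W\d_]"

# Family name, a comma, then given names, all enclosed between '>' and '<'.
# Spaces, apostrophes and hyphens are allowed inside each part.
NAME_PATTERN = re.compile(
    rf">\s*((?:{LETTER}|[ '\-])+?)\s*,\s*((?:{LETTER}|[ '\-])+?)\s*<"
)


def extract_names(page):
    """Return (family_name, given_names) tuples found in a decoded page."""
    return [(m.group(1), m.group(2)) for m in NAME_PATTERN.finditer(page)]


# Decode the raw bytes to text first, so accented letters are single characters.
page = SAMPLE_BYTES.decode("utf-8")
print(extract_names(page))
```

The important step is decoding the downloaded bytes into text before matching. Once the page is a Unicode string, any letter class that is not limited to Latin-1 should pick up names such as the one above.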
© Stack Overflow or respective owner