Problem with Eastern European characters when scraping data from the European Parliament's website
- by Thomas Jensen
Dear Experts
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians. However, because of the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name):
ANDRIKIENĖ, Laima Liucija
Group of the European People's Party (Christian Democrats)
So far I have been using pyparsing and the following code:
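from pyparsing import Suppress, Word, ZeroOrMore, alphanums, alphas8bit

# parser_names: a name is a run of ASCII alphanumerics or Latin-1 letters
name = Word(alphanums + alphas8bit)
# names sit between the ">" closing one tag and the "<" opening the next
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
# page holds the downloaded HTML as a string
for match in names.searchString(page):
    print(match)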
However, this does not catch the name from the HTML above. Any advice on how to proceed?
Best, Thomas
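One possible direction, offered as a minimal sketch and not a definitive fix: as far as I know, pyparsing's alphas8bit only covers the Latin-1 accented letters (up to U+00FF), while "Ė" is U+0116, so Word(alphanums + alphas8bit) stops at that character. The sketch below swaps pyparsing for the standard library re module, whose \w is Unicode-aware for str patterns in Python 3. It assumes Python 3, that the page has already been decoded to a str, and that names appear as "FAMILY, Given" directly between two tags. The sample markup, the find_names helper, and the pattern are hypothetical, since the real HTML is not shown in the post.

```python
import re

# Hypothetical sample markup; "\u0116" is the letter E with a dot above.
page = (
    "<td>ANDRIKIEN\u0116, Laima Liucija</td>"
    "<td>Group of the European People's Party (Christian Democrats)</td>"
)

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which for str patterns in Python 3 is any Unicode letter.
LETTER = r"[^\W\d_]"
# A name is one or more runs of letters joined by a space, apostrophe or hyphen.
NAME = rf"{LETTER}+(?:[ '\-]{LETTER}+)*"
# Family name, a comma, then given names, all sitting between ">" and "<".
NAME_PATTERN = re.compile(rf">\s*({NAME}),\s*({NAME})\s*<")


def find_names(html):
    """Return (family_name, given_names) tuples found between tags."""
    return NAME_PATTERN.findall(html)


for family, given in find_names(page):
    print(family, "|", given)
```

On the sample above this should print the single pair "ANDRIKIENĖ | Laima Liucija" and skip the group name, because the group text contains no comma. If the page arrives as bytes, it would need to be decoded first (for example with the encoding declared by the server), otherwise the accented letters will not be recognised as letters.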