Problem with Eastern European characters when scraping data from the European Parliament's website

Posted by Thomas Jensen on Stack Overflow
Published on 2010-06-10T09:58:54Z Indexed on 2010/06/10 10:02 UTC

Filed under: python | html-parsing | scraping

Dear Experts,

I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians. However, because many Eastern European names contain accented characters, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name):

ANDRIKIENĖ, Laima Liucija
Group of the European People's Party (Christian Democrats)

So far I have been using pyparsing and the following code:

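    # parser_names
    from pyparsing import Word, ZeroOrMore, Suppress, alphanums, alphas8bit

    # A name is a run of ASCII alphanumerics plus pyparsing's 8-bit letters.
    name = Word(alphanums + alphas8bit)
    begin, end = map(Suppress, "><")
    names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

    # page holds the downloaded HTML of the members list.
    for match in names.searchString(page):
        print(match)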

However, this does not catch the name from the HTML above. Any advice on how to proceed?

Best, Thomas
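
One possible cause, offered as a guess because the actual page markup and encoding are not shown here: alphas8bit only covers the accented letters of the Latin-1 range, while a letter such as Ė (U+0116) lies outside it, so Word(alphanums + alphas8bit) stops before the final character. The sketch below swaps pyparsing for the standard-library re module in Python 3, where [^\W\d_] matches any Unicode letter. SAMPLE_HTML, NAME_WORD, NAME_PATTERN and extract_names are hypothetical names for illustration, and the UTF-8 decoding step is an assumption about the page.

```python
import re

# Hypothetical sample markup; the real page structure is not shown in the post.
SAMPLE_HTML = (
    '<a href="#">ANDRIKIENĖ, Laima Liucija</a><br/>'
    "Group of the European People's Party (Christian Democrats)"
)

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which in Python 3 str patterns covers any Unicode letter, including Ė.
# Hyphens and apostrophes are allowed inside a name word.
NAME_WORD = r"[^\W\d_]+(?:[-'][^\W\d_]+)*"

# "FAMILY NAME, Given Names" sitting between a closing '>' and the next '<'.
NAME_PATTERN = re.compile(
    rf">\s*({NAME_WORD}(?:\s+{NAME_WORD})*)\s*,\s*({NAME_WORD}(?:\s+{NAME_WORD})*)\s*<"
)


def extract_names(page):
    """Return (family_name, given_names) tuples found in the page."""
    # Assumption: raw bytes from the server are UTF-8 encoded.
    if isinstance(page, bytes):
        page = page.decode("utf-8")
    return [(m.group(1), m.group(2)) for m in NAME_PATTERN.finditer(page)]


names = extract_names(SAMPLE_HTML)
for family, given in names:
    print(family, "|", given)
```

The important step is decoding the downloaded bytes to text before matching, so that accented letters are single characters rather than multi-byte sequences. If staying with pyparsing is preferred, the same idea applies: build the character set for Word from a wider range of Unicode letters than alphas8bit provides.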

© Stack Overflow or respective owner
