What's the fastest way to strip and replace high Unicode characters in a document using Python?

Posted by Rhubarb on Stack Overflow
Published on 2010-05-18T02:29:38Z Indexed on 2010/05/18 2:50 UTC

Filed under: python | ascii | unicode | parse | text-processing

I want to replace all high Unicode characters in a large document, such as accented Es and left and right quotes, with "normal" counterparts in the low (ASCII) range, such as a regular 'E' and straight quotes. I need to do this on a very large document fairly often. I found an example of this, in what I think is Perl, here: http://www.designmeme.com/mtplugins/lowdown.txt

Is there a fast way to do this in Python without chaining s.replace(...).replace(...).replace(...)...? I tried that approach with just a few replacement characters, and stripping the document became really slow.
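For illustration, here is a minimal sketch of one common alternative to chained replace calls, assuming Python 3 and a document that is already decoded to str. It is not taken from the linked Perl script. The names PUNCT_TABLE and to_ascii are hypothetical, and the mapping covers only a handful of characters; a real table would need to be extended for the document at hand. The idea is to do all explicit substitutions in one pass with str.translate, then let unicodedata.normalize split accented letters into a base letter plus combining marks so the marks can be dropped.

```python
import unicodedata

# Hypothetical mapping for punctuation that has no Unicode decomposition.
# Extend it with whatever characters the document actually contains.
PUNCT_TABLE = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def to_ascii(text):
    """Return an ASCII-only approximation of text (illustrative helper)."""
    # One pass over the string for every explicit substitution.
    text = text.translate(PUNCT_TABLE)
    # NFKD splits accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII becomes "?".
    return stripped.encode("ascii", "replace").decode("ascii")


print(to_ascii("\u201cCaf\u00e9\u201d \u2013 \u00c9cole"))
```

Each step walks the string a fixed number of times regardless of how many characters are in the mapping, whereas every chained replace call makes its own pass over the whole document. Whether this is fast enough for a given document would need to be measured.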

© Stack Overflow or respective owner
