haidon - Developer IT

Set a script to automatically detect character encoding in a plain-text-file in Python?

- by Haidon

I've set up a script that basically does a large-scale find-and-replace on a plain text document. At the moment it works fine with ASCII, UTF-8, and UTF-16 (and possibly others, but I've only tested these three) encoded documents so long as the encoding is specified inside the script (the example code below specifies UTF-16). Is there a way to make the script automatically detect which of these character encodings is being used in the input file and automatically set the character encoding of the output file the same as the encoding used on the input file? findreplace = [ ('term1', 'term2'), ] inF = open(infile,'rb') s=unicode(inF.read(),'utf-16') inF.close() for couple in findreplace: outtext=s.replace(couple[0],couple[1]) s=outtext outF = open(outFile,'wb') outF.write(outtext.encode('utf-16')) outF.close() Thanks!

Read the article

Regular expressions in a Python find-and-replace script?

- by Haidon

I'm new to Python scripting, so please forgive me in advance if the answer to this question seems inherently obvious. I'm trying to put together a large-scale find-and-replace script using Python. I'm using code similar to the following: findreplace = [ ('term1', 'term2'), ] inF = open(infile,'rb') s=unicode(inF.read(),charenc) inF.close() for couple in findreplace: outtext=s.replace(couple[0],couple[1]) s=outtext outF = open(outFile,'wb') outF.write(outtext.encode('utf-8')) outF.close() How would I go about having the script do a find and replace for regular expressions? Specifically, I want it to find some information (metadata) specified at the top of a text file. Eg: Title: This is the title Author: This is the author Date: This is the date and convert it into LaTeX format. Eg: \title{This is the title} \author{This is the author} \date{This is the date} Maybe I'm tackling this the wrong way. If there's a better way than regular expressions please let me know! Thanks!

Read the article

Regular Expression for accurate word-count using JavaScript

- by Haidon

I'm trying to put together a regular expression for a JavaScript command that accurately counts the number of words in a textarea. One solution I had found is as follows: document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\b\w+\b/).length -1; But this doesn't count any non-Latin characters (eg: Cyrillic, Hangul, etc); it skips over them completely. Another one I put together: document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\s+/g).length -1; But this doesn't count accurately unless the document ends in a space character. If a space character is appended to the value being counted it counts 1 word even with an empty document. Furthermore, if the document begins with a space character an extraneous word is counted. Is there a regular expression I can put into this command that counts the words accurately, regardless of input method?

Read the article

Please help me optimize my Python code

- by Haidon

Beginner here! Forgive me in advance for raising what is probably an incredibly simple problem. I've been trying to put together a Python script that runs multiple find-and-replace actions and a few similar things on a specified plain-text file. It works, but from a programming perspective I doubt it works well. How would I best go about optimizing the actions made upon the 'outtext' variable? At the moment it's basically doing a very similar thing four times over... import binascii import re import struct import sys infile = sys.argv[1] charenc = sys.argv[2] outFile=infile+'.tex' findreplace = [ ('TERM1', 'TERM2'), ('TERM3', 'TERM4'), ('TERM5', 'TERM6'), ] inF = open(infile,'rb') s=unicode(inF.read(),charenc) inF.close() # THIS IS VERY MESSY. for couple in findreplace: outtext=s.replace(couple[0],couple[1]) s=outtext for couple in findreplace: outtext=re.compile('Title: (.*)', re.I).sub(r'\\title'+ r'{\1}', s) s=outtext for couple in findreplace: outtext=re.compile('Author: (.*)', re.I).sub(r'\\author'+ r'{\1}', s) s=outtext for couple in findreplace: outtext=re.compile('Date: (.*)', re.I).sub(r'\\date'+ r'{\1}', s) s=outtext # END MESSY SECTION. outF = open(outFile,'wb') outF.write(outtext.encode('utf-8')) outF.close()

Developer IT