Python: Speeding up retrieving data from an extremely large string
Posted by Burninghelix123 on Stack Overflow
Published on 2012-10-19T21:38:25Z
I have a list that I converted to a very, very long string, called tempString, which I am trying to edit. It works as of now, it just takes far too long to run, probably because it is several different regex subs. They are as follows:
tempString = ','.join(str(n) for n in coords)
# runs of 2-6 commas mark the gaps between sets; turn them into '_'
tempString = re.sub(',{2,6}', '_', tempString)
# anything that isn't a digit, '-', '.' or '_' becomes a comma
tempString = re.sub("[^0-9\-\.\_]", ",", tempString)
# collapse repeated commas into one
tempString = re.sub(',+', ',', tempString)
# pull out every group of three comma-separated numbers
clean1 = re.findall(('[-+]?[0-9]*\.?[0-9]+,[-+]?[0-9]*\.?[0-9]+,'
                     '[-+]?[0-9]*\.?[0-9]+'), tempString)
# join the triples with '_' and turn the commas inside each triple into spaces
tempString = '_'.join(str(n) for n in clean1)
tempString = re.sub(',', ' ', tempString)
Basically it's a long string containing commas and about 1-5 million sets of 4 floats/ints (a mixture of both is possible):
-5.65500020981,6.88999986649,-0.454999923706,1,,,-5.65500020981,6.95499992371,-0.454999923706,1,,,
I don't need/want the 4th number in each set; I'm essentially just trying to split the string into a list with 3 floats in each entry, separated by spaces.
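For the sample above, the result I'm after would look roughly like this (float formatting aside, just to illustrate the goal):

['-5.65500020981 6.88999986649 -0.454999923706',
 '-5.65500020981 6.95499992371 -0.454999923706']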
The regex code above works flawlessly, but as you can imagine it is quite time-consuming on large strings.
I have done a lot of research on here for a solution, but the answers all seem geared towards words, i.e. swapping out one word for another.
EDIT: OK, so this is the solution I'm currently using:
def getValues(s):
    output = []
    while s:
        # take the three values we want, discard the 4th value and the two
        # empty fields left by the ",,," separator, and keep the remainder
        v1, v2, v3, _, _, _, s = s.split(',', 6)
        output.append("%s %s %s" % (v1.strip(), v2.strip(), v3.strip()))
    return output

coords = getValues(tempString)
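A variation on the same idea would be to split the whole string just once and walk the fields in steps of six, instead of re-splitting the shrinking remainder on every pass (a rough, untested sketch; it assumes every set contributes four values plus the two empty fields from the ',,,' separator, exactly as in the sample above):

# Untested sketch: one split over the whole string, then keep the first
# three of every six fields (the 4th value and the two empty fields are
# skipped). The -1 ignores the trailing empty field left by the final ",,,".
fields = tempString.split(',')
output = []
for i in range(0, len(fields) - 1, 6):
    output.append(' '.join(fields[i:i + 3]))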
Does anyone have any advice to speed this up even further? After running some tests, it still takes much longer than I'm hoping for.
I've been glancing at NumPy, but I honestly have absolutely no idea how to do the above with it. I understand that after the above has been done and the values are cleaned up I could use them more efficiently with NumPy, but I'm not sure how NumPy could apply to the cleaning itself.
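For what it's worth, here is a rough, untested sketch of how NumPy might apply, assuming every set is exactly four values and the sets are separated by ',,,' exactly as in the sample above (names are illustrative):

import numpy as np

# Untested sketch: collapse the ",,," set separators to single commas,
# parse all the numbers in one pass, reshape into rows of four, and drop
# the unwanted 4th column.
flat = tempString.replace(',,,', ',').strip(',')
values = np.fromstring(flat, dtype=float, sep=',').reshape(-1, 4)
coords = values[:, :3]   # an (n, 3) float array instead of a list of strings

Keeping the result as an array, rather than re-joining it into strings, is presumably where the NumPy benefit would come from.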
Cleaning 50k sets with the above takes around 20 minutes; I can't imagine how long it would take on my full string of 1 million sets. It's just surprising that the program that originally exported the data took only around 30 seconds for the 1 million sets.
© Stack Overflow or respective owner