Search Results

Search found 13534 results on 542 pages for 'python 2 x'.

Page 156/542 | < Previous Page | 152 153 154 155 156 157 158 159 160 161 162 163  | Next Page >

  • Python: How best to parse a simple grammar?

    - by Rosarch
    Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale. I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them. From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.) Some sample inputs and outputs: "CS 2110" => ("CS", 2110) # 0 "CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1 "CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2 "MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3 If the entire description is just a course, it is output directly. If the courses are conjoined ("and"), they are all output in the same list If the course are disjoined ("or"), they are in separate lists Here, we have both "and" and "or". One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3. What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates: def p_course(p): 'course : DEPT_CODE COURSE_NUMBER' p[0] = (p[1], int(p[2])) With PyParse, it's less clear how to modify the output of parseString(). I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done. def addCourse(self, str, location, tokens): self.result.append((tokens[0][0], tokens[0][1])) def makeCourseList(self, str, location, tokens): dept = tokens[0][0] new_tokens = [(dept, tokens[0][1])] new_tokens.extend((dept, tok) for tok in tokens[1:]) self.result.append(new_tokens) For instance, to handle "or" cases: def __init__(self): self.result = [] # ... self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses) def disjunctionCourses(self, str, location, tokens): if len(tokens) == 1: return tokens print "disjunction tokens: %s" % tokens How does disjunctionCourses() know which smaller phrases to disjoin? All it gets is tokens, but what's been parsed so far is stored in result, so how can the function tell which data in result corresponds to which elements of token? I guess I could search through the tokens, then find an element of result with the same data, but that feel convoluted... What's a better way to approach this problem?

    Read the article

  • Python: Count lines and differentiate between them

    - by Mister X
    I'm using an application that gives a timed output based on how many times something is done in a minute, and I wish to manually take the output (copy paste) and have my program, and I wish to count how many times each minute it is done. An example output is this: 13:48 An event happened. 13:48 Another event happened. 13:49 A new event happened. 13:49 A random event happened. 13:49 An event happened. So, the program would need to understand that 2 things happened at 13:48, and 3 at 13:49. I'm not sure how the information would be stored, but I need to average them after, to determine an average of how often it happens. Sorry for being so complicated!

    Read the article

  • Python: How do sets work

    - by Guy
    I have a list of objects which I want to turn into a set. My objects contain a few fields that some of which are o.id and o.area. I want two objects to be equal if these two fields are the same. ie: o1==o2 if and only if o1.area==o2.area and o1.id==o2.id. I tried over-writing __eq__ and __cmp__ but I get the error: TypeError: unhashable instance. What should I over-write?

    Read the article

  • compute mean in python for a generator

    - by nmaxwell
    Hi, I'm doing some statistics work, I have a (large) collection of random numbers to compute the mean of, I'd like to work with generators, because I just need to compute the mean, so I don't need to store the numbers. The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this? It would be nice if I could say "sum(values)/len(values)", but len doesn't work for genetators, and sum already consumed values. here's an example: import numpy def my_mean(values): n = 0 Sum = 0.0 try: while True: Sum += next(values) n += 1 except StopIteration: pass return float(Sum)/n X = [k for k in range(1,7)] Y = (k for k in range(1,7)) print numpy.mean(X) print my_mean(Y) these both give the same, correct, answer, buy my_mean doesn't work for lists, and numpy.mean doesn't work for generators. I really like the idea of working with generators, but details like this seem to spoil things. thanks for any help -nick

    Read the article

  • Python: Created nested dictionary from list of paths

    - by sberry2A
    I have a list of tuples the looks similar to this (simplified here, there are over 14,000 of these tuples with more complicated paths than Obj.part) [ (Obj1.part1, {<SPEC>}), (Obj1.partN, {<SPEC>}), (ObjK.partN, {<SPEC>}) ] Where Obj goes from 1 - 1000, part from 0 - 2000. These "keys" all have a dictionary of specs associated with them which act as a lookup reference for inspecting another binary file. The specs dict contains information such as the bit offset, bit size, and C type of the data pointed to by the path ObjK.partN. For example: Obj4.part500 might have this spec, {'size':32, 'offset':128, 'type':'int'} which would let me know that to access Obj4.part500 in the binary file I must unpack 32 bits from offset 128. So, now I want to take my list of strings and create a nested dictionary which in the simplified case will look like this data = { 'Obj1' : {'part1':{spec}, 'partN':{spec} }, 'ObjK' : {'part1':{spec}, 'partN':{spec} } } To do this I am currently doing two things, 1. I am using a dotdict class to be able to use dot notation for dictionary get / set. That class looks like this: class dotdict(dict): def __getattr__(self, attr): return self.get(attr, None) __setattr__ = dict.__setitem__ __delattr__ = dict.__delitem__ The method for creating the nested "dotdict"s looks like this: def addPath(self, spec, parts, base): if len(parts) > 1: item = base.setdefault(parts[0], dotdict()) self.addPath(spec, parts[1:], item) else: item = base.setdefault(parts[0], spec) return base Then I just do something like: for path, spec in paths: self.lookup = dotdict() self.addPath(spec, path.split("."), self.lookup) So, in the end self.lookup.Obj4.part500 points to the spec. Is there a better (more pythonic) way to do this?

    Read the article

  • Using __str__ representation for printing objects in containers in Python

    - by BobDobbs
    I've noticed that when an instance with an overloaded str method is passed to the print() function as an argument, it prints as intended. However, when passing a container that contains one of those instances to print(), it uses the repr method instead. That is to say, print(x) displays the correct string representation of x, and print(x, y) works correctly, but print([x]) or print((x, y)) prints the repr representation instead. First off, why does this happen? Secondly, is there a way to correct that behavior of print() in this circumstance?

    Read the article

  • Find last match with python regular expression

    - by SDD
    I wanto to match the last occurence of a simple pattern in a string, e.g. list = re.findall(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2) print "last match: ", list[len(list)-1] however, if the string is very long, a huge list of matches is generated. Is there a more direct way to match the second occurence of "AAAA" or should I use this workaround?

    Read the article

  • Python - pickling fails for numpy.void objects

    - by I82Much
    >>> idmapfile = open("idmap", mode="w") >>> pickle.dump(idMap, idmapfile) >>> idmapfile.close() >>> idmapfile = open("idmap") >>> unpickled = pickle.load(idmapfile) >>> unpickled == idMap False idMap[1] {1537: (552, 1, 1537, 17.793827056884766, 3), 1540: (4220, 1, 1540, 19.31205940246582, 3), 1544: (592, 1, 1544, 18.129131317138672, 3), 1675: (529, 1, 1675, 18.347782135009766, 3), 1550: (4048, 1, 1550, 19.31205940246582, 3), 1424: (1528, 1, 1424, 19.744396209716797, 3), 1681: (1265, 1, 1681, 19.596025466918945, 3), 1560: (3457, 1, 1560, 20.530569076538086, 3), 1690: (477, 1, 1690, 17.395542144775391, 3), 1691: (554, 1, 1691, 13.446117401123047, 3), 1436: (3010, 1, 1436, 19.596025466918945, 3), 1434: (3183, 1, 1434, 19.744396209716797, 3), 1441: (3570, 1, 1441, 20.589576721191406, 3), 1435: (476, 1, 1435, 19.640911102294922, 3), 1444: (527, 1, 1444, 17.98480224609375, 3), 1478: (1897, 1, 1478, 19.596025466918945, 3), 1575: (614, 1, 1575, 19.371648788452148, 3), 1586: (2189, 1, 1586, 19.31205940246582, 3), 1716: (3470, 1, 1716, 19.158674240112305, 3), 1590: (2278, 1, 1590, 19.596025466918945, 3), 1463: (991, 1, 1463, 19.31205940246582, 3), 1594: (1890, 1, 1594, 19.596025466918945, 3), 1467: (1087, 1, 1467, 19.31205940246582, 3), 1596: (3759, 1, 1596, 19.744396209716797, 3), 1602: (3011, 1, 1602, 20.530569076538086, 3), 1547: (490, 1, 1547, 17.994071960449219, 3), 1605: (658, 1, 1605, 19.31205940246582, 3), 1606: (1794, 1, 1606, 16.964881896972656, 3), 1719: (1826, 1, 1719, 19.596025466918945, 3), 1617: (583, 1, 1617, 11.894925117492676, 3), 1492: (3441, 1, 1492, 20.500667572021484, 3), 1622: (3215, 1, 1622, 19.31205940246582, 3), 1628: (2761, 1, 1628, 19.744396209716797, 3), 1502: (1563, 1, 1502, 19.596025466918945, 3), 1632: (1108, 1, 1632, 15.457141876220703, 3), 1468: (3779, 1, 1468, 19.596025466918945, 3), 1642: (3970, 1, 1642, 19.744396209716797, 3), 1518: (612, 1, 1518, 18.570245742797852, 3), 1647: (854, 1, 1647, 16.964881896972656, 3), 1650: (2099, 1, 1650, 20.439058303833008, 3), 1651: (540, 1, 1651, 18.552841186523438, 3), 1653: (613, 1, 1653, 19.237197875976563, 3), 1532: (537, 1, 1532, 18.885730743408203, 3)} >>> unpickled[1] {1537: (64880, 1638, 56700, -1.0808743559293829e+18, 152), 1540: (64904, 1638, 0, 0.0, 0), 1544: (54472, 1490, 0, 0.0, 0), 1675: (6464, 1509, 0, 0.0, 0), 1550: (43592, 1510, 0, 0.0, 0), 1424: (43616, 1510, 0, 0.0, 0), 1681: (0, 0, 0, 0.0, 0), 1560: (400, 152, 400, 2.1299736657737219e-43, 0), 1690: (408, 152, 408, 2.7201111331839077e+26, 34), 1435: (424, 152, 61512, 1.0122952080313192e-39, 0), 1436: (400, 152, 400, 20.250289916992188, 3), 1434: (424, 152, 62080, 1.0122952080313192e-39, 0), 1441: (400, 152, 400, 12.250144958496094, 3), 1691: (424, 152, 42608, 15.813941955566406, 3), 1444: (400, 152, 400, 19.625289916992187, 3), 1606: (424, 152, 42432, 5.2947192852601414e-22, 41), 1575: (400, 152, 400, 6.2537390010262572e-36, 0), 1586: (424, 152, 42488, 1.0122601755697111e-39, 0), 1716: (400, 152, 400, 6.2537390010262572e-36, 0), 1590: (424, 152, 64144, 1.0126357235581501e-39, 0), 1463: (400, 152, 400, 6.2537390010262572e-36, 0), 1594: (424, 152, 32672, 17.002994537353516, 3), 1467: (400, 152, 400, 19.750289916992187, 3), 1596: (424, 152, 7176, 1.0124003054161436e-39, 0), 1602: (400, 152, 400, 18.500289916992188, 3), 1547: (424, 152, 7000, 1.0124003054161436e-39, 0), 1605: (400, 152, 400, 20.500289916992188, 3), 1478: (424, 152, 42256, -6.0222748507426518e+30, 222), 1719: (400, 152, 400, 6.2537390010262572e-36, 0), 1617: (424, 152, 16472, 1.0124283313854301e-39, 0), 1492: (400, 152, 400, 6.2537390010262572e-36, 0), 1622: (424, 152, 35304, 1.0123190301052127e-39, 0), 1628: (400, 152, 400, 6.2537390010262572e-36, 0), 1502: (424, 152, 63152, 19.627988815307617, 3), 1632: (400, 152, 400, 19.375289916992188, 3), 1468: (424, 152, 38088, 1.0124213248931084e-39, 0), 1642: (400, 152, 400, 6.2537390010262572e-36, 0), 1518: (424, 152, 63896, 1.0127436235399031e-39, 0), 1647: (400, 152, 400, 6.2537390010262572e-36, 0), 1650: (424, 152, 53424, 16.752857208251953, 3), 1651: (400, 152, 400, 19.250289916992188, 3), 1653: (424, 152, 50624, 1.0126497365427934e-39, 0), 1532: (400, 152, 400, 6.2537390010262572e-36, 0)} The keys come out fine, the values are screwed up. I tried same thing loading file in binary mode; didn't fix the problem. Any idea what I'm doing wrong? Edit: Here's the code with binary. Note that the values are different in the unpickled object. >>> idmapfile = open("idmap", mode="wb") >>> pickle.dump(idMap, idmapfile) >>> idmapfile.close() >>> idmapfile = open("idmap", mode="rb") >>> unpickled = pickle.load(idmapfile) >>> unpickled==idMap False >>> unpickled[1] {1537: (12176, 2281, 56700, -1.0808743559293829e+18, 152), 1540: (0, 0, 15934, 2.7457842047810522e+26, 108), 1544: (400, 152, 400, 4.9518498821046956e+27, 53), 1675: (408, 152, 408, 2.7201111331839077e+26, 34), 1550: (456, 152, 456, -1.1349175514578289e+18, 152), 1424: (432, 152, 432, 4.5939047815653343e-40, 11), 1681: (408, 152, 408, 2.1299736657737219e-43, 0), 1560: (376, 152, 376, 2.1299736657737219e-43, 0), 1690: (376, 152, 376, 2.1299736657737219e-43, 0), 1435: (376, 152, 376, 2.1299736657737219e-43, 0), 1436: (376, 152, 376, 2.1299736657737219e-43, 0), 1434: (376, 152, 376, 2.1299736657737219e-43, 0), 1441: (376, 152, 376, 2.1299736657737219e-43, 0), 1691: (376, 152, 376, 2.1299736657737219e-43, 0), 1444: (376, 152, 376, 2.1299736657737219e-43, 0), 1606: (25784, 2281, 376, -3.2883343074537754e+26, 34), 1575: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1586: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1716: (24240, 2281, 376, -3.0093091599657311e-35, 26), 1590: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1463: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1594: (24240, 2281, 376, -4123208450048.0, 196), 1467: (25784, 2281, 376, 2.1299736657737219e-43, 0), 1596: (25784, 2281, 376, 2.1299736657737219e-43, 0), 1602: (25784, 2281, 376, -5.9963281433905448e+26, 76), 1547: (25784, 2281, 376, -218106240.0, 139), 1605: (25784, 2281, 376, -3.7138649803377281e+27, 56), 1478: (376, 152, 376, 2.1299736657737219e-43, 0), 1719: (25784, 2281, 376, 2.1299736657737219e-43, 0), 1617: (25784, 2281, 376, -1.4411779941597184e+17, 237), 1492: (25784, 2281, 376, 2.8596493694487798e-30, 80), 1622: (25784, 2281, 376, 184686084096.0, 93), 1628: (1336, 152, 1336, 3.1691839245470052e+29, 179), 1502: (1272, 152, 1272, -5.2042207205116645e-17, 99), 1632: (1208, 152, 1208, 2.1299736657737219e-43, 0), 1468: (1144, 152, 1144, 2.1299736657737219e-43, 0), 1642: (1080, 152, 1080, 2.1299736657737219e-43, 0), 1518: (1016, 152, 1016, 4.0240902787680023e+35, 145), 1647: (952, 152, 952, -985172619034624.0, 237), 1650: (888, 152, 888, 12094787289088.0, 66), 1651: (824, 152, 824, 2.1299736657737219e-43, 0), 1653: (760, 152, 760, 0.00018310768064111471, 238), 1532: (696, 152, 696, 8.8978061885676389e+26, 125)} OK I've isolated the problem, but don't know why it's so. First, apparently what I'm pickling are not tuples (though they look like it), but instead numpy.void types. Here is a series to illustrate the problem. first = run0.detections[0] >>> first (1, 19, 1578, 82.637763977050781, 1) >>> type(first) <type 'numpy.void'> >>> firstTuple = tuple(first) >>> theFile = open("pickleTest", "w") >>> pickle.dump(first, theFile) >>> theTupleFile = open("pickleTupleTest", "w") >>> pickle.dump(firstTuple, theTupleFile) >>> theFile.close() >>> theTupleFile.close() >>> first (1, 19, 1578, 82.637763977050781, 1) >>> firstTuple (1, 19, 1578, 82.637764, 1) >>> theFile = open("pickleTest", "r") >>> theTupleFile = open("pickleTupleTest", "r") >>> unpickledTuple = pickle.load(theTupleFile) >>> unpickledVoid = pickle.load(theFile) >>> type(unpickledVoid) <type 'numpy.void'> >>> type(unpickledTuple) <type 'tuple'> >>> unpickledTuple (1, 19, 1578, 82.637764, 1) >>> unpickledTuple == firstTuple True >>> unpickledVoid == first False >>> unpickledVoid (7936, 1705, 56700, -1.0808743559293829e+18, 152) >>> first (1, 19, 1578, 82.637763977050781, 1)

    Read the article

  • Python 3.1 - Memory Error during sampling of a large list

    - by jimy
    The input list can be more than 1 million numbers. When I run the following code with smaller 'repeats', its fine; def sample(x): length = 1000000 new_array = random.sample((list(x)),length) return (new_array) def repeat_sample(x): i = 0 repeats = 100 list_of_samples = [] for i in range(repeats): list_of_samples.append(sample(x)) return(list_of_samples) repeat_sample(large_array) However, using high repeats such as the 100 above, results in MemoryError. Traceback is as follows; Traceback (most recent call last): File "C:\Python31\rnd.py", line 221, in <module> STORED_REPEAT_SAMPLE = repeat_sample(STORED_ARRAY) File "C:\Python31\rnd.py", line 129, in repeat_sample list_of_samples.append(sample(x)) File "C:\Python31\rnd.py", line 121, in sample new_array = random.sample((list(x)),length) File "C:\Python31\lib\random.py", line 309, in sample result = [None] * k MemoryError I am assuming I'm running out of memory. I do not know how to get around this problem. Thank you for your time!

    Read the article

  • Replacing python docstrings

    - by tomaz
    I have written a epytext to reST markup converter, and now I want to convert all the docstrings in my entire library from epytext to reST format. Is there a smart way to read the all the docstrings in a module and write back the replacements? ps: ast module perhaps?

    Read the article

  • Python recursion with list returns None

    - by newman
    def foo(a): a.append(1) if len(a) > 10: print a return a else: foo(a) Why this recursive function returns None (see transcript below)? I can't quite understand what I am doing wrong. In [263]: x = [] In [264]: y = foo(x) [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] In [265]: print y None

    Read the article

  • Using the AND and NOT Operator in Python

    - by NoahClark
    Here is my custom class that I have that represents a triangle. I'm trying to write code that checks to see if self.a, self.b, and self.c are greater than 0, which would mean that I have Angle, Angle, Angle. Below you will see the code that checks for A and B, however when I use just self.a != 0 then it works fine. I believe I'm not using & correctly. Any ideas? Here is how I am calling it: print myTri.detType() class Triangle: # Angle A To Angle C Connects Side F # Angle C to Angle B Connects Side D # Angle B to Angle A Connects Side E def __init__(self, a, b, c, d, e, f): self.a = a self.b = b self.c = c self.d = d self.e = e self.f = f def detType(self): #Triangle Type AAA if self.a != 0 & self.b != 0: return self.a #If self.a > 10: #return AAA #Triangle Type AAS #elif self.a = 0: #return AAS #Triangle Type ASA #Triangle Type SAS #Triangle Type SSS #else: #return unknown

    Read the article

  • [Tkinter/Python] Different line widths with canvas.create_line?

    - by Sam
    Does anyone have any idea why I get different line widths on the canvas in the following example? from Tkinter import * bigBoxSize = 150 class cFrame(Frame): def __init__(self, master, cwidth=450, cheight=450): Frame.__init__(self, master, relief=RAISED, height=550, width=600, bg = "grey") self.canvasWidth = cwidth self.canvasHeight = cheight self.canvas = Canvas(self, bg="white", width=cwidth, height=cheight, border =0) self.drawGridLines() self.canvas.pack(side=TOP, pady=20, padx=20) def drawGridLines(self, linewidth = 10): self.canvas.create_line(0, 0, self.canvasWidth, 0, width= linewidth ) self.canvas.create_line(0, 0, 0, self.canvasHeight, width= linewidth ) self.canvas.create_line(0, self.canvasHeight, self.canvasWidth + 2, self.canvasHeight, width= linewidth ) self.canvas.create_line(self.canvasWidth, self.canvasHeight, self.canvasWidth, 1, width= linewidth ) self.canvas.create_line(0, bigBoxSize, self.canvasWidth, bigBoxSize, width= linewidth ) self.canvas.create_line(0, bigBoxSize * 2, self.canvasWidth, bigBoxSize * 2, width= linewidth) root = Tk() C = cFrame(root) C.pack() root.mainloop() It's really frustrating me as I have no idea what's happening. If anyone can help me out then that'd be fantastic. Thanks!

    Read the article

  • Python - calendar.timegm() vs. time.mktime()

    - by ibz
    I seem to have a hard time getting my head around this. What's the difference between calendar.timegm() and time.mktime()? Say I have a datetime.datetime with no tzinfo attached, shouldn't the two give the same output? Don't they both give the number of seconds between epoch and the date passed as a parameter? And since the date passed has no tzinfo, isn't that number of seconds the same? >>> import calendar >>> import time >>> import datetime >>> d = datetime.datetime(2010, 10, 10) >>> calendar.timegm(d.timetuple()) 1286668800 >>> time.mktime(d.timetuple()) 1286640000.0 >>>

    Read the article

  • How to remove commas etc from a matrix in python

    - by robert
    say ive got a matrix that looks like: [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] how can i make it on seperate lines: [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] and then remove commas etc: 0 0 0 0 0 And also to make it blank instead of 0's, so that numbers can be put in later, so in the end it will be like: _ 1 2 _ 1 _ 1 (spaces not underscores) thanks

    Read the article

  • Dynamic variable name in python

    - by PhilGo20
    I'd like to call a query with a field name filter that I wont know before run time... Not sure how to construct the variable name ...Or maybe I am tired. field_name = funct() locations = Locations.objects.filter(field_name__lte=arg1) where if funct() returns name would equal to locations = Locations.objects.filter(name__lte=arg1) Not sure how to do that ...

    Read the article

  • how can i randomly print an element from a list in python

    - by lm
    So far i have this, which prints out every word in my list, but i am trying to print only one word at random. Any suggestions? def main(): # open a file wordsf = open('words.txt', 'r') word=random.choice('wordsf') words_count=0 for line in wordsf: word= line.rstrip('\n') print(word) words_count+=1 # close the file wordsf.close()

    Read the article

  • use/run python's 2to3 as or like a unittest

    - by Vincent
    I have used the 2to3 utility to convert code from the command line. What I would like to do is run it basically as a unittest. Even if it tests the file rather than parts(funtions, methods...) as would be normal for a unittest. It does not need to be a unittest and I don't what to automatically convert the files I just want to monitor the py3 compliance of files in a unittest like manor. I can't seem to find any documentation or examples for this. An example and/or documentation would be great. Thanks

    Read the article

  • Efficient file buffering & scanning methods for large files in python

    - by eblume
    The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it: What is the fastest (least execution time) way to split a text file in to ALL (overlapping) substrings of size N (bound N, eg 36) while throwing out newline characters. I am writing a module which parses files in the FASTA ascii-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like. As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module. Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do. My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) in to every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8 import cStringIO example_file = cStringIO.StringIO("""\ header CAGTcag TFgcACF """) for read in parse(example_file): ... print read ... CAGTCAGTF AGTCAGTFG GTCAGTFGC TCAGTFGCA CAGTFGCAC AGTFGCACF The function that I found had the absolute best performance from the methods I could think of is this: def parse(file): size = 8 # of course in my code this is a function argument file.readline() # skip past the header buffer = '' for line in file: buffer += line.rstrip().upper() while len(buffer) = size: yield buffer[:size] buffer = buffer[1:] This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community. Thanks! Note, this time includes a lot of extra calculation, such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size. Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!

    Read the article

  • Optimizing BeautifulSoup (Python) code

    - by user283405
    I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me with this? I am using BeautifulSoup for parsing and than save into a DB. If I comment out the save statement, it still takes a long time, so there is no problem with the database. def parse(self,text): soup = BeautifulSoup(text) arr = soup.findAll('tbody') for i in range(0,len(arr)-1): data=Data() soup2 = BeautifulSoup(str(arr[i])) arr2 = soup2.findAll('td') c=0 for j in arr2: if str(j).find("<a href=") > 0: data.sourceURL = self.getAttributeValue(str(j),'<a href="') else: if c == 2: data.Hits=j.renderContents() #and few others... c = c+1 data.save() Any suggestions? Note: I already ask this question here but that was closed due to incomplete information.

    Read the article

  • Python RegExp exception

    - by Jasie
    How do I split on all nonalphanumeric characters, EXCEPT the apostrophe? re.split('\W+',text) works, but will also split on apostrophes. How do I add an exception to this rule? Thanks!

    Read the article

< Previous Page | 152 153 154 155 156 157 158 159 160 161 162 163  | Next Page >