Search Results

Search found 14399 results on 576 pages for 'python noob'.

Page 158/576 | < Previous Page | 154 155 156 157 158 159 160 161 162 163 164 165  | Next Page >

  • Look for match in a nested list in Python

    - by elfuego1
    Hello everybody, I have two nested lists of different sizes: A = [[1, 7, 3, 5], [5, 5, 14, 10]] B = [[1, 17, 3, 5], [1487, 34, 14, 74], [1487, 34, 3, 87], [141, 25, 14, 10]] I'd like to gather all nested lists from list B if A[2:4] == B[2:4] and put it into list L: L = [[1, 17, 3, 5], [141, 25, 14, 10]] Additionally if the match occurs then I want to change last element of sublist B into first element of sublist A so the final solution would look like this: L1 = [[1, 17, 3, 1], [141, 25, 14, 5]]

    Read the article

  • compute mean in python for a generator

    - by nmaxwell
    Hi, I'm doing some statistics work, I have a (large) collection of random numbers to compute the mean of, I'd like to work with generators, because I just need to compute the mean, so I don't need to store the numbers. The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this? It would be nice if I could say "sum(values)/len(values)", but len doesn't work for genetators, and sum already consumed values. here's an example: import numpy def my_mean(values): n = 0 Sum = 0.0 try: while True: Sum += next(values) n += 1 except StopIteration: pass return float(Sum)/n X = [k for k in range(1,7)] Y = (k for k in range(1,7)) print numpy.mean(X) print my_mean(Y) these both give the same, correct, answer, buy my_mean doesn't work for lists, and numpy.mean doesn't work for generators. I really like the idea of working with generators, but details like this seem to spoil things. thanks for any help -nick

    Read the article

  • Extra characters Extracted with XPath and Python (html)

    - by Nacari
    I have been using XPath with scrapy to extract text from html tags online, but when I do I get extra characters attached. An example is trying to extract a number, like "204" from a <td> tag and getting [u'204']. In some cases its much worse. For instance trying to extract "1 - Mathoverflow" and instead getting [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or trim the strings so that the extra characters arent a part of the string? (using items to store the data). It looks like it has something to do with formatting, so how do I get xpath to not pick up that stuff?

    Read the article

  • Python unicode search not giving correct answer

    - by user1318912
    I am trying to search hindi words contained one line per file in file-1 and find them in lines in file-2. I have to print the line numbers with the number of words found. This is the code: import codecs hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines() words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines() count_arr = [] for counter, line in enumerate(hypernyms): count_arr.append(0) for word in words: if line.find(word) >=0: count_arr[counter] +=1 for iterator, count in enumerate(count_arr): if count>0: print iterator, ' ', count This is finding some words, but ignoring some others The input files are: File-1: ???? ??????? File-2: ???????, ????-???? ?????-???, ?????-???, ?????_???, ?????_??? ????_????, ????-????, ???????_???? ????-???? This gives output: 0 1 3 1 Clearly, it is ignoring ??????? and searching for ???? only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?

    Read the article

  • Python - pickling fails for numpy.void objects

    - by I82Much
    >>> idmapfile = open("idmap", mode="w") >>> pickle.dump(idMap, idmapfile) >>> idmapfile.close() >>> idmapfile = open("idmap") >>> unpickled = pickle.load(idmapfile) >>> unpickled == idMap False idMap[1] {1537: (552, 1, 1537, 17.793827056884766, 3), 1540: (4220, 1, 1540, 19.31205940246582, 3), 1544: (592, 1, 1544, 18.129131317138672, 3), 1675: (529, 1, 1675, 18.347782135009766, 3), 1550: (4048, 1, 1550, 19.31205940246582, 3), 1424: (1528, 1, 1424, 19.744396209716797, 3), 1681: (1265, 1, 1681, 19.596025466918945, 3), 1560: (3457, 1, 1560, 20.530569076538086, 3), 1690: (477, 1, 1690, 17.395542144775391, 3), 1691: (554, 1, 1691, 13.446117401123047, 3), 1436: (3010, 1, 1436, 19.596025466918945, 3), 1434: (3183, 1, 1434, 19.744396209716797, 3), 1441: (3570, 1, 1441, 20.589576721191406, 3), 1435: (476, 1, 1435, 19.640911102294922, 3), 1444: (527, 1, 1444, 17.98480224609375, 3), 1478: (1897, 1, 1478, 19.596025466918945, 3), 1575: (614, 1, 1575, 19.371648788452148, 3), 1586: (2189, 1, 1586, 19.31205940246582, 3), 1716: (3470, 1, 1716, 19.158674240112305, 3), 1590: (2278, 1, 1590, 19.596025466918945, 3), 1463: (991, 1, 1463, 19.31205940246582, 3), 1594: (1890, 1, 1594, 19.596025466918945, 3), 1467: (1087, 1, 1467, 19.31205940246582, 3), 1596: (3759, 1, 1596, 19.744396209716797, 3), 1602: (3011, 1, 1602, 20.530569076538086, 3), 1547: (490, 1, 1547, 17.994071960449219, 3), 1605: (658, 1, 1605, 19.31205940246582, 3), 1606: (1794, 1, 1606, 16.964881896972656, 3), 1719: (1826, 1, 1719, 19.596025466918945, 3), 1617: (583, 1, 1617, 11.894925117492676, 3), 1492: (3441, 1, 1492, 20.500667572021484, 3), 1622: (3215, 1, 1622, 19.31205940246582, 3), 1628: (2761, 1, 1628, 19.744396209716797, 3), 1502: (1563, 1, 1502, 19.596025466918945, 3), 1632: (1108, 1, 1632, 15.457141876220703, 3), 1468: (3779, 1, 1468, 19.596025466918945, 3), 1642: (3970, 1, 1642, 19.744396209716797, 3), 1518: (612, 1, 1518, 18.570245742797852, 3), 1647: (854, 1, 1647, 16.964881896972656, 3), 1650: (2099, 1, 1650, 20.439058303833008, 3), 1651: (540, 1, 1651, 18.552841186523438, 3), 1653: (613, 1, 1653, 19.237197875976563, 3), 1532: (537, 1, 1532, 18.885730743408203, 3)} >>> unpickled[1] {1537: (64880, 1638, 56700, -1.0808743559293829e+18, 152), 1540: (64904, 1638, 0, 0.0, 0), 1544: (54472, 1490, 0, 0.0, 0), 1675: (6464, 1509, 0, 0.0, 0), 1550: (43592, 1510, 0, 0.0, 0), 1424: (43616, 1510, 0, 0.0, 0), 1681: (0, 0, 0, 0.0, 0), 1560: (400, 152, 400, 2.1299736657737219e-43, 0), 1690: (408, 152, 408, 2.7201111331839077e+26, 34), 1435: (424, 152, 61512, 1.0122952080313192e-39, 0), 1436: (400, 152, 400, 20.250289916992188, 3), 1434: (424, 152, 62080, 1.0122952080313192e-39, 0), 1441: (400, 152, 400, 12.250144958496094, 3), 1691: (424, 152, 42608, 15.813941955566406, 3), 1444: (400, 152, 400, 19.625289916992187, 3), 1606: (424, 152, 42432, 5.2947192852601414e-22, 41), 1575: (400, 152, 400, 6.2537390010262572e-36, 0), 1586: (424, 152, 42488, 1.0122601755697111e-39, 0), 1716: (400, 152, 400, 6.2537390010262572e-36, 0), 1590: (424, 152, 64144, 1.0126357235581501e-39, 0), 1463: (400, 152, 400, 6.2537390010262572e-36, 0), 1594: (424, 152, 32672, 17.002994537353516, 3), 1467: (400, 152, 400, 19.750289916992187, 3), 1596: (424, 152, 7176, 1.0124003054161436e-39, 0), 1602: (400, 152, 400, 18.500289916992188, 3), 1547: (424, 152, 7000, 1.0124003054161436e-39, 0), 1605: (400, 152, 400, 20.500289916992188, 3), 1478: (424, 152, 42256, -6.0222748507426518e+30, 222), 1719: (400, 152, 400, 6.2537390010262572e-36, 0), 1617: (424, 152, 16472, 1.0124283313854301e-39, 0), 1492: (400, 152, 400, 6.2537390010262572e-36, 0), 1622: (424, 152, 35304, 1.0123190301052127e-39, 0), 1628: (400, 152, 400, 6.2537390010262572e-36, 0), 1502: (424, 152, 63152, 19.627988815307617, 3), 1632: (400, 152, 400, 19.375289916992188, 3), 1468: (424, 152, 38088, 1.0124213248931084e-39, 0), 1642: (400, 152, 400, 6.2537390010262572e-36, 0), 1518: (424, 152, 63896, 1.0127436235399031e-39, 0), 1647: (400, 152, 400, 6.2537390010262572e-36, 0), 1650: (424, 152, 53424, 16.752857208251953, 3), 1651: (400, 152, 400, 19.250289916992188, 3), 1653: (424, 152, 50624, 1.0126497365427934e-39, 0), 1532: (400, 152, 400, 6.2537390010262572e-36, 0)} The keys come out fine, the values are screwed up. I tried same thing loading file in binary mode; didn't fix the problem. Any idea what I'm doing wrong? Edit: Here's the code with binary. Note that the values are different in the unpickled object. >>> idmapfile = open("idmap", mode="wb") >>> pickle.dump(idMap, idmapfile) >>> idmapfile.close() >>> idmapfile = open("idmap", mode="rb") >>> unpickled = pickle.load(idmapfile) >>> unpickled==idMap False >>> unpickled[1] {1537: (12176, 2281, 56700, -1.0808743559293829e+18, 152), 1540: (0, 0, 15934, 2.7457842047810522e+26, 108), 1544: (400, 152, 400, 4.9518498821046956e+27, 53), 1675: (408, 152, 408, 2.7201111331839077e+26, 34), 1550: (456, 152, 456, -1.1349175514578289e+18, 152), 1424: (432, 152, 432, 4.5939047815653343e-40, 11), 1681: (408, 152, 408, 2.1299736657737219e-43, 0), 1560: (376, 152, 376, 2.1299736657737219e-43, 0), 1690: (376, 152, 376, 2.1299736657737219e-43, 0), 1435: (376, 152, 376, 2.1299736657737219e-43, 0), 1436: (376, 152, 376, 2.1299736657737219e-43, 0), 1434: (376, 152, 376, 2.1299736657737219e-43, 0), 1441: (376, 152, 376, 2.1299736657737219e-43, 0), 1691: (376, 152, 376, 2.1299736657737219e-43, 0), 1444: (376, 152, 376, 2.1299736657737219e-43, 0), 1606: (25784, 2281, 376, -3.2883343074537754e+26, 34), 1575: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1586: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1716: (24240, 2281, 376, -3.0093091599657311e-35, 26), 1590: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1463: (24240, 2281, 376, 2.1299736657737219e-43, 0), 1594: (24240, 2281, 376, -4123208450048.0, 196), 1467: (25784, 2281, 376, 2.1299736657737219e-43, 0), 1596: (25784, 2281, 376, 2.1299736657737219e-43, 0), 1602: (25784, 2281, 376, -5.9963281433905448e+26, 76), 1547: (25784, 2281, 376, -218106240.0, 139), 1605: (25784, 2281, 376, -3.7138649803377281e+27, 56), 1478: (376, 152, 376, 2.1299736657737219e-43, 0), 1719: (25784, 2281, 376, 2.1299736657737219e-43, 0), 1617: (25784, 2281, 376, -1.4411779941597184e+17, 237), 1492: (25784, 2281, 376, 2.8596493694487798e-30, 80), 1622: (25784, 2281, 376, 184686084096.0, 93), 1628: (1336, 152, 1336, 3.1691839245470052e+29, 179), 1502: (1272, 152, 1272, -5.2042207205116645e-17, 99), 1632: (1208, 152, 1208, 2.1299736657737219e-43, 0), 1468: (1144, 152, 1144, 2.1299736657737219e-43, 0), 1642: (1080, 152, 1080, 2.1299736657737219e-43, 0), 1518: (1016, 152, 1016, 4.0240902787680023e+35, 145), 1647: (952, 152, 952, -985172619034624.0, 237), 1650: (888, 152, 888, 12094787289088.0, 66), 1651: (824, 152, 824, 2.1299736657737219e-43, 0), 1653: (760, 152, 760, 0.00018310768064111471, 238), 1532: (696, 152, 696, 8.8978061885676389e+26, 125)} OK I've isolated the problem, but don't know why it's so. First, apparently what I'm pickling are not tuples (though they look like it), but instead numpy.void types. Here is a series to illustrate the problem. first = run0.detections[0] >>> first (1, 19, 1578, 82.637763977050781, 1) >>> type(first) <type 'numpy.void'> >>> firstTuple = tuple(first) >>> theFile = open("pickleTest", "w") >>> pickle.dump(first, theFile) >>> theTupleFile = open("pickleTupleTest", "w") >>> pickle.dump(firstTuple, theTupleFile) >>> theFile.close() >>> theTupleFile.close() >>> first (1, 19, 1578, 82.637763977050781, 1) >>> firstTuple (1, 19, 1578, 82.637764, 1) >>> theFile = open("pickleTest", "r") >>> theTupleFile = open("pickleTupleTest", "r") >>> unpickledTuple = pickle.load(theTupleFile) >>> unpickledVoid = pickle.load(theFile) >>> type(unpickledVoid) <type 'numpy.void'> >>> type(unpickledTuple) <type 'tuple'> >>> unpickledTuple (1, 19, 1578, 82.637764, 1) >>> unpickledTuple == firstTuple True >>> unpickledVoid == first False >>> unpickledVoid (7936, 1705, 56700, -1.0808743559293829e+18, 152) >>> first (1, 19, 1578, 82.637763977050781, 1)

    Read the article

  • Caching result of setUp() using Python unittest

    - by dbr
    I currently have a unittest.TestCase that looks like.. class test_appletrailer(unittest.TestCase): def setup(self): self.all_trailers = Trailers(res = "720", verbose = True) def test_has_trailers(self): self.failUnless(len(self.all_trailers) > 1) # ..more tests.. This works fine, but the Trailers() call takes about 2 seconds to run.. Given that setUp() is called before each test is run, the tests now take almost 10 seconds to run (with only 3 test functions) What is the correct way of caching the self.all_trailers variable between tests? Removing the setUp function, and doing.. class test_appletrailer(unittest.TestCase): all_trailers = Trailers(res = "720", verbose = True) ..works, but then it claims "Ran 3 tests in 0.000s" which is incorrect.. The only other way I could think of is to have a cache_trailers global variable (which works correctly, but is rather horrible): cache_trailers = None class test_appletrailer(unittest.TestCase): def setUp(self): global cache_trailers if cache_trailers is None: cache_trailers = self.all_trailers = all_trailers = Trailers(res = "720", verbose = True) else: self.all_trailers = cache_trailers

    Read the article

  • Optimizing python link matching regular expression

    - by Matt
    I have a regular expression, links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>',re.I).findall(data) to find links in some html, it is taking a long time on certain html, any optimization advice? One that it chokes on is http://freeyourmindonline.net/Blog/

    Read the article

  • python fdb save huge data from database to file

    - by peter
    I have this script SELECT = """ select coalesce (p.ID,'') as id, coalesce (p.name,'') as name, from TABLE as p """ self.cur.execute(SELECT) for row in self.cur.itermap(): xml +=" <item>\n" xml +=" <id>" + id + "</id>\n" xml +=" <name>" + name + "</name>\n" xml +=" </item>\n\n" #save xml to file here f = open... and I need to save data from huge database to file. There are 10 000s (up to 40000) of items in my database and it takes very long time when script runs (1 hour and more) until finish. How can I take data I need from database and save it to file "at once"? (as quick as possible? I don't need xml output because I can process data from output on my server later. I just need to do it as quickly as possible. Any idea?) Many thanks!

    Read the article

  • strip spaces in python.

    - by Richard
    ok I know that this should be simple... anyways say: line = "$W5M5A,100527,142500,730301c44892fd1c,2,686.5 4,333.96,0,0,28.6,123,75,-0.4,1.4*49" I want to strip out the spaces. I thought you would just do this line = line.strip() but now line is still '$W5M5A,100527,142500,730301c44892fd1c,2,686.5 4,333.96,0,0,28.6,123,75,-0.4,1.4*49' instead of '$W5M5A,100527,142500,730301c44892fd1c,2,686.54,333.96,0,0,28.6,123,75,-0.4,1.4*49' any thoughts?

    Read the article

  • Dynamic Operator Overloading on dict classes in Python

    - by Ishpeck
    I have a class that dynamically overloads basic arithmetic operators like so... import operator class IshyNum: def __init__(self, n): self.num=n self.buildArith() def arithmetic(self, other, o): return o(self.num, other) def buildArith(self): map(lambda o: setattr(self, "__%s__"%o,lambda f: self.arithmetic(f, getattr(operator, o))), ["add", "sub", "mul", "div"]) if __name__=="__main__": number=IshyNum(5) print number+5 print number/2 print number*3 print number-3 But if I change the class to inherit from the dictionary (class IshyNum(dict):) it doesn't work. I need to explicitly def __add__(self, other) or whatever in order for this to work. Why?

    Read the article

  • Python metaprogramming help

    - by Timmy
    im looking into mongoengine, and i wanted to make a class an "EmbeddedDocument" dynamically, so i do this def custom(cls): cls = type( cls.__name__, (EmbeddedDocument,), cls.__dict__.copy() ) cls.a = FloatField(required=True) cls.b = FloatField(required=True) return cls A = custom( A ) and tried it on some classes, but its not doing some of the base class's init or sumthing in BaseDocument def __init__(self, **values): self._data = {} # Assign initial values to instance for attr_name, attr_value in self._fields.items(): if attr_name in values: setattr(self, attr_name, values.pop(attr_name)) else: # Use default value if present value = getattr(self, attr_name, None) setattr(self, attr_name, value) but this never gets used, thus never setting ._data, and giving me errors. how do i do this?

    Read the article

  • Python: Taking an array and break it into subarrays based on some criteria

    - by randombits
    I have an array of files. I'd like to be able to break that array down into one array with multiple subarrays, each subarray contains files that were created on the same day. So right now if the array contains files from March 1 - March 31, I'd like to have an array with 31 subarrays (assuming there is at least 1 file for each day). In the long run, I'm trying to find the file from each day with the latest creation/modification time. If there is a way to bundle that into the iterations that are required above to save some CPU cycles, that would be even more ideal. Then I'd have one flat array with 31 files, one for each day, for the latest file created on each individual day.

    Read the article

  • supply inputs to python unittests

    - by zubin71
    I`m relatively new to the concept of unit-testing and have very little experience in the same. I have been looking at lots of articles on how to write unit-tests; however, I still have difficulty in writing tests where conditions like the following arise:- Test user Input. Test input read from a file. Test input read from an environment variable. Itd be great if someone could show me how to approach the above mentioned scenarios; itd still be awesome if you could point me to a few docs/articles/blog posts which I could read.

    Read the article

  • Optimizing python code performance when importing zipped csv to a mongo collection

    - by mark
    I need to import a zipped csv into a mongo collection, but there is a catch - every record contains a timestamp in Pacific Time, which must be converted to the local time corresponding to the (longitude,latitude) pair found in the same record. The code looks like so: def read_csv_zip(path, timezones): with ZipFile(path) as z, z.open(z.namelist()[0]) as input: csv_rows = csv.reader(input) header = csv_rows.next() check,converters = get_aux_stuff(header) for csv_row in csv_rows: if check(csv_row): row = { converter[0]:converter[1](value) for converter, value in zip(converters, csv_row) if allow_field(converter) } ts = row['ts'] lng, lat = row['loc'] found_tz_entry = timezones.find_one(SON({'loc': {'$within': {'$box': [[lng-tz_lookup_radius, lat-tz_lookup_radius],[lng+tz_lookup_radius, lat+tz_lookup_radius]]}}})) if found_tz_entry: tz_name = found_tz_entry['tz'] local_ts = ts.astimezone(timezone(tz_name)).replace(tzinfo=None) row['tz'] = tz_name else: local_ts = (ts.astimezone(utc) + timedelta(hours = int(lng/15))).replace(tzinfo = None) row['local_ts'] = local_ts yield row def insert_documents(collection, source, batch_size): while True: items = list(itertools.islice(source, batch_size)) if len(items) == 0: break; try: collection.insert(items) except: for item in items: try: collection.insert(item) except Exception as exc: print("Failed to insert record {0} - {1}".format(item['_id'], exc)) def main(zip_path): with Connection() as connection: data = connection.mydb.data timezones = connection.timezones.data insert_documents(data, read_csv_zip(zip_path, timezones), 1000) The code proceeds as follows: Every record read from the csv is checked and converted to a dictionary, where some fields may be skipped, some titles be renamed (from those appearing in the csv header), some values may be converted (to datetime, to integers, to floats. etc ...) For each record read from the csv, a lookup is made into the timezones collection to map the record location to the respective time zone. If the mapping is successful - that timezone is used to convert the record timestamp (pacific time) to the respective local timestamp. If no mapping is found - a rough approximation is calculated. The timezones collection is appropriately indexed, of course - calling explain() confirms it. The process is slow. Naturally, having to query the timezones collection for every record kills the performance. I am looking for advises on how to improve it. Thanks. EDIT The timezones collection contains 8176040 records, each containing four values: > db.data.findOne() { "_id" : 3038814, "loc" : [ 1.48333, 42.5 ], "tz" : "Europe/Andorra" } EDIT2 OK, I have compiled a release build of http://toblerity.github.com/rtree/ and configured the rtree package. Then I have created an rtree dat/idx pair of files corresponding to my timezones collection. So, instead of calling collection.find_one I call index.intersection. Surprisingly, not only there is no improvement, but it works even more slowly now! May be rtree could be fine tuned to load the entire dat/idx pair into RAM (704M), but I do not know how to do it. Until then, it is not an alternative. In general, I think the solution should involve parallelization of the task. EDIT3 Profile output when using collection.find_one: >>> p.sort_stats('cumulative').print_stats(10) Tue Apr 10 14:28:39 2012 ImportDataIntoMongo.profile 64549590 function calls (64549180 primitive calls) in 1231.257 seconds Ordered by: cumulative time List reduced from 730 to 10 due to restriction <10> ncalls tottime percall cumtime percall filename:lineno(function) 1 0.012 0.012 1231.257 1231.257 ImportDataIntoMongo.py:1(<module>) 1 0.001 0.001 1230.959 1230.959 ImportDataIntoMongo.py:187(main) 1 853.558 853.558 853.558 853.558 {raw_input} 1 0.598 0.598 370.510 370.510 ImportDataIntoMongo.py:165(insert_documents) 343407 9.965 0.000 359.034 0.001 ImportDataIntoMongo.py:137(read_csv_zip) 343408 2.927 0.000 287.035 0.001 c:\python27\lib\site-packages\pymongo\collection.py:489(find_one) 343408 1.842 0.000 274.803 0.001 c:\python27\lib\site-packages\pymongo\cursor.py:699(next) 343408 2.542 0.000 271.212 0.001 c:\python27\lib\site-packages\pymongo\cursor.py:644(_refresh) 343408 4.512 0.000 253.673 0.001 c:\python27\lib\site-packages\pymongo\cursor.py:605(__send_message) 343408 0.971 0.000 242.078 0.001 c:\python27\lib\site-packages\pymongo\connection.py:871(_send_message_with_response) Profile output when using index.intersection: >>> p.sort_stats('cumulative').print_stats(10) Wed Apr 11 16:21:31 2012 ImportDataIntoMongo.profile 41542960 function calls (41542536 primitive calls) in 2889.164 seconds Ordered by: cumulative time List reduced from 778 to 10 due to restriction <10> ncalls tottime percall cumtime percall filename:lineno(function) 1 0.028 0.028 2889.164 2889.164 ImportDataIntoMongo.py:1(<module>) 1 0.017 0.017 2888.679 2888.679 ImportDataIntoMongo.py:202(main) 1 2365.526 2365.526 2365.526 2365.526 {raw_input} 1 0.766 0.766 502.817 502.817 ImportDataIntoMongo.py:180(insert_documents) 343407 9.147 0.000 491.433 0.001 ImportDataIntoMongo.py:152(read_csv_zip) 343406 0.571 0.000 391.394 0.001 c:\python27\lib\site-packages\rtree-0.7.0-py2.7.egg\rtree\index.py:384(intersection) 343406 379.957 0.001 390.824 0.001 c:\python27\lib\site-packages\rtree-0.7.0-py2.7.egg\rtree\index.py:435(_intersection_obj) 686513 22.616 0.000 38.705 0.000 c:\python27\lib\site-packages\rtree-0.7.0-py2.7.egg\rtree\index.py:451(_get_objects) 343406 6.134 0.000 33.326 0.000 ImportDataIntoMongo.py:162(<dictcomp>) 346 0.396 0.001 30.665 0.089 c:\python27\lib\site-packages\pymongo\collection.py:240(insert) EDIT4 I have parallelized the code, but the results are still not very encouraging. I am convinced it could be done better. See my own answer to this question for details.

    Read the article

  • Efficient file buffering & scanning methods for large files in python

    - by eblume
    The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it: What is the fastest (least execution time) way to split a text file in to ALL (overlapping) substrings of size N (bound N, eg 36) while throwing out newline characters. I am writing a module which parses files in the FASTA ascii-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like. As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module. Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do. My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) in to every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8 import cStringIO example_file = cStringIO.StringIO("""\ header CAGTcag TFgcACF """) for read in parse(example_file): ... print read ... CAGTCAGTF AGTCAGTFG GTCAGTFGC TCAGTFGCA CAGTFGCAC AGTFGCACF The function that I found had the absolute best performance from the methods I could think of is this: def parse(file): size = 8 # of course in my code this is a function argument file.readline() # skip past the header buffer = '' for line in file: buffer += line.rstrip().upper() while len(buffer) = size: yield buffer[:size] buffer = buffer[1:] This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community. Thanks! Note, this time includes a lot of extra calculation, such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size. Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!

    Read the article

  • how can i randomly print an element from a list in python

    - by lm
    So far i have this, which prints out every word in my list, but i am trying to print only one word at random. Any suggestions? def main(): # open a file wordsf = open('words.txt', 'r') word=random.choice('wordsf') words_count=0 for line in wordsf: word= line.rstrip('\n') print(word) words_count+=1 # close the file wordsf.close()

    Read the article

  • Python - Subprocess Popen and Thread error

    - by n0idea
    In both functions record and ftp, i have subprocess.Popen if __name__ == '__main__': try: t1 = threading.Thread(target = record) t1.daemon = True t1.start() t2 = threading.Thread(target = ftp) t2.daemon = True t2.start() except (KeyboardInterrupt, SystemExit): sys.exit() The error I'm receiving is: Exception in thread Thread-1 (most likely raised during interpreter shutdown): Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner File "/usr/lib/python2.7/threading.py", line 504, in run File "./in.py", line 20, in recordaudio File "/usr/lib/python2.7/subprocess.py", line 493, in call File "/usr/lib/python2.7/subprocess.py", line 679, in __init__ File "/usr/lib/python2.7/subprocess.py", line 1237, in _execute_child <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'close' What might the issue be ?

    Read the article

  • Python: Unpack arbitary length bits for database storage

    - by sberry2A
    I have a binary data format consisting of 18,000+ packed int64s, ints, shorts, bytes and chars. The data is packed to minimize it's size, so they don't always use byte sized chunks. For example, a number whose min and max value are 31, 32 respectively might be stored with a single bit where the actual value is bitvalue + min, so 0 is 31 and 1 is 32. I am looking for the most efficient way to unpack all of these for subsequent processing and database storage. Right now I am able to read any value by using either struct.unpack, or BitBuffer. I use struct.unpack for any data that starts on a bit where (bit-offset % 8 == 0 and data-length % 8 == 0) and I use BitBuffer for anything else. I know the offset and size of every packed piece of data, so what is going to be the fasted way to completely unpack them? Many thanks.

    Read the article

  • python cairoplot store previous readings..

    - by krisdigitx
    hi, i am using cairoplot, to make graphs, however the file from where i am reading the data is growing huge and its taking a long time to process the graph is there any real-time way to produce cairo graph, or at least store the previous readings..like rrd. -krisdigitx

    Read the article

  • Python/YACC Lexer: Token priority?

    - by Rosarch
    I'm trying to use reserved words in my grammar: reserved = { 'if' : 'IF', 'then' : 'THEN', 'else' : 'ELSE', 'while' : 'WHILE', } tokens = [ 'DEPT_CODE', 'COURSE_NUMBER', 'OR_CONJ', 'ID', ] + list(reserved.values()) t_DEPT_CODE = r'[A-Z]{2,}' t_COURSE_NUMBER = r'[0-9]{4}' t_OR_CONJ = r'or' t_ignore = ' \t' def t_ID(t): r'[a-zA-Z_][a-zA-Z_0-9]*' if t.value in reserved.values(): t.type = reserved[t.value] return t return None However, the t_ID rule somehow swallows up DEPT_CODE and OR_CONJ. How can I get around this? I'd like those two to take higher precedence than the reserved words.

    Read the article

< Previous Page | 154 155 156 157 158 159 160 161 162 163 164 165  | Next Page >