Search Results

Search found 149 results on 6 pages for 'for comprehension'.

Page 3/6 | < Previous Page | 1 2 3 4 5 6  | Next Page >

  • Python: Improving long cumulative sum

    - by Bo102010
    I have a program that operates on a large set of experimental data. The data is stored as a list of objects that are instances of a class with the following attributes: time_point (the time of the sample), cluster (the name of the cluster of nodes from which the sample was taken), node (the name of the node from which the sample was taken), qty1 (the value of the sample for the first quantity), and qty2 (the value of the sample for the second quantity). I need to derive some values from the data set, grouped in three ways - once for the sample as a whole, once for each cluster of nodes, and once for each node. The values I need to derive depend on the (time sorted) cumulative sums of qty1 and qty2: the maximum value of the element-wise sum of the cumulative sums of qty1 and qty2, the time point at which that maximum value occurred, and the values of qty1 and qty2 at that time point. I came up with the following solution:

        dataset.sort(key=operator.attrgetter('time_point'))

        # For the whole set
        sys_qty1 = 0
        sys_qty2 = 0
        sys_combo = 0
        sys_max = 0

        # For the cluster grouping
        cluster_qty1 = defaultdict(int)
        cluster_qty2 = defaultdict(int)
        cluster_combo = defaultdict(int)
        cluster_max = defaultdict(int)
        cluster_peak = defaultdict(int)

        # For the node grouping
        node_qty1 = defaultdict(int)
        node_qty2 = defaultdict(int)
        node_combo = defaultdict(int)
        node_max = defaultdict(int)
        node_peak = defaultdict(int)

        for t in dataset:
            # For the whole system ##############################################
            sys_qty1 += t.qty1
            sys_qty2 += t.qty2
            sys_combo = sys_qty1 + sys_qty2
            if sys_combo > sys_max:
                sys_max = sys_combo
                # The Peak class is to record the time point and the cumulative quantities
                system_peak = Peak(time_point=t.time_point, qty1=sys_qty1, qty2=sys_qty2)

            # For the cluster grouping ##########################################
            cluster_qty1[t.cluster] += t.qty1
            cluster_qty2[t.cluster] += t.qty2
            cluster_combo[t.cluster] = cluster_qty1[t.cluster] + cluster_qty2[t.cluster]
            if cluster_combo[t.cluster] > cluster_max[t.cluster]:
                cluster_max[t.cluster] = cluster_combo[t.cluster]
                cluster_peak[t.cluster] = Peak(time_point=t.time_point,
                                               qty1=cluster_qty1[t.cluster],
                                               qty2=cluster_qty2[t.cluster])

            # For the node grouping #############################################
            node_qty1[t.node] += t.qty1
            node_qty2[t.node] += t.qty2
            node_combo[t.node] = node_qty1[t.node] + node_qty2[t.node]
            if node_combo[t.node] > node_max[t.node]:
                node_max[t.node] = node_combo[t.node]
                node_peak[t.node] = Peak(time_point=t.time_point,
                                         qty1=node_qty1[t.node],
                                         qty2=node_qty2[t.node])

    This produces the correct output, but I'm wondering if it can be made more readable/Pythonic, and/or faster/more scalable. The above is attractive in that it only loops through the (large) dataset once, but unattractive in that I've essentially copied/pasted three copies of the same algorithm.
    To avoid the copy/paste issues of the above, I tried this also:

        def find_peaks(level, dataset):

            def grouping(obj, attr_name):
                if attr_name == 'system':
                    return attr_name
                else:
                    return obj.__dict__[attr_name]

            cuml_qty1 = defaultdict(int)
            cuml_qty2 = defaultdict(int)
            cuml_combo = defaultdict(int)
            level_max = defaultdict(int)
            level_peak = defaultdict(int)

            for t in dataset:
                key = grouping(t, level)
                cuml_qty1[key] += t.qty1
                cuml_qty2[key] += t.qty2
                cuml_combo[key] = cuml_qty1[key] + cuml_qty2[key]
                if cuml_combo[key] > level_max[key]:
                    level_max[key] = cuml_combo[key]
                    level_peak[key] = Peak(time_point=t.time_point,
                                           qty1=cuml_qty1[key],
                                           qty2=cuml_qty2[key])
            return level_peak

        system_peak = find_peaks('system', dataset)
        cluster_peak = find_peaks('cluster', dataset)
        node_peak = find_peaks('node', dataset)

    For the (non-grouped) system-level calculations, I also came up with this, which is pretty:

        dataset.sort(key=operator.attrgetter('time_point'))

        def cuml_sum(seq):
            rseq = []
            t = 0
            for i in seq:
                t += i
                rseq.append(t)
            return rseq

        time_get = operator.attrgetter('time_point')
        q1_get = operator.attrgetter('qty1')
        q2_get = operator.attrgetter('qty2')

        timeline = [time_get(t) for t in dataset]
        cuml_qty1 = cuml_sum([q1_get(t) for t in dataset])
        cuml_qty2 = cuml_sum([q2_get(t) for t in dataset])
        cuml_combo = [q1 + q2 for q1, q2 in zip(cuml_qty1, cuml_qty2)]

        combo_max = max(cuml_combo)
        idx_at_max = cuml_combo.index(combo_max)
        time_max = timeline[idx_at_max]
        q1_at_max = cuml_qty1[idx_at_max]
        q2_at_max = cuml_qty2[idx_at_max]

    However, despite this version's cool use of list comprehensions and zip(), it loops through the dataset three times just for the system-level calculations, and I can't think of a good way to do the cluster-level and node-level calculations without doing something slow like:

        timeline = defaultdict(int)
        cuml_qty1 = defaultdict(int)
        # ...etc.

        for c in cluster_list:
            timeline[c] = [time_get(t) for t in dataset if t.cluster == c]
            cuml_qty1[c] = [q1_get(t) for t in dataset if t.cluster == c]
            # ...etc.

    Does anyone here at Stack Overflow have suggestions for improvements? The first snippet above runs well for my initial dataset (on the order of a million records), but later datasets will have more records and clusters/nodes, so scalability is a concern. This is my first non-trivial use of Python, and I want to make sure I'm taking proper advantage of the language (this is replacing a very convoluted set of SQL queries, and earlier versions of the Python version were essentially very inefficient straight translations of what that did). I don't normally do much programming, so I may be missing something elementary. Many thanks!
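
    One way to keep the single pass while writing the algorithm only once - a minimal sketch, assuming the Peak class, the sorted dataset, and the attribute names from the question - is to drive all three groupings from one loop with a key function per grouping:

        from collections import defaultdict

        def find_all_peaks(dataset, key_funcs):
            """Single pass over time-sorted data, one running total per grouping.

            key_funcs maps a grouping name to a function that extracts the
            group key from a sample; returns {grouping: {key: Peak}}.
            """
            totals = {name: defaultdict(lambda: [0, 0]) for name in key_funcs}
            best = {name: defaultdict(int) for name in key_funcs}
            peaks = {name: {} for name in key_funcs}
            for t in dataset:
                for name, key_of in key_funcs.items():
                    key = key_of(t)
                    tot = totals[name][key]
                    tot[0] += t.qty1
                    tot[1] += t.qty2
                    combo = tot[0] + tot[1]
                    if combo > best[name][key]:
                        best[name][key] = combo
                        peaks[name][key] = Peak(time_point=t.time_point,
                                                qty1=tot[0], qty2=tot[1])
            return peaks

        peaks = find_all_peaks(dataset, {'system': lambda t: 'system',
                                         'cluster': lambda t: t.cluster,
                                         'node': lambda t: t.node})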

    Read the article

  • Why is it still so hard to write software?

    - by nornagon
    Writing software, I find, is composed of two parts: the Idea, and the Implementation. The Idea is about thinking: "I have this problem; how do I solve it?" and further, "how do I solve it elegantly?" The answers to these questions are obtainable by thinking about algorithms and architecture. The ideas come partially through analysis and partially through insight and intuition. The Idea is usually the easy part. You talk to your friends and co-workers and you nut it out in a meeting or over coffee. It takes an hour or two, plus revisions as you implement and find new problems. The Implementation phase of software development is so difficult that we joke about it. "Oh," we say, "the rest is a Simple Matter of Code." Because it should be simple, but it never is. We used to write our code on punch cards, and that was hard: mistakes were very difficult to spot, so we had to spend extra effort making sure every line was perfect. Then we had serial terminals: we could see all our code at once, search through it, organise it hierarchically and create things abstracted from raw machine code. First we had assemblers, one level up from machine code. Mnemonics freed us from remembering the machine code. Then we had compilers, which freed us from remembering the instructions. We had virtual machines, which let us step away from machine-specific details. And now we have advanced tools like Eclipse and Xcode that perform analysis on our code to help us write code faster and avoid common pitfalls. But writing code is still hard. Writing code is about understanding large, complex systems, and tools we have today simply don't go very far to help us with that. When I click "find all references" in Eclipse, I get a list of them at the bottom of the window. I click on one, and I'm torn away from what I was looking at, forced to context switch. Java architecture is usually several levels deep, so I have to switch and switch and switch until I find what I'm really looking for -- by which time I've forgotten where I came from. And I do that all day until I've understood a system. It's taxing mentally, and Eclipse doesn't do much that couldn't be done in 1985 with grep, except eat hundreds of megs of RAM. Writing code has barely changed since we were staring at amber on black. We have the theoretical groundwork for much more advanced tools, tools that actually work to help us comprehend and extend the complex systems we work with every day. So why is writing code still so hard?

    Read the article

  • Nested generator functions in python

    - by Yuval A
    Consider a tuple v = (a, b, c) and a generator function generate(x) which receives an item from the tuple and generates several options for each item. What is the pythonic way of generating a set of all the possible combinations of the result of generate(x) on each item in the tuple? I could do this:

        v = (a, b, c)
        for d in generate(v[0]):
            for e in generate(v[1]):
                for f in generate(v[2]):
                    print d, e, f

    but that's just ugly, plus I need a generic solution.
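
    A sketch of the generic approach using itertools.product, with a made-up stand-in for the question's generate():

        from itertools import product

        def generate(x):
            # stand-in for the question's generator: a few options per item
            yield x
            yield x * 2

        v = (1, 2, 3)
        # product() takes one iterable of options per tuple element and
        # yields every combination, for a tuple of any length
        for combo in product(*(generate(x) for x in v)):
            print(combo)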

    Read the article

  • Python beginner confused by a complex line of code

    - by Protean
    I understand the gist of the code, that it forms permutations; however, I was wondering if someone could explain exactly what is going on in the return statement.

        def perm(l):
            sz = len(l)
            print(l)
            if sz <= 1:
                print('sz <= 1')
                return [l]
            return [p[:i] + [l[0]] + p[i:] for i in range(sz) for p in perm(l[1:])]
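
    The return line may be easier to follow expanded into the equivalent nested loops; a sketch (prints removed):

        def perm(l):
            sz = len(l)
            if sz <= 1:
                return [l]  # a 0- or 1-element list has exactly one permutation
            result = []
            for i in range(sz):            # each position the head l[0] can occupy
                for p in perm(l[1:]):      # each permutation of the tail
                    result.append(p[:i] + [l[0]] + p[i:])  # splice the head in at i
            return result

        print(perm([1, 2, 3]))
        # [[1, 2, 3], [1, 3, 2], [2, 1, 3], [3, 1, 2], [2, 3, 1], [3, 2, 1]]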

    Read the article

  • The truth value of an array with more than one element is ambiguous when trying to index an array

    - by user1440194
    I am trying to put all the elements of rbs into a new array where the corresponding elements of var (another numpy array) are >= 0 and <= .1. However, when I try the following code I get this error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

        rbs = [ish[4] for ish in realbooks]
        for book in realbooks:
            var -= float(str(book[0]).replace(":", ""))
        bidsred = rbs[(var <= .1) and (var >= 0)]

    any ideas on what I'm doing wrong?
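
    A sketch of the usual fix, with made-up sample values: NumPy wants the elementwise & operator (with parentheses) rather than Python's and, and the list has to be an array before boolean indexing works:

        import numpy as np

        var = np.array([0.05, 0.2, 0.0, -0.3])   # made-up values
        rbs = np.array([10, 20, 30, 40])

        mask = (var >= 0) & (var <= 0.1)  # elementwise AND; `and` raises ValueError
        bidsred = rbs[mask]
        print(bidsred)                    # [10 30]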

    Read the article

  • filtering elements from list of lists in Python?

    - by user248237
    I want to filter elements from a list of lists, and iterate over the elements of each element using a lambda. For example, given the list:

        a = [[1, 2, 3], [4, 5, 6]]

    suppose that I want to keep only elements where the sum of the list is greater than N. I tried writing:

        filter(lambda x, y, z: x + y + z >= N, a)

    but I get the error:

        <lambda>() takes exactly 3 arguments (1 given)

    How can I iterate while assigning values of each element to x, y, and z? Something like zip, but for arbitrarily long lists. Thanks. P.S. I know I can write this using filter(lambda x: sum(x)..., a), but that's not the point; imagine that these were not numbers but arbitrary elements and I wanted to assign their values to variable names.
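
    One sketch: the callable receives each sub-list as a single argument, so the names have to be bound by unpacking inside the call. (Python 2's lambda (x, y, z): ... tuple parameters did this directly, but they were removed in Python 3.)

        a = [[1, 2, 3], [4, 5, 6]]
        N = 10

        # unpack each element with * so x, y, z receive its values
        kept = [item for item in a if (lambda x, y, z: x + y + z >= N)(*item)]
        print(kept)  # [[4, 5, 6]]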

    Read the article

  • Apply function to one element of a list in Python

    - by user189637
    I'm looking for a concise and functional style way to apply a function to one element of a tuple and return the new tuple, in Python. For example, for the following input:

        inp = ("hello", "my", "friend")

    I would like to be able to get the following output:

        out = ("hello", "MY", "friend")

    I came up with two solutions which I'm not satisfied with. One uses a higher-order function:

        def apply_at(arr, func, i):
            return arr[:i] + (func(arr[i]),) + arr[i+1:]

        apply_at(inp, lambda x: x.upper(), 1)

    One uses list comprehensions (this one assumes the length of the tuple is known):

        [(a, b.upper(), c) for a, b, c in [inp]][0]

    Is there a better way? Thanks!
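
    A sketch of one more option: rebuild the tuple with enumerate, transforming only the chosen position, which works for any length:

        inp = ("hello", "my", "friend")

        def apply_at(seq, func, i):
            # transform only position i while rebuilding the tuple
            return tuple(func(x) if j == i else x for j, x in enumerate(seq))

        print(apply_at(inp, str.upper, 1))  # ('hello', 'MY', 'friend')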

    Read the article

  • python - from matrix to dictionary in a single line

    - by Sanich
    matrix is a list of lists. I have to return a dictionary of the form {i: (l1[i], l2[i], ..., lm[i])}, where the key i is matched with a tuple of the i'th elements from each list. Say:

        matrix = [[1, 2, 3, 4], [9, 8, 7, 6], [4, 8, 2, 6]]

    so the line:

        >>> dict([(i, tuple(matrix[k][i] for k in xrange(len(matrix)))) for i in xrange(len(matrix[0]))])

    does the job pretty well and outputs:

        {0: (1, 9, 4), 1: (2, 8, 8), 2: (3, 7, 2), 3: (4, 6, 6)}

    but fails if the matrix is empty: matrix = []. The output should then be {}. How can I deal with this?
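
    A sketch of an alternative that handles the empty case for free: zip(*matrix) transposes the rows and enumerate supplies the keys, so an empty matrix simply produces an empty dict:

        matrix = [[1, 2, 3, 4], [9, 8, 7, 6], [4, 8, 2, 6]]

        # zip(*matrix) yields column tuples; enumerate numbers them
        print(dict(enumerate(zip(*matrix))))
        # {0: (1, 9, 4), 1: (2, 8, 8), 2: (3, 7, 2), 3: (4, 6, 6)}

        print(dict(enumerate(zip(*[]))))  # {}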

    Read the article

  • Python: Split, strip, and join in one line

    - by PandemoniumSyndicate
    I'm curious if there is some python magic I may not know to accomplish a bit of frivolity. Given the line:

        csvData.append(','.join([line.split(":").strip() for x in L]))

    I'm attempting to split a line on :, trim the whitespace around each piece, and join the pieces with ,. The problem is that since the array is returned from line.split(":"), the for x in L causes issues - L doesn't exist! - so I have no name for the array returned by line.split(":"). I'm curious if there is a sexy piece of syntax I could use to accomplish this in one shot? Cheers!
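
    A sketch of the one-liner: iterate over the split result directly, binding each piece to a name inside the comprehension, so no intermediate variable is needed:

        line = " a : b :  c "
        csvData = []
        csvData.append(','.join(piece.strip() for piece in line.split(":")))
        print(csvData)  # ['a,b,c']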

    Read the article

  • Paginating requests to an API

    - by user332912
    I'm consuming (via urllib/urllib2) an API that returns XML results. The API always returns the total_hit_count for my query, but only allows me to retrieve results in batches of, say, 100 or 1000. The API stipulates I need to specify a start_pos and end_pos for offsetting this, in order to walk through the results. Say the urllib request looks like "http://someservice?query='test'&start_pos=X&end_pos=Y". If I send an initial 'taster' query with the lowest data transfer, such as http://someservice?query='test'&start_pos=1&end_pos=1, in order to get back a result of, for conjecture, total_hits = 1234, I'd like to work out an approach to most cleanly request those 1234 results in batches of, again say, 100 or 1000 or... This is what I came up with so far, and it seems to work, but I'd like to know if you would have done things differently or if I could improve upon this:

        hits_per_page = 1000  # or 100 or 200 or whatever, adjustable
        total_hits = 1234     # retrieved with BSoup from 'taster query'
        base_url = "http://someservice?query='test'"

        startdoc_positions = [n for n in range(1, total_hits, hits_per_page)]
        enddoc_positions = [startdoc_position + hits_per_page - 1
                            for startdoc_position in startdoc_positions]

        for start, end in zip(startdoc_positions, enddoc_positions):
            if end > total_hits:
                end = total_hits
            print "url to request is:\n ",
            print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)

    p.s. I'm a long time consumer of Stack Overflow, especially the Python questions, but this is my first question posted. You guys are just brilliant.
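
    A sketch of a slightly tighter variant, reusing the question's placeholder URL: compute each (start, end) pair in one generator and clamp the last page with min():

        def page_ranges(total_hits, per_page):
            # yield (start_pos, end_pos) pairs covering 1..total_hits inclusive
            for start in range(1, total_hits + 1, per_page):
                yield start, min(start + per_page - 1, total_hits)

        base_url = "http://someservice?query='test'"  # placeholder from the question
        for start, end in page_ranges(1234, 1000):
            print("%s&start_pos=%d&end_pos=%d" % (base_url, start, end))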

    Read the article

  • Generate a sequence of Fibonacci numbers in Scala

    - by qin
        def fibSeq(n: Int): List[Int] = {
          var ret = scala.collection.mutable.ListBuffer[Int](1, 2)
          while (ret(ret.length - 1) < n) {
            val temp = ret(ret.length - 1) + ret(ret.length - 2)
            if (temp >= n) {
              return ret.toList
            }
            ret += temp
          }
          ret.toList
        }

    So the above is my code to generate a Fibonacci sequence using Scala, up to a value n. I am wondering if there is a more elegant way to do this in Scala?

    Read the article

  • Python: concatenate generator and item

    - by TarGz
    I have a generator (numbers) and a value (number). I would like to iterate over these as if they were one sequence:

        i for i in tuple(my_generator) + (my_value,)

    The problem is, as far as I understand, this creates 3 tuples only to immediately discard them, and also copies the items in "my_generator" once. A better approach would be:

        def con(seq, item):
            for i in seq:
                yield i
            yield item

        i for i in con(my_generator, my_value)

    But I was wondering whether it is possible to do it without that function definition.
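
    A sketch of the standard-library answer, itertools.chain, which concatenates iterables lazily, so no intermediate tuples are built:

        from itertools import chain

        my_generator = (n * n for n in range(4))  # made-up example generator
        my_value = 99

        # the single value rides along in a 1-tuple
        for i in chain(my_generator, (my_value,)):
            print(i)  # 0 1 4 9 99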

    Read the article

  • In Python, is it better to use list comprehensions or for-each loops?

    - by froadie
    Which of the following is better to use and why?

    Method 1:

        for k, v in os.environ.items():
            print "%s=%s" % (k, v)

    Method 2:

        print "\n".join(["%s=%s" % (k, v) for k, v in os.environ.items()])

    I tend to lean towards the first as more understandable, but that might just be because I'm new to Python and list comprehensions are still somewhat foreign to me. Is the second way considered more Pythonic? I'm assuming there's no performance difference, but I may be wrong. What would be the advantages and disadvantages of these two techniques? (Code taken from Dive Into Python)
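
    For what it's worth, a third variant keeps Method 2's single print while skipping its intermediate list; a sketch:

        import os

        # join() accepts any iterable, so a generator expression avoids
        # materialising the list that Method 2 builds first
        print("\n".join("%s=%s" % (k, v) for k, v in os.environ.items()))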

    Read the article

  • finding elements in python association lists efficiently

    - by user248237
    I have a set of lists that look like this:

        conditions = [
            ["condition1", ["sample1", "sample2", "sample3"]],
            ["condition2", ["sample4", "sample5", "sample6"]],
            ...]

    How can I do the following things efficiently and elegantly in Python? Find all the elements in a certain condition, e.g. get all the samples in condition2. Right now I can do:

        for cond in conditions:
            cond_name, samples = cond
            if cond_name == requested_cond:
                return samples

    but that's clunky. Find the ordered union of a list of conditions, e.g. ordered_union(["condition1", "condition2"], conditions) should return:

        ["sample1", "sample2", "sample3", "sample4", "sample5", "sample6"]

    How can I do this efficiently in Python? There are probably clever one-liners?
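
    A sketch of the usual approach: convert the pairs to a dict once, which makes lookup O(1) and the ordered union a short comprehension (the trailing ellipsis entry is dropped here so the snippet runs):

        conditions = [
            ["condition1", ["sample1", "sample2", "sample3"]],
            ["condition2", ["sample4", "sample5", "sample6"]],
        ]

        by_name = dict(conditions)       # name -> samples, built in one pass
        print(by_name["condition2"])     # ['sample4', 'sample5', 'sample6']

        def ordered_union(names, cond_map):
            # concatenate the sample lists in the requested order
            return [s for name in names for s in cond_map[name]]

        print(ordered_union(["condition1", "condition2"], by_name))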

    Read the article

  • Is it possible to do a web search filtered for the user's comprehension level? [migrated]

    - by Tim
    Do any of the major search engines support a filter to indicate the comprehension level of the user and thus to filter out content beyond their scope? So for example, if someone wanted to find results on "astronomy", but to only show content appropriate for someone in junior-high or lower. Alternatively, if there isn't such an existing method directly, any ideas on how this might be accomplished with existing tools? I found this paper, which seems to indicate folks have looked at this, just don't know if there is a real world tool to do this.

    Read the article

  • How to use Python list comprehension (or such) for retrieving rows when using MySQLdb?

    - by Erik Nygren
    Hey all, I use MySQLdb a lot when dealing with my webserver. I often find myself repeating the lines:

        row = cursor.fetchone()
        while row:
            do_processing(row)
            row = cursor.fetchone()

    Somehow this strikes me as somewhat un-pythonic. Is there a better, one-line way to accomplish the same thing, along the lines of inline assignment in C:

        while (row = do_fetch()) {
            do_processing(row);
        }

    I've tried figuring out the syntax using list comprehensions, but I can't seem to figure it out. Any recommendations? Thanks, Erik
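
    A sketch, assuming placeholder connection details and a reachable server: MySQLdb cursors are themselves iterable, so a plain for loop replaces the fetchone() dance:

        import MySQLdb  # assumes the MySQLdb package is installed

        conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="test")
        cursor = conn.cursor()
        cursor.execute("SELECT 1 UNION SELECT 2")

        # iterating a cursor fetches rows one at a time
        for row in cursor:
            print(row)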

    Read the article

  • speed of map() vs. list comprehension vs. numpy vectorized function in python

    - by mcstrother
    I have a function foo(i) that takes an integer and takes a significant amount of time to execute. Will there be a significant performance difference between any of the following ways of initializing a:

        a = [foo(i) for i in xrange(100)]

        a = map(foo, range(100))

        vfoo = numpy.vectorize(foo)
        a = vfoo(range(100))

    (I don't care whether the output is a list or a numpy array.) Is there some other better way of doing this? Thanks.
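
    One way to answer this empirically, sketched with a made-up cheap foo: time all three with timeit. (Note that numpy.vectorize is, per its documentation, essentially a Python-level loop provided for convenience, not performance.)

        import timeit
        import numpy

        def foo(i):
            return i * i  # stand-in; the question's foo is expensive

        vfoo = numpy.vectorize(foo)

        print(timeit.timeit(lambda: [foo(i) for i in range(100)], number=10000))
        print(timeit.timeit(lambda: list(map(foo, range(100))), number=10000))
        print(timeit.timeit(lambda: vfoo(range(100)), number=10000))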

    Read the article

  • Can I do a reduce on a list comprehension into two lists, based on two values?

    - by pdknsk
    I've got the following code:

        sum_review = reduce(add, [book['rw'] for book in books])
        sum_rating = reduce(add, [book['rg'] for book in books])
        items = len(books)
        avg_review = sum_review / items
        avg_rating = sum_rating / items

    What I'd like is this:

        sum_review, sum_rating = reduce(add, ((book['rw'], book['rg']) for book in books))
        items = len(books)
        avg_review = sum_review / items
        avg_rating = sum_rating / items

    Obviously this doesn't work. How can I solve this redundancy, without a regular loop?
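
    A sketch of one way to get both sums in a single pass, with made-up sample data: build the (rw, rg) pairs, transpose them with zip(*...), and sum each column:

        books = [{'rw': 4, 'rg': 8}, {'rw': 2, 'rg': 6}]  # made-up sample data

        # zip(*pairs) turns [(4, 8), (2, 6)] into (4, 2) and (8, 6)
        sum_review, sum_rating = (sum(col) for col in
                                  zip(*((b['rw'], b['rg']) for b in books)))
        items = len(books)
        print(sum_review / items, sum_rating / items)  # 3.0 7.0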

    Read the article

  • How do I return a String from a for comprehension in Scala?

    - by Vonn
    Scala newbie alert: basically I'm trying to do something like this, where I pattern match and return a String:

        scala> def processList(list: List[String], m: String): String = {
             |   list foreach (x => m match {
             |     case "test" => "we got test"
             |     case "test1" => "we got test1"
             |   })
             | }
        <console>:10: error: type mismatch;
         found   : Unit
         required: String

    I know I can set a var and return it after the for comprehension... but that doesn't seem to be the Scala way.

    Read the article

  • Are short identifiers bad?

    - by Daniel C. Sobral
    Are short identifiers bad? How does identifier length correlate with code comprehension? What other factors (besides code comprehension) might be worth considering when it comes to naming identifiers? Just to try to keep the quality of the answers up, please note that there is some research on the subject already! Edit: Curious that everyone either doesn't think length is relevant or tends to prefer longer identifiers, when both links I provided indicate that long identifiers are harmful!

    Read the article

  • Find an element in a JavaScript array

    - by Aligned
    Originally posted on: http://geekswithblogs.net/Aligned/archive/2014/08/22/find-an-element-in-a-javascript-array.aspx

    I needed a C# Dictionary-like data structure in JavaScript and then a way to find that object by a key. I had forgotten how to do this, so did some searching and talked to a colleague and came up with this jsFiddle. See the code in my jsFiddle or below:

        var processingProgressTimeoutIds = [];
        var file = { name: 'test', timeId: 1 };
        var file2 = { name: 'test2', timeId: 2 };
        var file3 = { name: 'test3', timeId: 3 };

        processingProgressTimeoutIds.push({ name: file.name, timerId: file.timeId });
        processingProgressTimeoutIds.push({ name: file2.name, timerId: file2.timeId });
        processingProgressTimeoutIds.push({ name: file3.name, timerId: file3.timeId });

        console.log(JSON.stringify(processingProgressTimeoutIds));

        var keyName = 'test';
        var match = processingProgressTimeoutIds.filter(function (item) {
            return item.name === keyName;
        })[0];
        console.log(JSON.stringify(match));

        // optimization
        var match2 = processingProgressTimeoutIds.some(function (element, index, array) {
            return element.name === keyName;
        });
        console.log(JSON.stringify(match2));

        // if you have the full object
        var match3 = processingProgressTimeoutIds.indexOf(file);
        console.log(JSON.stringify(match3));

        // http://jsperf.com/array-find-equal - from Dave
        // indexOf is faster, but I need to find it by the key, so I can't use it here

        // ES6 will rock though, array comprehension! - also from Dave
        // var ys = [x of xs if x == 3];
        // var y = ys[0];

    Here's a good blog post on Array comprehension.

    Read the article

  • Language Design: Are languages like Python and CoffeeScript really more comprehensible?

    - by kittensatplay
    The "Verbally Readable !== Quicker Comprehension" argument on http://ryanflorence.com/2011/case-against-coffeescript/ is really potent and interesting. I and I'm sure others would be very interested in evidence arguing against this. There's clear evidence for this and I believe it. People naturally think in images, not words, so we should be designing languages that aren't similar to human language like English, French, whatever. Being "readable" is quicker comprehension. Most articles on Wikipedia are not readable as they are long, boring, dry, sluggish and very very wordy. Because Wikipedia documents a ton of info, it is not especially helpful when compared to sites with more practical, useful and relevant info. Languages like Python and CoffeScript are "verbally readable" in that they are closer to English syntax. Having programmed firstly and mainly in Python, I'm not so sure this is really a good thing. The second interesting argument is that CoffeeScript is an intermediator, a step between two ends, which may increase the chance of bugs. While CoffeeScript has other practical benefits, this question specifically requests evidence showing support for the counter-case of language "readability"

    Read the article

  • HTML5 and CSS3/WebGL - A two-book box set, reviewed by Julien Plu

    I have just finished this box set of two books, one on HTML5 and CSS3 and the other on WebGL, both of which were very good. Here is my review, for anyone interested in learning more about this set. The HTML5 and CSS3 book is aimed at anyone called upon to develop, build, or maintain a web site. Indeed, to get started, and above all to make progress in designing sites, you inevitably need a good understanding and a...

    Read the article

< Previous Page | 1 2 3 4 5 6  | Next Page >