skyl - Developer IT

Save memory in Python. How to iterate over the lines and save them efficiently with a 2million line

- by skyl

I have a tab-separated data file with a little over 2 million lines and 19 columns. You can find it in US.zip: http://download.geonames.org/export/dump/. I started to run the following, but with for l in f.readlines(). I understand that just iterating over the file is supposed to be more efficient, so I'm posting that below. Still, with this small optimization, I'm using 10% of my memory on the process and have only done about 3% of the records. It looks like, at this pace, it will run out of memory like it did before. Also, the function I have is very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects with each pass of the for loop?

    def run():
        from geonames.models import POI
        f = file('data/US.txt')
        for l in f:
            li = l.split('\t')
            try:
                p = POI()
                p.geonameid = li[0]
                p.name = li[1]
                p.asciiname = li[2]
                p.alternatenames = li[3]
                p.point = "POINT(%s %s)" % (li[5], li[4])
                p.feature_class = li[6]
                p.feature_code = li[7]
                p.country_code = li[8]
                p.ccs2 = li[9]
                p.admin1_code = li[10]
                p.admin2_code = li[11]
                p.admin3_code = li[12]
                p.admin4_code = li[13]
                p.population = li[14]
                p.elevation = li[15]
                p.gtopo30 = li[16]
                p.timezone = li[17]
                p.modification_date = li[18]
                p.save()
            except IndexError:
                pass

    if __name__ == "__main__":
        run()
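One direction that may help with both memory and speed is to stream the file and hand rows to the database in fixed-size batches instead of saving one object per line. The sketch below is an illustration under assumptions, not the asker's code: `iter_batches`, `load`, and the `save_batch` callback are hypothetical names, and the 19-column check stands in for the original `except IndexError`.

```python
from itertools import islice


def iter_batches(lines, batch_size=1000, min_columns=19):
    """Yield lists of parsed rows, holding at most batch_size rows in memory."""
    # Strip the line ending so the last column does not carry a trailing "\n".
    rows = (line.rstrip("\r\n").split("\t") for line in lines)
    # Skip short rows up front instead of catching IndexError later.
    complete = (row for row in rows if len(row) >= min_columns)
    while True:
        batch = list(islice(complete, batch_size))
        if not batch:
            return
        yield batch


def load(path, save_batch, batch_size=1000):
    """Stream the file at `path`, passing each batch to `save_batch`."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for batch in iter_batches(f, batch_size):
            save_batch(batch)  # hypothetical callback that writes one batch
            total += len(batch)
    return total
```

In a Django project, `save_batch` could build `POI` instances from each row and pass them to `POI.objects.bulk_create(...)`, assuming a Django version that provides it (1.4 or later). It is also worth checking whether `DEBUG = True` is set: in that mode Django keeps a record of every executed query in `connection.queries`, which grows steadily in a long-running loop.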

Get a queryset of objects through an intermediary model

- by skyl

I want to get all of the Geom objects that are related to a certain content_object (see the function I'm trying to build at the bottom, get_geoms_for_object()).

    class Geom(models.Model):
        ...

    class GeomRelation(models.Model):
        ''' For tagging many objects to a Geom object and vice-versa'''
        geom = models.ForeignKey(Geom)
        content_type = models.ForeignKey(ContentType)
        object_id = models.PositiveIntegerField()
        content_object = generic.GenericForeignKey()

    def get_geoms_for_object(obj):
        ''' takes an object and gets the geoms that are related '''
        ct = ContentType.objects.get_for_model(obj)
        id = obj.id
        grs = GeomRelation.objects.filter(content_type=ct, object_id=id)
        # how with django orm magic can I build the queryset instead of list
        # like below to get all of the Geom objects for a given content_object
        geoms = []
        for gr in grs:
            geoms.append(gr.geom)
        return set(geoms)
        # A set makes it so that I have no redundant entries but I want the
        # queryset ordering too .. need to make it a queryset for so many reasons...
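One way this could be expressed as a single queryset is to filter Geom through the reverse side of the foreign key. The sketch below assumes GeomRelation.geom has no related_name, so Django's default reverse lookup name `geomrelation` applies. The two manager parameters are an addition for illustration so the function can be read and exercised on its own; they are not part of the original models.

```python
def get_geoms_for_object(obj, geom_manager, content_type_manager):
    """Return a queryset of Geom objects tagged to `obj`.

    geom_manager is expected to be Geom.objects and content_type_manager
    to be ContentType.objects (both passed in here as an assumption, to
    keep the sketch free of project imports).
    """
    ct = content_type_manager.get_for_model(obj)
    # Follow the reverse foreign key from Geom to GeomRelation.
    # "geomrelation" is the default reverse name when no related_name is set.
    return geom_manager.filter(
        geomrelation__content_type=ct,
        geomrelation__object_id=obj.id,
    ).distinct()  # drop duplicates if several relations point at one Geom
```

Called as `get_geoms_for_object(obj, Geom.objects, ContentType.objects)`, the result stays a queryset, so it can still be ordered, sliced, or filtered further.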

Foreign-key-like merge in R

- by skyl

I'm merging a bunch of csv with 1 row per id/pk/seqn.

    > full = merge(demo, lab13am, by="seqn", all=TRUE)
    > full = merge(full, cdq, by="seqn", all=TRUE)
    > full = merge(full, mcq, by="seqn", all=TRUE)
    > full = merge(full, cfq, by="seqn", all=TRUE)
    > full = merge(full, diq, by="seqn", all=TRUE)
    > print(length(full$ridageyr))
    [1] 9965
    > print(summary(full$ridageyr))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       0.00   11.00   19.00   29.73   48.00   85.00

Everything is great. But, I have another file which has multiple rows per id like:

    "seqn","rxd030","rxd240b","nhcode","rxq250"
    56,2,"","",NA,NA,""
    57,1,"ACETAMINOPHEN","01200",2
    57,1,"BUDESONIDE","08800",1
    58,1,"99999","",NA

57 has two rows. So, if I naively try to merge this file, I have a ton more rows and my data gets all skewed up.

    > full = merge(full, rxq, by="seqn", all=TRUE)
    > print(length(full$ridageyr))
    [1] 15643
    > print(summary(full$ridageyr))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       0.00   14.00   41.00   40.28   66.00   85.00

Is there a normal idiomatic way to deal with data like this? Suppose I want a way to make a simple model like

    MYSPECIAL_FACTOR <- somehow()
    glm(MYSPECIAL_FACTOR ~ full$ridageyr, family=binomial)

where MYSPECIAL_FACTOR is, say, whether or not rxd240b == "ACETAMINOPHEN" for the observations which are unique by seqn. You can reproduce by running the first bit of this.
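A common way to handle a many-rows-per-id table is to collapse it to one value per id first and only then join it to the one-row-per-id data, so the row count does not change. The sketch below shows that idea in Python (to match the other examples on this page, although the question is about R). The function names, the `took_acetaminophen` column, and the ages are made up for illustration; the rxq rows mirror the sample in the post.

```python
def flag_by_id(rows, id_key, column, value):
    """Collapse many rows per id into one boolean per id."""
    flags = {}
    for row in rows:
        key = row[id_key]
        # True if any row for this id matches the value.
        flags[key] = flags.get(key, False) or row[column] == value
    return flags


def attach_flag(base_rows, flags, id_key, new_column):
    """Left-join the per-id flag; ids absent from `flags` get None."""
    return [dict(row, **{new_column: flags.get(row[id_key])}) for row in base_rows]


# One row per seqn (ages are invented for the example).
full = [
    {"seqn": 56, "ridageyr": 30},
    {"seqn": 57, "ridageyr": 61},
    {"seqn": 58, "ridageyr": 12},
    {"seqn": 59, "ridageyr": 45},
]
# Several rows per seqn, as in the prescription file.
rxq = [
    {"seqn": 56, "rxd240b": ""},
    {"seqn": 57, "rxd240b": "ACETAMINOPHEN"},
    {"seqn": 57, "rxd240b": "BUDESONIDE"},
    {"seqn": 58, "rxd240b": "99999"},
]

flags = flag_by_id(rxq, "seqn", "rxd240b", "ACETAMINOPHEN")
merged = attach_flag(full, flags, "seqn", "took_acetaminophen")
```

Here `merged` has the same number of rows as `full`, with one flag per seqn that could serve as the response variable. The same collapse-before-merge step can be done in R before calling merge.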
