Foreign-key-like merge in R

Posted by skyl on Stack Overflow See other posts from Stack Overflow or by skyl
Published on 2012-07-07T02:41:36Z Indexed on 2012/07/07 3:15 UTC
Read the original article Hit count: 120

Filed under:

I'm merging a bunch of csv with 1 row per id/pk/seqn.

> full = merge(demo, lab13am, by="seqn", all=TRUE)
> full = merge(full, cdq, by="seqn", all=TRUE)
> full = merge(full, mcq, by="seqn", all=TRUE)
> full = merge(full, cfq, by="seqn", all=TRUE)
> full = merge(full, diq, by="seqn", all=TRUE)
> print(length(full$ridageyr))
[1] 9965
> print(summary(full$ridageyr))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   11.00   19.00   29.73   48.00   85.00 

Everything is great. But, I have another file which has multiple rows per id like:

"seqn","rxd030","rxd240b","nhcode","rxq250"
56,2,"","",NA,NA,""
57,1,"ACETAMINOPHEN","01200",2
57,1,"BUDESONIDE","08800",1
58,1,"99999","",NA

57 has two rows. So, if I naively try to merge this file, I have a ton more rows and my data gets all skewed up.

> full = merge(full, rxq, by="seqn", all=TRUE)
> print(length(full$ridageyr))
[1] 15643
> print(summary(full$ridageyr))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   14.00   41.00   40.28   66.00   85.00 

Is there a normal idiomatic way to deal with data like this? Suppose I want a way to make a simple model like

MYSPECIAL_FACTOR <- somehow()
glm(MYSPECIAL_FACTOR ~ full$ridageyr, family=binomial)

where MYSPECIAL_FACTOR is, say, whether or not rxd240b == "ACETAMINOPHEN" for the observations which are unique by seqn. You can reproduce by running the first bit of this.

© Stack Overflow or respective owner

Related posts about r