I'm merging a bunch of CSV files, each with one row per id (the primary key, seqn).
> full = merge(demo, lab13am, by="seqn", all=TRUE)
> full = merge(full, cdq, by="seqn", all=TRUE)
> full = merge(full, mcq, by="seqn", all=TRUE)
> full = merge(full, cfq, by="seqn", all=TRUE)
> full = merge(full, diq, by="seqn", all=TRUE)
> print(length(full$ridageyr))
[1] 9965
> print(summary(full$ridageyr))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 11.00 19.00 29.73 48.00 85.00
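For what it's worth, the chain of merges above can also be written as a single fold over a list of the one-row-per-seqn frames (dfs is just shorthand I'm making up here):

dfs <- list(demo, lab13am, cdq, mcq, cfq, diq)
full <- Reduce(function(x, y) merge(x, y, by = "seqn", all = TRUE), dfs)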
So far so good. But I have another file with multiple rows per id, like this:
"seqn","rxd030","rxd240b","nhcode","rxq250"
56,2,"","",NA
57,1,"ACETAMINOPHEN","01200",2
57,1,"BUDESONIDE","08800",1
58,1,"99999","",NA
Seqn 57 has two rows, so if I naively merge this file in, I get far more rows and my summary statistics get skewed:
> full = merge(full, rxq, by="seqn", all=TRUE)
> print(length(full$ridageyr))
[1] 15643
> print(summary(full$ridageyr))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 14.00 41.00 40.28 66.00 85.00
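The inflation is easy to confirm by checking for repeated ids before merging (a quick sanity check):

any(duplicated(rxq$seqn))   # TRUE: some ids appear more than once
table(table(rxq$seqn))      # distribution of rows per seqn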
Is there a standard, idiomatic way to deal with data like this? Suppose I want to fit a simple model like
MYSPECIAL_FACTOR <- somehow()
glm(MYSPECIAL_FACTOR ~ full$ridageyr, family=binomial)
where MYSPECIAL_FACTOR is, say, whether rxd240b == "ACETAMINOPHEN" for each observation, with one observation per seqn. You can reproduce by running the first bit of this.
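For concreteness, here is the kind of thing I'm imagining for somehow(), starting from the full built before the naive rxq merge. This is just a sketch with made-up names (acet, took_acet), not something I know to be idiomatic: collapse rxq to one row per seqn, flagging whether any of that id's rows is ACETAMINOPHEN, then merge that flag in.

acet <- aggregate(rxd240b ~ seqn, data = rxq,
                  FUN = function(x) any(x == "ACETAMINOPHEN"))
names(acet)[2] <- "took_acet"                       # one logical flag per seqn
full2 <- merge(full, acet, by = "seqn", all.x = TRUE)
full2$took_acet[is.na(full2$took_acet)] <- FALSE    # ids with no rxq rows at all
fit <- glm(took_acet ~ ridageyr, data = full2, family = binomial)

Is that roughly the standard pattern, or is there a better way?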