Sampling Duplicates

Posted by user3640982 on Stack Overflow
Published on 2014-06-09T15:18:15Z

I have a dataset that I need to sample from. It has an ID field and a Year field. I want every record from the most current year, and then the most current IDs sampled from every 3rd year going back. The data is ordered by year.

For example

ID <- rep(1:3, 5)
Year <- rep(c(1, 2, 3, 4, 5), each = 3)
df <- data.frame(ID, Year)

   ID Year
1   1    1
2   2    1
3   3    1
4   1    2
5   2    2
6   3    2
7   1    3
8   2    3
9   3    3
10  1    4
11  2    4
12  3    4
13  1    5
14  2    5
15  3    5

So from this example, I would want to return

  ID Year
1  1    1
2  2    1
3  3    1
4  1    4
5  2    4
6  3    4
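
For reference, one reading of the expected output above is "keep every 3rd year, counting from the earliest year". The following is only a sketch under that assumption; the names keep_years and sampled are illustrative and not part of the original data. If the intent is instead to count back from the latest year, seq(max(df$Year), min(df$Year), by = -3) would select different years.

```r
# Rebuild the example data so the snippet runs on its own
ID <- rep(1:3, 5)
Year <- rep(c(1, 2, 3, 4, 5), each = 3)
df <- data.frame(ID, Year)

# Assumption: sample every 3rd year, starting from the earliest year
keep_years <- seq(min(df$Year), max(df$Year), by = 3)

# Keep only the rows whose Year is one of the sampled years
sampled <- df[df$Year %in% keep_years, ]
rownames(sampled) <- NULL  # renumber the rows from 1

print(sampled)
```

Filtering on Year directly avoids relying on row positions, so it still works when some IDs are missing in some years.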

I'm thinking that some combination of duplicated() and which() should get what I want, but the problem is that duplicated() only tells me whether a value has already appeared; it doesn't say which record is being repeated.

 which(duplicated(df$ID))
[1]  4  5  6  7  8  9 10 11 12 13 14 15

This is a problem since not every ID exists in every year.
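
Two base R tools might be relevant here, sketched on the example data. match() reports, for each row, the row where that ID first appears, which is the information duplicated() leaves out. duplicated() with fromLast = TRUE flags every occurrence except the last, so negating it keeps the most recent row per ID even when an ID is absent from some years. This relies on the data being ordered by year, as stated above; the names first_row and latest are illustrative.

```r
# Rebuild the example data so the snippet runs on its own
ID <- rep(1:3, 5)
Year <- rep(c(1, 2, 3, 4, 5), each = 3)
df <- data.frame(ID, Year)

# For every row, the index of the row where that ID first appears
first_row <- match(df$ID, df$ID)

# Keep only the last row for each ID; because df is ordered by Year,
# that row is the most recent record for the ID
latest <- df[!duplicated(df$ID, fromLast = TRUE), ]

print(first_row)
print(latest)
```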

Any help would be appreciated.

Thanks, Eric
