R and version control for the solo data analyst
Posted
by Jeromy Anglim
on Stack Overflow
Published on 2010-04-26T09:46:43Z
Many data analysts whom I respect use version control. For example:
- http://github.com/hadley/
- See comments on http://permut.wordpress.com/2010/04/21/revision-control-statistics-bleg/
However, I'm evaluating whether adopting a version control system such as git would be worthwhile.
A brief overview: I'm a social scientist who uses R to analyse data for research publications. I don't currently produce R packages. My R code for a project typically includes a few thousand lines of code for data input, cleaning, manipulation, analyses, and output generation. Publications are typically written using LaTeX.
With regard to version control, I have read about many benefits, yet they seem less relevant to the solo data analyst.
- Backup: I have a backup system already in place.
- Forking and rewinding: I've never felt the need to do this, but I can see how it could be useful (e.g., you are preparing multiple journal articles based on the same dataset; you are preparing a report that is updated monthly, etc.)
- Collaboration: Most of the time I analyse data by myself, so I wouldn't get the collaboration benefits of version control.
There are also several potential costs involved with adopting version control:
- Time to evaluate and learn a version control system
- A possible increase in complexity over my current file management system
However, I still have the feeling that I'm missing something. General guides on version control seem to be addressed more towards computer scientists than data analysts.
Thus, specifically in relation to data analysts in circumstances similar to those listed above:
- Is version control worth the effort?
- What are the main pros and cons of adopting version control?
- What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?