R and version control for the solo data analyst
- by Jeromy Anglim
Many data analysts that I respect use version control.
For example:
http://github.com/hadley/
See comments on http://permut.wordpress.com/2010/04/21/revision-control-statistics-bleg/
However, I'm evaluating whether adopting a version control system such as git would be worthwhile.
A brief overview:
I'm a social scientist who uses R to analyse data for research publications.
I don't currently produce R packages.
My R code for a project typically includes a few thousand lines of code for data input, cleaning, manipulation, analyses, and output generation.
Publications are typically written using LaTeX.
With regard to version control, I have read about many benefits, yet they seem less relevant to the solo data analyst:
Backup: I have a backup system already in place.
Forking and rewinding: I've never felt the need to do this,
but I can see how it could be useful (e.g., you are preparing multiple
journal articles based on the same dataset; you are preparing a report
that is updated monthly, etc.; see the sketch below).
Collaboration: Most of the time I am analysing data on my own,
so I wouldn't get the collaboration benefits of version control.
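
For concreteness, here is a minimal sketch of what "forking and rewinding" might look like in git, assuming the project directory is already a repository (the branch and file names are just placeholders I made up):

    # create a branch for the analyses behind a second journal article
    git checkout -b article-2

    # edit analysis.R, then record the change on that branch
    git add analysis.R
    git commit -m "Alternative model for article 2"

    # switch back to the main line of work at any time
    git checkout master

    # rewind: list earlier commits and restore a file as it was at one of them
    git log --oneline
    git checkout <commit-hash> -- analysis.R
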
There are also several potential costs involved with adopting version control:
Time to evaluate and learn a version control system
A possible increase in complexity over my current file management system
However, I still have the feeling that I'm missing something.
General guides on version control seem to be addressed more towards computer scientists than data analysts.
Thus, specifically in relation to data analysts in circumstances similar to those listed above:
Is version control worth the effort?
What are the main pros and cons of adopting version control?
What is a good strategy for getting started with version control
for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?
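
To make that last question more concrete, the kind of minimal workflow I have in mind looks something like the following (a sketch only, assuming git is the chosen system; the directory and file names are hypothetical):

    # put an existing R/LaTeX project under version control
    cd ~/projects/study-1
    git init

    # keep large or confidential raw data out of the repository
    echo "data/raw/" >> .gitignore

    # track the analysis scripts and the manuscript
    git add *.R manuscript.tex .gitignore
    git commit -m "Initial import of analysis scripts and manuscript"

    # commit after each meaningful change to the analysis
    git add cleaning.R
    git commit -m "Recode missing values in survey items"

Is something along these lines sensible, or is there a better-established workflow for this situation?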