Your thoughts on Best Practices for Scientific Computing?
Posted by John Smith on Programmers. Published on 2014-06-09T00:34:43Z.
Tags: programming-practices, unit-testing, version-control, documentation, defensive-programming
A recent paper by Wilson et al. (2014) lays out 24 best practices for scientific programming. It's worth a look. I would like to hear opinions on these points from programmers experienced in scientific data analysis. Do you think this advice is helpful and practical? Or is it good only in an ideal world?
Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, Haddock SHD, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P (2014) Best Practices for Scientific Computing. PLoS Biol 12:e1001745.
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001745
Box 1. Summary of Best Practices
1. Write programs for people, not computers.
(a) A program should not require its readers to hold more than a handful of facts in memory at once.
(b) Make names consistent, distinctive, and meaningful.
(c) Make code style and formatting consistent.
2. Let the computer do the work.
(a) Make the computer repeat tasks.
(b) Save recent commands in a file for re-use.
(c) Use a build tool to automate workflows.
3. Make incremental changes.
(a) Work in small steps with frequent feedback and course correction.
(b) Use a version control system.
(c) Put everything that has been created manually in version control.
4. Don’t repeat yourself (or others).
(a) Every piece of data must have a single authoritative representation in the system.
(b) Modularize code rather than copying and pasting.
(c) Re-use code instead of rewriting it.
5. Plan for mistakes.
(a) Add assertions to programs to check their operation.
(b) Use an off-the-shelf unit testing library.
(c) Turn bugs into test cases.
(d) Use a symbolic debugger.
6. Optimize software only after it works correctly.
(a) Use a profiler to identify bottlenecks.
(b) Write code in the highest-level language possible.
7. Document design and purpose, not mechanics.
(a) Document interfaces and reasons, not implementations.
(b) Refactor code in preference to explaining how it works.
(c) Embed the documentation for a piece of software in that software.
8. Collaborate.
(a) Use pre-merge code reviews.
(b) Use pair programming when bringing someone new up to speed and when tackling particularly tricky problems.
(c) Use an issue tracking tool.
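As a rough illustration of items 5(b) and 5(c), here is a minimal sketch using Python's standard unittest library. Python is chosen only for illustration, and the function `normalize` and the bug it is said to have had are hypothetical, not taken from the paper:

```python
import unittest


def normalize(values):
    """Scale values so they sum to 1 (hypothetical analysis helper)."""
    total = sum(values)
    if total == 0:
        # Hypothetical past bug: this case used to end in ZeroDivisionError.
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


class NormalizeTest(unittest.TestCase):
    def test_sums_to_one(self):
        result = normalize([1.0, 3.0])
        self.assertAlmostEqual(sum(result), 1.0)
        self.assertAlmostEqual(result[0], 0.25)

    def test_zero_total_is_reported_clearly(self):
        # Regression test added after the (hypothetical) bug was found (5c).
        with self.assertRaises(ValueError):
            normalize([0.0, 0.0])
```

Saved in a file, the tests can be run with `python -m unittest`. Once a bug has a test like the second one, it cannot silently come back.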
I'm relatively new to serious programming for scientific data analysis. When I wrote code for pilot analyses of some of my data last year, I ran into a tremendous number of bugs in both my code and my data. Bugs and errors had always been around, but this time it was overwhelming. I managed to crunch the numbers in the end, but I felt I couldn't put up with the mess any longer. Something had to be done.
Without a sophisticated guide like the article above, I have since started to adopt a "defensive style" of programming. A book titled "The Art of Readable Code" helped me a lot. I added meticulous input validation or assertions to every function, renamed many variables and functions for better readability, and extracted many subroutines as reusable functions. Recently, I introduced Git and SourceTree for version control.
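To make the "defensive style" concrete, here is a minimal sketch of what I mean by input validation plus an assertion. It is written in Python purely for illustration, and the function name and the "reaction time" data are hypothetical:

```python
import math


def mean_reaction_time(times_ms):
    """Return the mean of a non-empty list of reaction times in ms."""
    # Validate inputs up front so that bad data fails loudly at the boundary.
    if len(times_ms) == 0:
        raise ValueError("times_ms must not be empty")
    for t in times_ms:
        if not math.isfinite(t) or t < 0:
            raise ValueError(f"invalid reaction time: {t!r}")

    mean = sum(times_ms) / len(times_ms)

    # The assertion checks the function's own logic, not the caller's data.
    assert mean >= 0, "mean of non-negative values must be non-negative"
    return mean
```

The validation guards against bad data, while the assertion guards against mistakes in the code itself.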
At the moment, because my co-workers are much more reluctant about these issues, the collaboration practices (8a,b,c) have not been introduced. In fact, as the authors admit, all of these practices take some time and effort to introduce, so it may generally be hard to persuade reluctant collaborators to comply with them.
I think I'm asking for your opinions because I still suffer from many bugs despite all my effort on many of these practices. Bug fixing may be, or should be, faster than before, but I can't really measure the improvement. Moreover, much of my time has gone into defence, meaning that I haven't actually done much data analysis (offence) these days. At what point should I stop, in terms of productivity?
I've already deployed: 1a,b,c, 2a, 3a,b,c, 4b,c, 5a,d, 6a,b, 7a,b
I'm about to have a go at: 5b,c
Not yet: 2b,c, 4a, 7c, 8a,b,c
(I could not really see the advantage of using GNU make (2c) for my purpose. Could anyone tell me how it helps my work with MATLAB?)
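For context, my understanding of the core idea behind a build tool like make is that each output file is rebuilt only when it is missing or older than the files it depends on. A minimal sketch of that rule in Python follows. It is not make itself and is not MATLAB-specific; `needs_rebuild` and `run_step` are hypothetical names:

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def run_step(target, sources, action):
    """Run action() only when target is out of date; report whether it ran."""
    if needs_rebuild(target, sources):
        action()
        return True
    return False
```

For example, `run_step("results.csv", ["raw_data.csv", "analysis.py"], run_analysis)` would skip the analysis when neither input has changed since results.csv was written (all of these names are hypothetical). make expresses the same dependencies declaratively in a Makefile and chains such steps together.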