An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
-- Jonathan Buckheit and David Donoho, paraphrasing Jon Claerbout 1995
While currently there is unilateral emphasis on 'first' discoveries, there should be as much emphasis on replication of discoveries.
-- John P. A. Ioannidis
papers are not static views of the end state of a research conclusion, but interactive documents where the reader can become the analyst.
selecting any figure in a scientific paper brings about interactivity.
extracting a working software environment and data from every article, report, and book is as simple as clicking a hyperlink.
all data from a discipline is accessible in a public database in perpetuity.
I need some human motion walking data. I know tons of people have collected data on human motion. I've seen tons of papers and there are gait labs all over the world now.
Oh! Look! There is the website called http://humanmotiondata.org. It looks like I just have to type in this query...
where (treadmill is True) AND (20 < age < 40) AND (1.0 < speed < 2.0) AND (num_markers > 30) AND (force_plate is dual)
Querying results...please wait.
You have found 30,735 trials of human motion data, download C3D formatted data at this link http://humanmotiondata.org/your-awesome-data-set.zip
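The query above is imagined, but a filter like it is easy to express once trial metadata is machine readable. A minimal sketch, assuming a hypothetical list of trial records whose field names (`treadmill`, `age`, `speed`, `num_markers`, `force_plate`) simply mirror the query; the values are made up and nothing here is the API of a real site:

```python
# Hypothetical trial metadata; field names mirror the imagined query
# and the values are invented for illustration.
trials = [
    {"id": 1, "treadmill": True, "age": 25, "speed": 1.2,
     "num_markers": 47, "force_plate": "dual"},
    {"id": 2, "treadmill": False, "age": 31, "speed": 1.4,
     "num_markers": 35, "force_plate": "dual"},
    {"id": 3, "treadmill": True, "age": 62, "speed": 1.1,
     "num_markers": 40, "force_plate": "single"},
    {"id": 4, "treadmill": True, "age": 38, "speed": 1.9,
     "num_markers": 31, "force_plate": "dual"},
]


def matches(trial):
    """Return True if a trial satisfies every condition in the query."""
    return (trial["treadmill"]
            and 20 < trial["age"] < 40
            and 1.0 < trial["speed"] < 2.0
            and trial["num_markers"] > 30
            and trial["force_plate"] == "dual")


# Keep only the identifiers of the trials that pass every condition.
selected = [trial["id"] for trial in trials if matches(trial)]
print(selected)
```

The point is less the filter itself than the precondition: it only works if every lab records the same fields in the same way.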
Software and Data have a Magic Property:
Copying is extremely low cost and in many cases instantaneous.
Currently data is hoarded by the collector because the only things valued are the results. Academia does not value good data collectors the way it does good analysts.
How much time do you spend developing code to get your results?
How much time do you spend collecting data?
Should we let other scientists "stand on our shoulders"?
Should other scientists have to reinvent the wheel?
How about so much so that you write software to extract the data from the image.
Most data sees one use and is locked away in lab notebooks and old hard drives.
Most scientists think of programming as a tax they have to pay in order to do science.
Scientists do not care about reproducibility, as it is a perceived hurdle to productivity.
Respondents reported that the single biggest barrier to sharing code and data was the time it takes to clean up and document the work to prepare it for release and reuse (56 percent of respondents cited this reason for not sharing data and 78 percent cited this reason for not sharing code).
--- http://web.stanford.edu/~vcs/papers/CiSE2012-LMS.pdf
Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method.
Repeatability is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory.
The biotech company Amgen had a team of about 100 scientists trying to reproduce the findings of 53 “landmark” articles in cancer research published by reputable labs in top journals.
Only 6 of the 53 studies were reproduced (about 11%).
Scientists at the pharmaceutical company, Bayer, examined 67 target-validation projects in oncology, women’s health, and cardiovascular medicine.
Published results were reproduced in only 14 out of 67 projects (about 21%).
The PsychFileDrawer project, dedicated to replicating published articles in experimental psychology, shows a replication rate of 3 out of 9 (33%) so far.
For some years now, scientists have gotten increasingly worried about replication failures. In one recent example, NASA made a headline-grabbing announcement in 2010 that scientists had found bacteria that could live on arsenic—a finding that would require biology textbooks to be rewritten. At the time, many experts condemned the paper as a poor piece of science that shouldn’t have been published. This July, two teams of scientists reported that they couldn’t replicate the results.
The White House shared their Open Access Mandate this past year:
That’s why, in a policy memorandum released today, OSTP Director John Holdren has directed Federal agencies with more than $100M in R&D expenditures to develop plans to make the published results of federally funded research freely available to the public within one year of publication and requiring researchers to better account for and manage the digital data resulting from federally funded scientific research.
In 2011, the National Science Foundation started requiring a "Data Management Plan" with every grant submission.
The requirements were purposely left vague for each discipline. Over time the requirements for disciplines will emerge.
Journals are making policies, such as the Data Policy from PLoS One:
Publication is conditional upon the agreement of the authors to make freely available any materials and information described in their publication that may be reasonably requested by others.
--- http://www.plosone.org/static/policies.action#sharing
Science and the Proceedings of the National Academy of Sciences (PNAS) have made data and code disclosure a requirement for publication.
The journal Biostatistics, for which I am an associate editor, has implemented a policy for encouraging authors of accepted papers to make their work reproducible by others.
--- Roger D. Peng, Reproducible Research in Computational Science, Science 2 December 2011, Vol. 334 no. 6060 pp. 1226-1227, DOI: 10.1126/science.1213847
The principal goal of these discussions and workshops is to develop publication standards akin to both the proof in mathematics and the deductive sciences, and the detailed descriptive protocols in the empirical sciences (the “methods” section of a paper describing the mechanics of the controlled experiment and hypothesis test). Computational science is only a few decades old and must develop similar standards, so that other researchers in the field can independently verify published results.
--- http://web.stanford.edu/~vcs/papers/CiSE2012-LMS.pdf
For Every Result, Keep Track of How It Was Produced
Avoid Manual Data Manipulation Steps
See "Spreadsheet Errors Cost Billions": http://www.cnbc.com/id/100923538
Interactive programs should always be able to save their state so they can restart. Otherwise, dependence on an interactive program can be a form of slavery (nonreproducible research).
--- http://sepwww.stanford.edu/sep/jon/reproducible.html
Archive the Exact Versions of All External Programs Used
Version Control All Custom Scripts
Record All Intermediate Results, When Possible in Standardized Formats
For Analyses That Include Randomness, Note Underlying Random Seeds
Always Store Raw Data behind Plots
Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
Connect Textual Statements to Underlying Results
Provide Public Access to Scripts, Runs, and Results
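Several of the rules above (tracking how a result was produced, noting versions, noting random seeds) can be followed with very little code. A minimal sketch using only the Python standard library; `run_analysis` is a hypothetical stand-in for a real analysis and the seed value is arbitrary:

```python
import json
import platform
import random
import sys


def run_analysis(seed):
    """Stand-in for a real analysis: the mean of 1000 pseudo-random draws."""
    rng = random.Random(seed)  # dedicated generator, seeded explicitly
    draws = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(draws) / len(draws)


def provenance(seed):
    """Collect the details needed to rerun the analysis later."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "command": sys.argv,
    }


seed = 20140101  # arbitrary, but written down
result = run_analysis(seed)

# Keep the result and the record of how it was produced together.
record = {"result": result, "provenance": provenance(seed)}
serialized = json.dumps(record, indent=2)
print(serialized)

# Rerunning with the recorded seed gives the same number.
assert run_analysis(record["provenance"]["seed"]) == result
```

In practice the serialized record would be saved next to the figure or table it describes, and the versions of any third-party packages would be added to it.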
Source code must be shared just like the "methods" section in a bench scientists' paper. This ensures that others can read your code (and hopefully run it too!)
Data must be shared with adequate metadata and ideally be machine readable.
Open source software inherently provides computational reproducibility.
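Sharing data "with adequate metadata" in a machine-readable form can be as simple as a plain CSV file plus a small JSON sidecar describing the columns. A minimal sketch, assuming made-up measurements and a hypothetical metadata layout (the file names and keys are illustrative, not a standard):

```python
import csv
import json
import tempfile
from pathlib import Path

# Made-up measurements for illustration only.
rows = [
    {"time": 0.00, "knee_angle": 5.1},
    {"time": 0.01, "knee_angle": 5.4},
    {"time": 0.02, "knee_angle": 6.0},
]

# Hypothetical metadata describing each column and its units.
metadata = {
    "description": "Example joint angle trace (synthetic values)",
    "columns": {
        "time": {"units": "s"},
        "knee_angle": {"units": "deg"},
    },
}

with tempfile.TemporaryDirectory() as directory:
    data_path = Path(directory) / "trial.csv"
    meta_path = Path(directory) / "trial.json"

    # Plain CSV is readable by nearly any analysis tool.
    with open(data_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["time", "knee_angle"])
        writer.writeheader()
        writer.writerows(rows)

    # The sidecar file carries what the numbers alone cannot say.
    meta_path.write_text(json.dumps(metadata, indent=2))

    # A reader can recover both the values and their units.
    with open(data_path, newline="") as f:
        loaded = [{key: float(value) for key, value in row.items()}
                  for row in csv.DictReader(f)]
    loaded_meta = json.loads(meta_path.read_text())

print(loaded[0], loaded_meta["columns"]["knee_angle"]["units"])
```

A temporary directory is used here only so the sketch leaves no files behind; real data would go in a permanent, public location.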
efficient array manipulation, i.e. vectorized operations
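from numpy import matmul
from numpy.random import random
# one million independent 3x4 and 4x3 matrices (the shape must be integers)
left_matrices = random((1000000, 3, 4))
right_matrices = random((1000000, 4, 3))
# matmul broadcasts over the leading axis, giving one million 3x3 products
%timeit products = matmul(left_matrices, right_matrices)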
1 loops, best of 3: 95.3 ms per loop
common scientific algorithms: interpolation, integration, signal processing, linear algebra, optimization, sparse matrices, etc.
from numpy import array
from scipy.optimize import minimize
def rosen(x):
    """The Rosenbrock function"""
    return sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)
x0 = array([1.3, 0.7, 0.8, 1.9, 1.2])
# 'xatol' is the current name of the Nelder-Mead tolerance formerly called 'xtol'
res = minimize(rosen, x0, method='nelder-mead',
               options={'xatol': 1e-8, 'disp': True})
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 339
         Function evaluations: 571
[ 1.  1.  1.  1.  1.]
Symbolic mathematics
from sympy import symbols, solve, init_printing, integrate
a, b, c, x = symbols('a, b, c, x')
f = a * x**2 + b * x + c
solve(f, x)
integrate(f, x)
Data munging
from pandas import date_range, DataFrame
dates = date_range('20130101', periods=6)
df = DataFrame(random((6, 4)), index=dates, columns=list('ABCD'))
df.describe()
      | A | B | C | D |
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.538045 | 0.438604 | 0.454992 | 0.436014 |
std | 0.224755 | 0.250052 | 0.181900 | 0.353036 |
min | 0.172686 | 0.052978 | 0.223056 | 0.042327 |
25% | 0.440082 | 0.348858 | 0.306948 | 0.242781 |
50% | 0.559414 | 0.408932 | 0.495694 | 0.311582 |
75% | 0.717056 | 0.612999 | 0.596418 | 0.630892 |
max | 0.766848 | 0.750929 | 0.641485 | 0.993420 |
Statistical Computing and Graphics
%load_ext rpy2.ipython
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$gear <- factor(mtcars$gear, levels=c(3,4,5))
mtcars$am <- factor(mtcars$am, levels=c(0,1))
mtcars$cyl <- factor(mtcars$cyl, levels=c(4,6,8))
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
      main="Distribution of Gas Mileage", xlab="Miles Per Gallon")
x <- c(1,2,3,4,5,6)
y <- x^2
lm_1 <- lm(y ~ x)
par(mfrow=c(2, 2))
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -9.333        7.000
Open source Matlab clone
%load_ext oct2py.ipython
[x, y] = meshgrid(0:0.1:3);
r = sin(x - 0.5).^2 + cos(y - 0.5).^2;
surf(x, y, r);
\title{Sweave Example 1}
\author{Friedrich Leisch}
In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
kruskal.test(Ozone ~ Month, data = airquality)
which shows that the location parameter of the Ozone
distribution varies significantly from month to month. Finally we
include a boxplot of the data:
boxplot(Ozone ~ Month, data = airquality)
What do we do with the data?
We have something!
Ton was an early leader in sharing:
A hosting service for biomechanics-related projects. Includes a project page and version control for source code.
Journals are starting to accept papers strictly about data.
Strive for reproducible work
Share your source!
Share your data!
Ask for reproducibility when reviewing journal articles
Ask journals to support source code and data sharing practices
Encourage your students to create reproducible work
