Reproducible Scientific Computing with Open Software and Open Data

If I have seen further it is by standing on the shoulders of giants.

-- Isaac Newton

Jason K. Moore
September 17, 2014
Human Motion and Control Seminar
Cleveland State University

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

-- Jonathan Buckheit and David Donoho, paraphrasing Jon Claerbout, 1995

While currently there is unilateral emphasis on 'first' discoveries, there should be as much emphasis on replication of discoveries.

-- John P. A. Ioannidis

Imagine a Future where...

papers are not static views of the end state of a research conclusion, but interactive documents where the reader can become the analyst.

selecting any figure in a scientific paper brings up an interactive view of it.

extracting a working software environment and data from every article, report, and book is as simple as clicking a hyperlink.

all data from a discipline is accessible in a public database in perpetuity.

I want a future like this

I need some human motion walking data. I know tons of people have collected data on human motion. I've seen tons of papers and there are gait labs all over the world now.

Oh! Look! There is a website called http://humanmotiondata.org. It looks like I just have to type in this query...

where (treadmill is True) AND (20 < age < 40) AND (1.0 < speed < 2.0) AND (num_markers > 30) AND (force_plate is dual)

Querying results...please wait.
Done.
You have found 30,735 trials of human motion data, download C3D formatted data at this link http://humanmotiondata.org/your-awesome-data-set.zip

Copying is "free" with digital objects

Software and Data have a Magic Property:

Copying is extremely low cost and in many cases instantaneous.

Currently, data is hoarded by its collectors because the results are treated as the only thing of value. Academia does not value good data collectors the way it does good analysts.

Questions

How much time do you spend developing code to get your results?

How much time do you spend collecting data?

Should we let other scientists "stand on our shoulders"?

Should other scientists have to reinvent the wheel?

Have you ever desired to have the data that produced a figure in a paper?

So much so that you wrote software to extract the data from the image?

http://arohatgi.info/WebPlotDigitizer/

In [1]:
from IPython.display import YouTubeVideo

YouTubeVideo('G8KPVVFrb6U')
Out[1]:

Facts

Most data sees one use and is then locked away in lab notebooks and on old hard drives.

Most scientists think of programming as a tax they have to pay in order to do science.

Many scientists do not prioritize reproducibility, as it is perceived as a hurdle to productivity.

Respondents reported that the single biggest barrier to sharing code and data was the time it takes to clean up and document the work to prepare it for release and reuse (56 percent of respondents cited this reason for not sharing data and 78 percent cited this reason for not sharing code).

--- http://web.stanford.edu/~vcs/papers/CiSE2012-LMS.pdf

What is reproducibility?

Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method.

http://en.wikipedia.org/wiki/Reproducibility

What is repeatability?

Repeatability is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory.

Is science actually reproducible?

ref

The biotech company Amgen had a team of about 100 scientists trying to reproduce the findings of 53 “landmark” articles in cancer research published by reputable labs in top journals.

Only 6 of the 53 studies were reproduced (about 11%).

http://www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328

Scientists at the pharmaceutical company Bayer examined 67 target-validation projects in oncology, women’s health, and cardiovascular medicine.

Published results were reproduced in only 14 out of 67 projects (about 21%).

http://blogs.nature.com/news/2011/09/reliability_of_new_drug_target.html

The project PsychFileDrawer, dedicated to replication of published articles in experimental psychology, shows a replication rate of 3 out of 9 (33%) so far.

http://www.psychfiledrawer.org/view_article_list.php

2010 Arsenic Based Life Controversy

For some years now, scientists have gotten increasingly worried about replication failures. In one recent example, NASA made a headline-grabbing announcement in 2010 that scientists had found bacteria that could live on arsenic—a finding that would require biology textbooks to be rewritten. At the time, many experts condemned the paper as a poor piece of science that shouldn’t have been published. This July, two teams of scientists reported that they couldn’t replicate the results.

http://www.slate.com/articles/health_and_science/science/2012/08/reproducing_scientific_studies_a_good_housekeeping_seal_of_approval_.html

http://en.wikipedia.org/wiki/Felisa_Wolfe-Simon#Controversy

http://retractionwatch.com/

Policy

The White House

The White House released its Open Access Mandate this past year:

That’s why, in a policy memorandum released today, OSTP Director John Holdren has directed Federal agencies with more than $100M in R&D expenditures to develop plans to make the published results of federally funded research freely available to the public within one year of publication and requiring researchers to better account for and manage the digital data resulting from federally funded scientific research.

The NSF

In 2011, the National Science Foundation started requiring a "Data Management Plan" with every grant submission.

The requirements were purposely left vague for each discipline; over time, discipline-specific standards will emerge.

Journals

are making policies, such as the Data Policy from PLoS One:

Publication is conditional upon the agreement of the authors to make freely available any materials and information described in their publication that may be reasonably requested by others.

--- http://www.plosone.org/static/policies.action#sharing

Science and the Proceedings of the National Academy of Sciences (PNAS) have made data and code disclosure a requirement for publication.

The journal Biostatistics, for which I am an associate editor, has implemented a policy for encouraging authors of accepted papers to make their work reproducible by others.

--- Roger D. Peng, Reproducible Research in Computational Science, Science 2 December 2011, Vol. 334 no. 6060 pp. 1226-1227, DOI: 10.1126/science.1213847

Computational Science Needs Its Methods

The principal goal of these discussions and workshops is to develop publication standards akin to both the proof in mathematics and the deductive sciences, and the detailed descriptive protocols in the empirical sciences (the “methods” section of a paper describing the mechanics of the controlled experiment and hypothesis test). Computational science is only a few decades old and must develop similar standards, so that other researchers in the field can independently verify published results.

--- http://web.stanford.edu/~vcs/papers/CiSE2012-LMS.pdf

The Science Code Manifesto

In [2]:
from IPython.display import HTML

HTML('<iframe src="http://sciencecodemanifesto.org/" width=1000 height=600></iframe>')
Out[2]:

Ten Simple Rules for Reproducible Computational Research

http://dx.doi.org/10.1371/journal.pcbi.1003285

Rule 1

For Every Result, Keep Track of How It Was Produced
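One lightweight way to follow this rule is to write a small provenance record alongside each result. This is a sketch, not a prescription; the function, file names, and fields below are illustrative only:

```python
import json
import platform
import time

def record_provenance(result_name, script, parameters):
    """Write a small JSON record of how a result was produced."""
    record = {'result': result_name,
              'script': script,
              'parameters': parameters,
              'timestamp': time.strftime('%Y-%m-%dT%H:%M:%S'),
              'machine': platform.platform()}
    with open(result_name + '.provenance.json', 'w') as f:
        json.dump(record, f, indent=2)
    return record

# e.g. after generating a figure from analysis.py with a 6 Hz filter cutoff:
rec = record_provenance('figure_3', 'analysis.py', {'cutoff_hz': 6.0})
```

Six months later, the JSON file answers "how was this figure made?" without relying on memory.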

Rule 2

Avoid Manual Data Manipulation Steps

See "Spreadsheet Errors Cost Billions": http://www.cnbc.com/id/100923538

  • 88 percent of all spreadsheets have errors in them
  • 50 percent of spreadsheets used by large companies have material defects

Interactive programs should always be able to save their state so they can restart. Otherwise, dependence on an interactive program can be a form of slavery (nonreproducible research).

--- http://sepwww.stanford.edu/sep/jon/reproducible.html
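The alternative to hand-editing a spreadsheet is a scripted, rerunnable transformation. A minimal sketch with pandas, using a made-up gait-trial table (in practice the data would be read from disk):

```python
import pandas as pd

# Hypothetical trial table; in a real pipeline this would be, e.g.,
# df = pd.read_csv('trials.csv')
df = pd.DataFrame({'subject': [1, 2, 3, 4],
                   'speed': [1.2, 0.4, 1.5, 2.6]})

# Scripted, repeatable filtering instead of deleting rows by hand:
cleaned = df[(df['speed'] > 0.5) & (df['speed'] < 2.0)]
cleaned.to_csv('trials_cleaned.csv', index=False)
```

Every exclusion criterion is now recorded in code, so the cleaning step can be rerun (and audited) at any time.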

Rule 3

Archive the Exact Versions of All External Programs Used
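Tools like `pip freeze` or `conda env export` do this for whole environments; a minimal sketch of recording the interpreter and key library versions next to your results:

```python
import sys

import numpy
import scipy

# Record the interpreter and key library versions used for this analysis.
env = {'python': sys.version.split()[0],
       'numpy': numpy.__version__,
       'scipy': scipy.__version__}

with open('environment.txt', 'w') as f:
    for name, version in sorted(env.items()):
        f.write('{0}=={1}\n'.format(name, version))
```

Archiving this file with the results lets someone rebuild a compatible environment later.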

Rule 4

Version Control All Custom Scripts

Rule 5

Record All Intermediate Results, When Possible in Standardized Formats
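For example, an intermediate array from a longer pipeline can be written to plain-text CSV with a header, rather than a program-specific binary format. The signal below is a stand-in for a real intermediate result:

```python
import numpy as np

# Hypothetical intermediate result (e.g. a filtered signal).
time = np.linspace(0.0, 1.0, 101)
signal = np.sin(2.0 * np.pi * 5.0 * time)

# Store it in a plain-text, self-describing format.
np.savetxt('filtered_signal.csv',
           np.column_stack((time, signal)),
           delimiter=',', header='time_s,amplitude', comments='')

# Anyone (or any program) can reload it later:
data = np.loadtxt('filtered_signal.csv', delimiter=',', skiprows=1)
```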

Rule 6

For Analyses That Include Randomness, Note Underlying Random Seeds
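With NumPy this is a one-liner: fix the seed, record it, and the "random" numbers become exactly reproducible:

```python
import numpy as np

SEED = 8675309  # record this value with your results

rng = np.random.RandomState(SEED)
sample_a = rng.normal(size=5)

# Re-seeding reproduces exactly the same draws:
rng = np.random.RandomState(SEED)
sample_b = rng.normal(size=5)
```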

Rule 7

Always Store Raw Data behind Plots
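A simple sketch: save the exact arrays behind each figure next to the image file, so the plot can be regenerated (or the data reused) without re-running the whole analysis. The curve below is a placeholder:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# Placeholder for real results: a damped oscillation.
x = np.linspace(0.0, 10.0, 200)
y = np.exp(-0.3 * x) * np.cos(2.0 * x)

fig, ax = plt.subplots()
ax.plot(x, y)
fig.savefig('decay.png')

# Store the exact arrays behind the figure alongside the image:
np.savez('decay_plot_data.npz', x=x, y=y)
loaded = np.load('decay_plot_data.npz')
```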

Rule 8

Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

Rule 9

Connect Textual Statements to Underlying Results

Rule 10

Provide Public Access to Scripts, Runs, and Results

Two main ingredients of computational reproducibility

1

Source code must be shared, just like the "methods" section in a bench scientist's paper. This ensures that others can read your code (and hopefully run it too!).

2

Data must be shared with adequate metadata and ideally be machine readable.
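A minimal sketch of the second ingredient: ship a small machine-readable metadata file alongside the data itself. The field names here are illustrative, not any standard:

```python
import json

import numpy as np

# Hypothetical force plate samples, saved as plain CSV.
forces = np.random.random((100, 3))
np.savetxt('forces.csv', forces, delimiter=',')

# Machine-readable metadata stored next to the data file.
metadata = {'units': 'newtons',
            'sample_rate_hz': 100,
            'columns': ['Fx', 'Fy', 'Fz'],
            'collected': '2014-09-17'}
with open('forces.json', 'w') as f:
    json.dump(metadata, f, indent=2)
```

A reader's script can then load both files and interpret the numbers without emailing the authors.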

Open Source Software For Science

Open source software makes computational reproducibility possible: anyone can obtain, inspect, and run the code.

Python and the SciPy stack

NumPy

efficient array manipulation, i.e. vectorized operations

In [3]:
from numpy.random import random
from numpy.core.umath_tests import matrix_multiply

left_matrices = random((1000000, 3, 4))
right_matrices = random((1000000, 4, 3))

%timeit products = matrix_multiply(left_matrices, right_matrices)
1 loops, best of 3: 95.3 ms per loop

SciPy

common scientific algorithms: interpolation, integration, signal processing, linear algebra, optimization, sparse matrices, etc.

In [4]:
from numpy import array
from scipy.optimize import minimize

def rosen(x):
    """The Rosenbrock function"""
    return sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

x0 = array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='nelder-mead',
               options={'xtol': 1e-8, 'disp': True})

print(res.x)
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 339
         Function evaluations: 571
[ 1.  1.  1.  1.  1.]

SymPy

Symbolic mathematics

In [5]:
from sympy import symbols, solve, init_printing, integrate

init_printing()

a, b, c, x = symbols('a, b, c, x')

f = a * x**2 + b * x + c

solve(f, x)
Out[5]:
$$\begin{bmatrix}\frac{1}{2 a} \left(- b + \sqrt{- 4 a c + b^{2}}\right), & - \frac{1}{2 a} \left(b + \sqrt{- 4 a c + b^{2}}\right)\end{bmatrix}$$
In [6]:
integrate(f, x)
Out[6]:
$$\frac{a x^{3}}{3} + \frac{b x^{2}}{2} + c x$$

Pandas

Data munging

In [7]:
from pandas import date_range, DataFrame

dates = date_range('20130101', periods=6)

df = DataFrame(random((6, 4)), index=dates, columns=list('ABCD'))

df.describe()
Out[7]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.538045 0.438604 0.454992 0.436014
std 0.224755 0.250052 0.181900 0.353036
min 0.172686 0.052978 0.223056 0.042327
25% 0.440082 0.348858 0.306948 0.242781
50% 0.559414 0.408932 0.495694 0.311582
75% 0.717056 0.612999 0.596418 0.630892
max 0.766848 0.750929 0.641485 0.993420

R

Statistical Computing and Graphics

http://www.r-project.org/

In [8]:
%load_ext rpy2.ipython

Data Frames

In [9]:
%%R

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Beautiful Plots

In [10]:
%%R

library(ggplot2)

mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5),
   labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am,levels=c(0,1),
   labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8),
   labels=c("4cyl","6cyl","8cyl"))

qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),
   main="Distribution of Gas Mileage", xlab="Miles Per Gallon",
   ylab="Density")

Regression

In [11]:
%%R

library(ggplot2)

x <- c(1,2,3,4,5,6)
y <- x^2
lm_1 <- lm(y ~ x)
print(lm_1)
summary(lm_1)
par(mfrow=c(2, 2))
plot(lm_1)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
     -9.333        7.000  


Octave

Open source MATLAB clone

In [12]:
%load_ext oct2py.ipython
In [13]:
%%octave

[x, y] = meshgrid(0:0.1:3);
r = sin(x - 0.5).^2 + cos(y - 0.5).^2;
surf(x, y, r);

Julia

In [14]:
HTML('<iframe src="http://julialang.org/" width=800 height=600></iframe>')
Out[14]:

Examples

Versioned library of functions available in a source repository

http://f1000research.com/articles/3-223/v1

http://github.com/chrisdembia/yeadon

https://zenodo.org/record/11579#.VBmR1K2dSCQ

Literate Programming

Before

\documentclass[a4paper]{article}

\title{Sweave Example 1}
\author{Friedrich Leisch}

\begin{document}

\maketitle

In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:

<<>>=
data(airquality)
library(ctest)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone 
distribution varies significantly from month to month. Finally we
include a boxplot of the data:

\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}

\end{document}

After

http://www.stat.uni-muenchen.de/~leisch/Sweave/example-1.pdf

Data

What do we do with the data?

Depends on the data.

  • Genomicists have Genbank
  • Ecologists have Dryad

But catch-all solutions exist.

Human Motion Source Code and Data

We have something!

Ton was an early leader in sharing:

In [15]:
HTML('<iframe src="http://isbweb.org/software/movanal.html" width=800 height=600></iframe>')
Out[15]:

SimTK

A hosting service for biomechanics-related projects. Includes project pages and version control for source code.

In [16]:
HTML('<iframe src="https://simtk.org/xml/index.xml" width=800 height=600></iframe>')
Out[16]:

Catch-all repositories for data

  • Institutional repositories
  • Figshare and Zenodo, which issue DOIs for data
  • Supplementary materials for journals
  • Your website?

Figshare

http://figshare.com/articles/Bicycle_Rider_Control_Identification/659465

Data Papers

Journals are starting to accept papers strictly about data.

  • F1000Research: 'Data Notes'
  • PLoS One: database papers
  • Biodiversity Data Journal
  • Open Health Data
  • and more

What can you do?

Strive for reproducible work

Share your source!

Share your data!

Ask for reproducibility when reviewing journal articles

Ask journals to support source code and data sharing practices

Encourage your students to create reproducible work

In [17]:
from IPython.html.widgets import interactive
from IPython.display import display

%matplotlib inline
In [18]:
import matplotlib.pyplot as plt
from numpy.linalg import eig
from dtk import bicycle

p = bicycle.benchmark_parameters()

M, C1, K0, K2 = bicycle.benchmark_par_to_canonical(p)
speeds, As, Bs = bicycle.benchmark_state_space_vs_speed(M, C1, K0, K2, v0=0.0, vf=10.0)
eigenvals, eigenvectors = eig(As)
fig, ax = plt.subplots()
lines = plt.plot(speeds, eigenvals, 'k.')
plt.ylim((-10.0, 10.0))
plt.close(fig)

def plot(v0=0.0, vf=10.0, **parameters):
    M, C1, K0, K2 = bicycle.benchmark_par_to_canonical(parameters)
    speeds, As, Bs = bicycle.benchmark_state_space_vs_speed(M, C1, K0, K2, v0=v0, vf=vf)
    eigenvals, eigenvectors = eig(As)
    for i, line in enumerate(lines):
        line.set_xdata(speeds)
        line.set_ydata(eigenvals[:, i])
    fig.canvas.draw()
    display(fig)
/home/moorepants/anaconda/lib/python2.7/site-packages/numpy/core/numeric.py:462: ComplexWarning: Casting complex values to real discards the imaginary part
  return array(a, dtype, copy=False, order=order)

In [19]:
ranges = {k: (v - 0.5 * abs(v), v + 0.5 * abs(v), 0.1 * abs(v)) for k, v in p.items()}
ranges.update({'v0': (-30.0, 30.0, 1.0), 'vf': (-30.0, 30.0, 1.0)})
w = interactive(plot, **ranges)
display(w)