# Question: What are important aspects to consider about replication on the website and in the community?

liz asked on September 28, 2016 20:50

Hi folks! I'm posting this as a question because in conversation over the past week, many people have already contributed many perspectives on this topic, ranging from the scientific validity side, to the education and onboarding experience, the design of the web interface, and more. Please add your perspective as an "answer"! Thank you.

• Also, please let me know if you'd like to book club Chapter Two "The Idea of Replication" in Harry Collins' book Changing Order: Replication and Induction in Scientific Practice.

To expand upon Liz's question, we’ve started talking about harnessing the unique power of open, collaborative science: the ability to massively replicate experiments and compare results, allowing us to better assess and communicate experiments’ reproducibility and resultant implications. To facilitate this more organized (and hopefully impactful) engagement, we’ve started asking people to make more detailed research notes, tagged with the type of activity, and to invite others into their research to try to replicate it, so that we can all learn from the process and results. Assessing the outcomes of a replication involves many factors, though. Here, we’re wondering what factors are important to consider in facilitating replications, evaluating those replications, deciding to fork new activities, amassing information from iterated development, and anything else that seems important in terms of inviting and utilizing mass open science!

The first thing that comes to my mind when thinking about inviting or evaluating replication is that what constitutes “replication” is different for different kinds of activities: for a tool build, the verification steps can demonstrate whether or not the build replication was successful; for a lab experiment, you need to construct the same conditions and follow the same steps; for a field test, you need to follow the same steps, and you can try to find similar conditions, but this is the real world, so it’s not going to be an exact replication, and it’s extremely important to document the conditions so that the way different conditions affect the results can be elucidated. So, we may need to be clearer about what exactly “replication” means for various situations. Ideas?

Another thing that comes to my mind in terms of evaluating replications is that there is a difference between success in terms of someone clearly articulating steps and someone else being able to correctly follow them, versus people successfully following the same steps but finding different results (which is successful in that it can tell us something about precision or the influence of various conditions, but could also be considered failure in that the original data wasn’t reproducible), and success in terms of being able to accomplish what you had hoped. Does anyone have ideas about how we should distinguish these? Can we come up with evaluation rubrics? Also, who should determine what is deemed successful? I’m personally prone to having the person who conducted the activity (be it initial or replication) assess its successfulness, but maybe there is a more systematic way? Or, that person assesses its successfulness using a standard rubric that we collectively create? I think that this point -- about whose duty it is to assess outcomes -- is pretty important when we consider social dynamics and democratization of science. What do people think?

I have a few other ideas for how replication could work:

• @mathew suggested: ways to encourage people to post "partway done" work, to ask for help and post their work so far
• how (in terms of interface) to mark something with success/failure: a prompt that creates a replication:success vs. replication:unsuccessful tag?
• echoing Gretchen, above -- who gets to mark it as such (author or replicator)
• how success condition is described/tested/debated -- and how the activity can be refined to have a narrower and more clearly articulated "success condition"
• perhaps a "mini-quiz" which replicators fill out, which helps establish and (through posting links to raw data) support a claimed "success"
• also via @gretchengehrke -- a positive, discursive forum will be important to reach consensus on "success"

And one thought I had -- are people submitting replications incentivized to say that they've succeeded, or activity authors to agree? Or are activity authors incentivized to say that replications have failed? It probably depends a lot on the situation, but I could imagine these playing out this way...

Some stray thoughts copied and pasted from an email exchange with Liz:

Theorizing replication is a cottage industry within STS. Collins deals with it in terms of reconstructing the TEA Laser in Changing Order. I think that's the most relevant case for Public Lab, because it deals with replicating the assembly of a piece of technical lab equipment (rather than replicating research results). He also deals with replication as a sort of red herring in his introductory chapters. Everyone talks about it, no one ever actually does it for most studies. And what does it mean to replicate? You can never perfectly replicate a study. The beakers were a little different. The lab was the wrong temperature. The moon was in the seventh house instead of the sixth and Jupiter had yet to align with Mars. And so the results are off. But are they off because the original results were invalid or because the replicator failed to sufficiently create 1:1 conditions for the second test? Who can say? Not the experiment itself. And so we get what Harry Collins calls "the experimenter's regress." When there's controversy over findings in this way, at some point there are social dynamics that take over from experimental protocols to end the infinite regress.

So, not only is replication rarely ever actually used as a litmus test for good science (why would I spend all that time and money trying to perform the same operations you're describing in your journal article? What glory is in it for me? And who's going to pay for it!?), but when replication results in controversy, the regress sets in.

Then there's a whole history of the construction of replication as a standard in science, which Steven Shapin focuses on in much of his work (beginning with Leviathan and the Air-Pump).

Once I read Collins' treatment of replication I kind of stopped paying attention to the topic. Sort of an intellectual dead end if you buy into the concept of the experimenter's regress. Worth noting all the psychological studies that have recently been "debunked" after failures to replicate the same findings. Like... duh, right? If you can use minor variations in highly controllable experiments in the biophysical sciences to ignite the experimenter's regress, then the near infinity of uncontrollable variables in human behavior make it really easy to claim that later researchers just failed to adequately duplicate the conditions of the first study. As if we could clone and copy the psyches of the people in those first studies to use them again for replication.

Or, yeah, the studies were just bad science to begin with. But that's a different story.

Replication (reproducibility) in scientific studies has become a topic of conversation because:

1. Recent investigations suggest that few researchers ever try to replicate studies (reproduce the results) and when they do they often fail to get the same result. Most of the publicized examples of this type of failure are in the pharmaceutical and medical fields (where this outcome should concern us).
2. Sociologists noticed that published papers sometimes fail to include enough information to allow others to replicate a study (reproduce the study’s results). In one famous case, new researchers had to work closely with the original team to get the same result. For some reason, sociologists thought this was important.

The type of studies referred to above generally involve experiments. Precise measurements were made in carefully manipulated environments where many variables were controlled so that they could not confuse the outcome. It should be possible to replicate these studies (reproduce the results) – otherwise the results of the original experiment must be questioned. This is a key component of scientific research.

If you are not doing an experiment, this type of replication (reproducibility) might not be an important part of the process. If the other categories of Public Lab activities are being done (Build, Verify, Observations, Test tool limits, Field test, Monitor your environment) it will be good to see that someone else can do something similar, but the concept of reproducibility might not be applicable. There generally will not be any singular result to reproduce.

There will be exceptions when the original activity specifies a controlled environment, a carefully thought out procedure, and multiple trials including controls, in which case that activity was indeed an experiment. Attempts could be made to reproduce the results of that type of activity, but this might be a very rare situation in Public Lab activities.

In most Public Lab activities, the goal is far less specific: Does the kite fly well? Does your spectrum look like mine? Do the colors in my NDVI image look meaningful? Did the device log data every five minutes? Repeating this type of activity is a repetition, not a replication, and the sociological conversation about reproducibility does not really apply.

It will be good to have a record of multiple people building a certain spectrometer, or doing a careful job calibrating a spectrometer, or getting meaningful NDVI images from an infrared camera, or making a circuit respond to external stimuli. Calling these activities replications might confuse people about why reproducibility is critical in scientific research. Grasping the conceptual importance of reproducibility will come in handy when you finally get down to doing an experiment to see if your technique can identify environmental contamination. If nobody else can reproduce your results, your technique will not be useful.

When the scope of activities includes things that are experiments and also things that are not experiments, reserve the word experiment for the experiments. When the scope of activities includes reproducible experiments but also other things that are just attempts to repeat a procedure, don’t call the repeats replications. This is not just semantics; it is essential if your goal is to help people understand concepts they might not be familiar with.

The conceptual differences among replication, reproducibility, and repeating a procedure are not trivial. It does not help that the term replication is used two ways – it is a common term for the use of multiple samples or trials in an experiment, and is also used to refer to the reproducibility of an experiment’s results. These three concepts are distinct, and all are important concepts to grasp as you design, implement, and present scientific research. Those familiar with research might be less likely to take your results seriously if these concepts are confused.

Chris

It might be good to agree on definitions of some basic terms. Here is one suggestion.

#### 1. Reproducibility, Reproducing the results of an experiment:

Repeating an experiment and getting a result that leads to the same conclusion. More technically, using the same experimental procedure to test the same hypothesis and coming to the same conclusion about the hypothesis (confirming or rejecting it). Also performing a different, related experiment that produces a result consistent with the conclusion of the original experiment.

This can also be applied to doing something that does not appear to be an experiment. For example, if someone uses a mercury thermometer to measure the temperature of water in an ice bath and gets a result of 32.1 ± 0.8°F (n= 10 measurements), this result could be reproduced using a Riffle and DHT sensor. If the Riffle results are 32.4 ± 1.1°F (n= 10 measurements), then the result has been reproduced (i.e., there is no statistical difference between those two results). Although this appears to be just two measurements and not an experiment, it could be done so that all of the requirements of an experiment are fulfilled:

• A stated hypothesis (e.g., the measurement of ice water temperature is not different from 32°F)
• A procedure appropriate for the system (e.g., lots of ice and water)
• A number of replicates (multiple measurements) which is appropriate to describe the variability of the device (the thermometer) and the parameter (the water temperature).
• An appropriate statistical test.

This would be a very simple experiment, but it is nonetheless an experiment. Therefore, its result should be reproducible. In this sense, the results of simple observations or measurements can be reproduced as long as the series of observations or measurements meets the above requirements and is therefore a bona fide experiment.
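As a rough illustration of the "appropriate statistical test" step, here is a minimal sketch of a Welch two-sample t-test applied to the ice bath numbers above. It assumes the ± values are sample standard deviations (the text does not say), and the function name `welch_t` and the rounded critical value 2.12 (two-sided, 5% level, about 16 degrees of freedom) are choices made for this example only.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom
    computed from summary statistics (mean, standard deviation, n)."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Thermometer: 32.1 +/- 0.8 F (n=10); Riffle: 32.4 +/- 1.1 F (n=10).
# ASSUMPTION: the +/- values are standard deviations.
t_stat, df = welch_t(32.1, 0.8, 10, 32.4, 1.1, 10)

# ASSUMPTION: approximate two-sided 5% critical value for ~16 degrees of freedom.
T_CRIT = 2.12
no_significant_difference = abs(t_stat) < T_CRIT

print(f"t = {t_stat:.2f}, df = {df:.1f}, "
      f"no significant difference: {no_significant_difference}")
```

Under these assumptions |t| comes out near 0.70, well below the critical value, which agrees with the statement that the two results are not statistically different. If the ± values were standard errors or confidence intervals instead, the arithmetic would need to change.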

#### 2. Replication, Replicates:

Multiple units of study (samples, trials, measurements, study plots, days, populations, etc.) which are required to account for the different types of variability in the subject of study and in the method of study.

These replicates (or replicate samples, replicate measurements, etc.) must be collected under the same conditions and in the same way. This type of replication is the basis for all statistical analysis because multiple data points allow the variability in some parameter to be quantified.

Replication must be done at multiple levels depending on the question being asked (i.e., on the hypothesis being tested). For example, if asking the question “Do these two air samples differ in the amount of suspended silica?” then a lab procedure could be done on five replicate subsamples from each air sample. However, if the question is “Do silica mines pollute the air?” then the experiment might require collecting 10 replicate air samples at each of 10 replicate sites near each of 10 replicate mines and also 10 replicate control locations on 10 different replicate days, and then running five replicate lab analyses on each sample. The number of replicates required at each level depends on how much the measured parameter varies at that level and is often not known until the samples are measured.
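To get a feel for how quickly replication at several levels adds up, here is a small arithmetic sketch of the silica mine example. It takes one possible reading of that sentence: 10 sites at each of 10 mines plus 10 separate control locations, every location visited on each of 10 days, 10 air samples per visit, and 5 lab analyses per sample. The function and parameter names are made up for this illustration.

```python
def nested_design_totals(mines, sites_per_mine, control_locations,
                         days, samples_per_visit, analyses_per_sample):
    """Count sampling locations, air samples, and lab analyses for a
    nested design. ASSUMPTION: every location is sampled on every day."""
    locations = mines * sites_per_mine + control_locations
    air_samples = locations * days * samples_per_visit
    lab_analyses = air_samples * analyses_per_sample
    return {
        "locations": locations,
        "air_samples": air_samples,
        "lab_analyses": lab_analyses,
    }

totals = nested_design_totals(
    mines=10, sites_per_mine=10, control_locations=10,
    days=10, samples_per_visit=10, analyses_per_sample=5,
)
print(totals)
```

On this reading the design calls for 110 locations, 11,000 air samples, and 55,000 lab analyses. A different reading of the example (for instance, controls sampled less often) would give different totals, but the multiplication across levels works the same way.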

#### 3. Repetition, Repeating a procedure:

Doing something multiple times.

When there is no way to test whether the outcome of repeating a procedure is the same every time it is done, then the concept of reproducibility does not apply.

##### Building and modifying devices

Following someone else’s instructions to build or modify a device is not reproducing a result unless a test can be done to determine if some predetermined specifications have been met. If such a test is available, then building or modifying a device can become part of an experiment. If the above requirements of an experiment are met, then the test can be used to determine if the build or modification has reproduced the results of the original. Multiple builds might be required because each would be a replicate in the experiment.

In most cases, the requirements of an experiment are not met when people follow instructions to build something, often because there are too many variables to control. However, anything can be part of an experiment if careful planning is done and replication is sufficient to account for the inherent variability.

##### Environmental monitoring

If five people each use their Riffle to monitor the water temperature in a stream near their house, this is repetition. It might be difficult to argue that these activities are part of a single experiment or are replicates or are reproducing a result. It’s just five people more or less repeating the same activity. Additional measurements, restrictions, replications, or controls could allow this type of activity to be part of an experiment, but monitoring by itself is not an experiment and often the results cannot be reproduced (environmental variation can make this difficult).

### Is replication the same as reproducing the result of an experiment?

The answer to this question is messy because the term replication is used to refer both to experimental replicates and to replicate experiments. An experimental replicate is a unit of study (e.g., one of five random study plots) and a replicate experiment is when an entire study is repeated to see if the same result is achieved.

But there are two different types of “replicate” experiments. The obvious one is when every detail of the original experiment is repeated. For several very good reasons, this almost never happens. The only thing you can learn from this is whether the original experiment might have been flawed in some way. When research has been peer reviewed, we assume it is not very flawed, so repeating the exact same experiment rarely tells us anything worthwhile. (The assumption that peer reviewed research is robust apparently does not apply to some fields of research.)

The more common response to earlier research is to do a similar study designed to apply a different test of the greater question at hand. So a new study can be done with some important variables changed. It can be done with a different organism, or at a different place, or at a different time of year. A different technique can be used to measure the important thing in the study, or a different but related thing can be measured. The goal is not to search for flaws, but to see if the results of the new study are consistent with the results of the original study.

For example, consider a study that tested the hypothesis that precipitation events carry nutrients from fertilized fields down the watershed toward streams. Three streams downhill from fertilized fields were sampled every two hours for nitrogen in the water, and precipitation events were recorded. The result was that nitrogen in the water increased significantly two hours after big rain events compared to two hours before rain events. This is consistent with the stated hypothesis that precipitation caused the movement of nutrients from fields toward streams.

Replicating this study just requires copying everything that was done the first time. But reproducing the original result (confirming the hypothesis) can be done with many different studies. For example, phosphorus could be measured instead of nitrogen. Water could be sampled while it is running across the ground instead of after it enters the stream. Isotopically labeled nitrogen can be added to the fields to see if those very molecules can be detected in the streams. Shallow ground water can be extracted from soil in transects from the fields down to the streams and analyzed for nutrients. The study can be repeated in spring instead of fall, or after bigger rain events or after snow melt events. The nutrient content of streamside plants could be measured near streams downhill from fertilized fields and also near streams far from fields.

Unlike the simple approach of repeating the original study, these studies can add important new information while also reproducing the results (confirming the hypothesis) of the original study. These new studies can strengthen our confidence that nutrients (do or don't) move from field to stream, and do so far more effectively than just repeating what the first researcher did. This is what usually happens in science.

This is why the idea of “replicating” a study needs clarification. The more important goal is to “reproduce the result” of a study, not just repeat the study. The term “reproducibility” can refer to this idea of confirming the big hypothesis with a strategically designed new study. That is the primary way that science advances, and the reason that in this context the term “replication” is misleading and should probably be avoided.

#### Summary

• Replicating an experiment: Repeating an experiment to see if an earlier implementation of it was flawed.
• Reproducing the results of an experiment: Performing a similar or related experiment to test the hypothesis confirmed by an earlier experiment and increase our understanding of the study system.