What Happens When Researchers ‘Clean’ Data?

Data preparation is a crucial but often overlooked step in research, and it can drastically change results. To keep research useful, we need better standards and documentation.

Many economists have likely heard an old joke, told by researchers outside the field, about three scientists stranded on a desert island with a single can of food. The physicist suggests using leverage to pry the top off the can. The chemist suggests building a fire to heat the can until the lid blows off. They both look at the economist, who says, “Let’s assume a can opener.”

When working with non-experimentally collected information, researchers must frequently make assumptions about how to process, clean and model their data.

Researchers are free to pursue their own interests!

“Researcher degrees of freedom” is the term for the choices researchers make during the research process, both stated and unstated. The flexibility to take different decision paths when faced with the same data can lead to different results even when the underlying data are identical.

Degrees of Discrepancy

"Researcher degrees of freedom" is a term used to describe the decisions made during the research process, which can be either explicit or implicit. Additionally, various researchers could reasonably choose various decision paths when presented with the same data. This level of elasticity, though, can lead to various researchers yielding results that are extremely dissimilar from each other despite using the same data.

The issue with observational data is the sheer number of potential forks in the road, which forces researchers to make many decisions independently of one another. At scale, the accumulation of choices that researchers rarely think twice about, and hardly ever document, has helped precipitate the current replication crisis across the social sciences.

The problem has been exacerbated by the rapid expansion, over the last 15 years, of access to common data sources such as the Census Bureau, the Bureau of Labor Statistics, and the Federal Reserve. Without solid guidance, researchers must make many independent assumptions at every stage of the research process, which reduces the number of generalizable empirical insights.

Framing the Problem

Nick Huntington-Klein and his colleagues found that researcher degrees of freedom had a significant impact on conclusions drawn from empirical economic analyses.

Huntington-Klein and his colleagues gave data from two previously published economic studies to seven different replicators. They framed the research questions so that the replicators would answer the same questions the published works addressed, but in a way that prevented the replicators from identifying the published studies from the data.

The study found that the replicators’ processing and cleaning of externally generated data led to significant discrepancies in outcomes. Replicators did not report the same sample size, sign, or estimate size; moreover, the standard deviation of estimates across replicators was three to four times as large as the standard error each replicator reported. That last result indicates that variation in researcher decisions, choices that would likely have gone undocumented and therefore been invisible to peer reviewers, was the culprit behind the huge variation in outcomes.
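To make that comparison concrete, here is a minimal sketch, with entirely hypothetical numbers, of the diagnostic described above: the spread of point estimates across replicators set against the statistical uncertainty each replicator reports for their own estimate.

```python
import statistics

# Hypothetical point estimates and standard errors from seven replicators.
estimates = [0.12, 0.31, -0.05, 0.27, 0.18, 0.40, 0.02]
standard_errors = [0.04, 0.05, 0.06, 0.04, 0.05, 0.07, 0.05]

# Spread of the point estimates across replicators.
between_replicator_sd = statistics.stdev(estimates)
# Typical statistical uncertainty reported by a single replicator.
typical_se = statistics.median(standard_errors)

# A ratio well above 1 suggests that hidden researcher decisions, not
# sampling noise, are driving the disagreement between replicators.
print(f"spread across replicators / typical standard error = "
      f"{between_replicator_sd / typical_se:.1f}")
```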

Another team, led by Uri Simonsohn, has argued that researcher degrees of freedom come from two main sources: ambiguity surrounding best practices for data decisions and researchers’ desire to publish “statistically significant” results. Simonsohn and colleagues (2011) give 30 examples of how published articles decided which reaction-time observations constitute outliers and how to deal with them. Although the articles faced similar parameters, their choices varied widely.

No individual researcher’s decisions were incorrect; rather, the ambiguity of outlier treatment led to radically divergent results. And because almost any decision was seemingly justifiable, researchers had a direct incentive to make the choices that produced the most eye-catching results.

Theoretical Variance

The impact of researcher degrees of freedom is not confined to the realm of empirics. Some years ago, two researchers published a paper that purportedly demonstrated that economists did not fully grasp the concept of opportunity cost (the principle that the cost of a given activity is the foregone benefit of the next-best alternative), a fundamental and supposedly straightforward concept in economic decision making. The textbook question the researchers posed to 199 economists was as follows:

You have a free ticket to an Eric Clapton concert, but you would rather see Bob Dylan, who performs the same night. A Dylan ticket costs $40, though you would be willing to pay up to $50 for it. What is the opportunity cost of attending the Clapton concert for free: $0, $10, $40, or $50? The textbook answer is $10: the $50 benefit of seeing Dylan minus the $40 cost of his ticket is what you give up by going to see Clapton.

A rebuttal article showed that, because there is no single settled definition of “opportunity cost,” it is difficult to say which of the four answer choices is correct: people can reasonably disagree about what counts as a cost and what counts as a benefit.

Under one alternative reading, for instance, the opportunity cost of attending the Dylan concert is the value of the next-best alternative use of the resources required to attend, which include the ticket price, travel costs, and time.
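As a rough illustration of that ambiguity, the sketch below (using only the figures from the question above) shows how each of the four answer choices follows from a defensible definition of cost and benefit.

```python
# Figures from the question above.
dylan_ticket_price = 40        # dollars
dylan_willingness_to_pay = 50  # dollars

# Four defensible readings of "the opportunity cost of seeing Clapton".
interpretations = {
    "ignore the foregone alternative entirely": 0,
    "net benefit foregone: willingness to pay minus ticket price":
        dylan_willingness_to_pay - dylan_ticket_price,
    "out-of-pocket expense avoided: the Dylan ticket": dylan_ticket_price,
    "gross value placed on the alternative": dylan_willingness_to_pay,
}

for definition, cost in interpretations.items():
    print(f"${cost:>2} if opportunity cost means: {definition}")
```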

Unskewing the Results

Given that most researcher-degrees-of-freedom divergences come from ambiguities in the data, possible solutions focus on lessening the load on researchers by making these decisions for them, and on mandating that researchers be more explicit about the decisions they still have to make. Huntington-Klein and Simonsohn both propose that, to be thorough, researchers should document all constructed variables in data appendices, even if those variables are ultimately unused.

Researchers should likewise document in an appendix every data exclusion, every modeling decision that produced a null result, and every failed manipulation. Readers need a fuller view of the messiness of data processing and estimation, even if it makes otherwise sharp results look less clean.
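One low-effort way to make such an appendix feasible is to log each decision as it is made. The sketch below is a minimal illustration of the idea, not any journal’s actual standard; the exclusion rule and the reaction-time figures are hypothetical.

```python
import json

decision_log = []

def log_decision(step, rule, rows_before, rows_after):
    """Record one cleaning or exclusion decision for the data appendix."""
    decision_log.append({
        "step": step,
        "rule": rule,
        "rows_dropped": rows_before - rows_after,
    })

# Hypothetical example: an outlier screen applied to reaction times (ms).
reaction_times = [320, 450, 380, 2950, 410, 15, 390]
kept = [t for t in reaction_times if 100 <= t <= 2000]
log_decision("outlier screen",
             "drop reaction times outside 100-2000 ms",
             len(reaction_times), len(kept))

# The accumulated log can be pasted directly into an appendix.
print(json.dumps(decision_log, indent=2))
```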

When many researchers draw on the same data sources, as is common in non-experimental research, pre-processing those data centrally or publishing a best-practices guidebook for their use can eliminate many of the decisions individual researchers would otherwise have to make on their own.

The National Bureau of Economic Research’s (NBER) standard merging process for the Merged Outgoing Rotation Group (MORG) files from the Census Bureau’s Current Population Survey (CPS) is a good example of this kind of measure. By making the file-matching process and all of its assumptions uniform, it eliminates researcher-to-researcher variance in how CPS data are combined, producing more uniform results. Pre-processing common data sources with standard code or guidebooks is an effective way to avoid ambiguity in data processing.
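A hedged sketch of what such standardized pre-processing code can look like is shown below; the column names, the rules, and the top-code threshold are hypothetical stand-ins, not the actual NBER MORG procedure.

```python
def preprocess_monthly_extract(rows):
    """Apply one fixed, documented set of cleaning rules to raw survey rows."""
    cleaned = []
    for row in rows:
        # Documented rule 1: drop records with no reported earnings.
        if row.get("weekly_earnings") is None:
            continue
        # Documented rule 2: top-code implausibly high earnings at a
        # fixed threshold so every user treats them identically.
        earnings = min(row["weekly_earnings"], 2885.0)
        cleaned.append({**row, "weekly_earnings": earnings})
    return cleaned

sample = [
    {"id": 1, "weekly_earnings": 950.0},
    {"id": 2, "weekly_earnings": None},
    {"id": 3, "weekly_earnings": 12000.0},
]
print(preprocess_monthly_extract(sample))
```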

Averaging together disparate estimates that address the same question with identical data can also help mitigate the estimation noise that arises from the garden of forking paths. Ensemble or model-averaging methods let us place more confidence in an estimate drawn from multiple strains of research than in any single analyst’s number, precisely because they dilute the noise created by researcher flexibility in data decision-making.
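A minimal sketch of that idea, reusing the hypothetical replicator numbers from earlier: a simple or precision-weighted average of the individual estimates damps the noise contributed by any one analyst’s data decisions.

```python
# Hypothetical estimates and standard errors from seven replicators.
estimates = [0.12, 0.31, -0.05, 0.27, 0.18, 0.40, 0.02]
standard_errors = [0.04, 0.05, 0.06, 0.04, 0.05, 0.07, 0.05]

# Simple (unweighted) ensemble average.
simple_average = sum(estimates) / len(estimates)

# Precision-weighted average: estimates with smaller standard errors count more.
weights = [1 / se ** 2 for se in standard_errors]
weighted_average = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

print(f"simple average of replicators' estimates: {simple_average:.3f}")
print(f"precision-weighted ensemble average:      {weighted_average:.3f}")
```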

The "noise audit" is a process where expert decision makers are asked to make decisions on a pre-selected scenario multiple times, in order to find any inconsistencies. This audit evaluates how well each person's average decision agrees with the overall average of all decision makers, as well as how closely each decision maker's personal choices compare to his or her own average. If the circumstances are nearly identical, does the decision maker make the same or similar choices?

The world is complex and uncertain, which makes judgement difficult, especially when the data come from uncontrolled real-world processes. Noise, however, can be discovered and reduced by following set rules and guidelines such as those the authors put forth.

Kahneman, Sibony, and Sunstein warn that too much inconsistency (noise) damages the credibility of a system, and when it comes to empirical research that warning certainly applies. Transparency, standardization, and, to a lesser extent, aggregation can offset the noise introduced by the assumptions researchers make during the research process.
