How Data Visualizations Can Be Misleading

Data visualizations can often be misleading.

Burton Malkiel, author of A Random Walk Down Wall Street, devised a method to test the theory that stock price movements are essentially random and unpredictable. Malkiel set an initial price of $50 for a made-up stock and had his students flip a coin every day. If the coin came up heads, the fictitious stock price went up by half a point; if it landed on tails, the price went down by the same amount. Malkiel had the students keep charts of their “stock prices” over the course of his class.
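
The experiment is easy to reproduce. Below is a minimal Python sketch of a coin-flip "stock chart" in the spirit of Malkiel's exercise; the number of trading days and the use of a seeded random generator are assumptions for illustration, not his exact classroom setup.

```python
import random

def simulate_coin_flip_stock(start_price=50.0, days=250, step=0.5, seed=None):
    """Simulate a fictitious stock whose daily move is decided by a coin flip."""
    rng = random.Random(seed)
    prices = [start_price]
    for _ in range(days):
        # Heads: up half a point; tails: down half a point.
        move = step if rng.random() < 0.5 else -step
        prices.append(prices[-1] + move)
    return prices

if __name__ == "__main__":
    chart = simulate_coin_flip_stock(seed=1)
    print(chart[:10])  # the first two weeks of "prices"
```

Plotting the resulting series typically produces runs, trends and "support levels" that look deliberate, even though every move came from a fair coin.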

Chartists, who claimed they could use past histories of stock price movements to predict future prices, gave Malkiel confident opinions that ranged from "Buy right now!" to "Sell right now!" and everything in between. Tellingly, none of the chartists who saw these charts, generated by nothing more than coin flips, recognized them as random noise without a meaningful pattern. They all assumed there must have been a legible system at work.

Malkiel argued that psychology is the most likely explanation for why humans read patterns into random data: people struggle to cope with a lack of order and often see patterns where none exist.

Spurious Correlation

Data visualization can be used to find trends in data, but it is important to be careful when interpreting the results. The availability of data visualization tools has resulted in a corresponding increase in analytical dilettantism.

A spurious correlation arises when two variables appear to have a clear relationship on a graph even though they are causally unrelated. Tyler Vigen runs a website dedicated to visualizing the most ridiculous examples of such correlations. For example, there is no logical connection between the amount of money the United States spends on science, space and technology initiatives and the total number of suicides by hanging, strangulation and suffocation. Yet Vigen found that, over the period from 1999 to 2009, the two series tracked each other closely on a chart and had a correlation coefficient of 99.8 percent.
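
How easily can unrelated series produce a correlation like that? The sketch below (randomly generated numbers, not Vigen's data) builds two independent upward-drifting series and computes their Pearson correlation; the shared trend alone is usually enough to push the coefficient close to 1.

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

def random_walk(n, seed):
    """An upward-drifting random walk; each series uses its own generator."""
    rng = random.Random(seed)
    values = [0.0]
    for _ in range(n - 1):
        values.append(values[-1] + rng.gauss(1.0, 1.0))
    return values

# Two series generated completely independently of each other.
a = random_walk(11, seed=2)
b = random_walk(11, seed=7)

# A high value here reflects a shared trend, not any causal relationship.
print(round(statistics.correlation(a, b), 3))
```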

It is easy to create data visualizations, and there are many options available for data presentation. This can lead researchers to choose the most appealing charts to make their cases.

The Big Picture

A researcher who wants to convince the public that her conclusion is correct has every incentive to present only the most convincing graphical evidence. Publishing all relevant data visualizations instead would allow the most compelling chart to be benchmarked against the different graphical choices, scales and transformations it went through before reaching its final form. This would allow the public to evaluate any claims with a hermeneutic of healthy skepticism.

It wouldn't be unheard of to have such a framework in place: most researchers are already wary of researcher degrees of freedom, which can skew the statistical estimates produced by mathematical models.

Data visualizations, however, are generally presented to a wider and less questioning audience than formal statistical results.

Not Enough Information

Another issue with visualized data is the information that figures leave out. During World War II, the statistician Abraham Wald worked with Columbia University’s Statistical Research Group on a problem posed by the military: how to reduce the number of bombers shot down. The initial suggestion was to reinforce the areas of a plane that showed the most damage when it returned.

Wald realized that what mattered was not where the returning planes had been hit, but where they had not been hit. The bombers that never made it back had most likely taken hits in exactly the areas that showed no damage on the returning planes. Thankfully, Wald used a deep understanding of statistical distributions to correctly identify what the visual evidence on the returning planes was indicating. Without his insight, the military would have wasted resources over-armoring the least vulnerable parts of the aircraft, at the cost of more Allied aircrews' lives.
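
A toy simulation makes the survivorship effect concrete. The section names, hit distribution and "lethality" probabilities below are invented for illustration, not Wald's actual figures; the point is only how conditioning on the planes that return distorts the picture.

```python
import random
from collections import Counter

rng = random.Random(0)
SECTIONS = ["fuselage", "wings", "engine", "cockpit"]
# Invented probability that a hit in each section brings the plane down.
LETHALITY = {"fuselage": 0.1, "wings": 0.1, "engine": 0.8, "cockpit": 0.7}

all_hits = Counter()
returned_hits = Counter()
for _ in range(10_000):
    hit = rng.choice(SECTIONS)           # hits spread evenly across the plane
    all_hits[hit] += 1
    if rng.random() > LETHALITY[hit]:    # plane survives and makes it home
        returned_hits[hit] += 1

print("hits on all planes:      ", dict(all_hits))
print("hits on returning planes:", dict(returned_hits))
# Engine and cockpit hits are common overall but rare among returning planes,
# which is exactly the pattern Wald recognized in the damage reports.
```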

The Challenger space shuttle exploded 73 seconds after lift-off in 1986, in part because of data that was missing from a visualization. The explosion was caused by a hole in a solid rocket motor that opened when an O-ring seal failed to expand properly and seal its joint.

The investigation following the disaster discovered that, prior to launch, a graphic had displayed the number of O-rings in distress on each previous launch, along with the ambient air temperature on the launch date, but only for launches from which at least one distressed O-ring was recovered. By charting only the flights with at least one distressed O-ring, the graphic left off the majority of earlier launches, which had shown no signs of O-ring distress.

All of the omitted launches took place on days when the temperature was between 65 and 81 degrees Fahrenheit, while the launch with the most O-ring-distress occurrences had happened on a morning with an outside temperature of 51 degrees. The ambient air temperature on the morning of the explosion was 31 degrees. As the commission investigating the explosion found, the data missing from the graphic led to a false belief that the association between air temperature and the probability of O-ring failure was small to non-existent. This, in turn, led to the disastrous call to go ahead with the launch on an unseasonably cold morning, a decision that cost seven crew members their lives.
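
The distortion comes entirely from the filter applied before plotting. The sketch below uses made-up records (not the actual flight data) to show how dropping the incident-free flights erases the temperature contrast that the full record would reveal.

```python
# Hypothetical (invented) records: (temperature in degrees F, distressed O-rings).
flights = [
    (53, 3), (57, 1), (63, 1), (70, 1), (75, 2),   # flights with at least one incident
    (66, 0), (67, 0), (68, 0), (70, 0), (72, 0),   # flights with no incidents
    (73, 0), (76, 0), (79, 0), (81, 0),
]

def mean_temp(rows):
    return sum(temp for temp, _ in rows) / len(rows)

with_distress = [row for row in flights if row[1] > 0]
print("mean temp, flights with distress only:", round(mean_temp(with_distress), 1))
print("mean temp, all flights:               ", round(mean_temp(flights), 1))
# In this toy data the incident flights are noticeably colder on average than the
# full record; a chart that drops the incident-free flights cannot show that contrast.
```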

Visualizing Smarter

In a world increasingly dominated by data visualizations, how can a person avoid drawing mistaken or unjustified conclusions from these graphics? In a March 2020 article, Amanda Makulec offered ten suggestions for avoiding misinformation, and she has since followed up on those recommendations. Her pointers, written in the context of information about COVID-19, offer the best means of guarding against being misled by visuals.

The most important rule is this: do more to understand your data than simply downloading it and plugging it into a dashboard or graphic. If you don't take into account the complexities of data collection—for example, what the numbers actually represent, how they were reported, who collected them, and other factors—you run the risk of misleading your audience as easily as you could enlighten them. If you're not familiar with your data, there's a good chance your audience won't be either.
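
In practice, "understanding your data" can start with a handful of routine checks before anything is charted. The snippet below is a hedged sketch of that kind of inspection using pandas; the file name and the dataset itself are placeholders, not anything Makulec prescribes.

```python
import pandas as pd

# Placeholder file name; substitute the dataset you actually downloaded.
df = pd.read_csv("downloaded_data.csv")

print(df.shape)                    # how many rows and columns are there?
print(df.dtypes)                   # are numbers actually parsed as numbers?
print(df.isna().sum())             # where are values missing, and how often?
print(df.describe(include="all"))  # ranges, outliers, suspicious constants
print(df.duplicated().sum())       # duplicate rows from repeated reporting?
```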

According to Makulec, data can be highly misleading if the base groups for comparing metrics are different or not readily identifiable. The problem of inconsistent data affects many disciplines, from how countries count COVID cases, to how municipalities compile information on crime, to how companies compute the proportion of employees who have left in the past year.
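
The base-group problem is easy to demonstrate with a couple of lines of arithmetic. The figures below are invented for illustration; they show how raw counts and population-adjusted rates can point in opposite directions.

```python
# Invented figures for two regions of very different size.
regions = {
    "Region A": {"cases": 5_000, "population": 1_000_000},
    "Region B": {"cases": 4_000, "population": 200_000},
}

for name, r in regions.items():
    per_100k = r["cases"] / r["population"] * 100_000
    print(f"{name}: {r['cases']} cases, {per_100k:.0f} per 100,000 residents")

# Region A has more raw cases, but Region B's rate is four times higher:
# a bar chart of raw counts would rank the regions the wrong way around.
```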

It is crucial to represent uncertainty as well as known data. Makulec suggests that visualizations be honest about what is not being shown: visualize not only the known data, but also the uncertainty around it.
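
One common way to do this is to draw an uncertainty band around a point estimate rather than a bare line. The matplotlib sketch below uses synthetic numbers and an arbitrarily chosen band width purely to illustrate the idea; it is not a prescribed method from Makulec's article.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic example: daily estimates with a +/- band around them.
days = np.arange(1, 15)
estimate = 100 + 5 * days + np.random.default_rng(0).normal(0, 3, size=days.size)
uncertainty = 10 + 0.5 * days  # wider bands where we know less

fig, ax = plt.subplots()
ax.plot(days, estimate, label="estimated cases")
ax.fill_between(days, estimate - uncertainty, estimate + uncertainty,
                alpha=0.3, label="uncertainty band")
ax.set_xlabel("day")
ax.set_ylabel("count")
ax.legend()
plt.show()
```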

Don’t Be Deceived by Data Visualization

Data visualization can be used to find patterns in data and to teach audiences about the important aspects of collected information. But if data visuals are not created thoughtfully, they can be just as misleading as traditional statistics. It is just as important to be skeptical of data visualizations as it is to be skeptical at every other stage of statistical analysis.
