Are You Wasting Money on Big Data?

The proliferation of AI solutions has made big data more popular, but you can probably find a less expensive way to generate useful and actionable insights.

Three U.S.-based economists were awarded the Nobel Prize in Economic Sciences for their contributions to labor economics and econometrics. While Joshua Angrist, Guido Imbens, and David Card may not be household names for most data practitioners, the most important theme of their work (and that of their frequent collaborator, Alan Krueger, who surely would have shared in the prize but for his tragic suicide in 2019) is likely familiar: with careful research design, statistical analysis of data can estimate the causes behind observed patterns rather than merely describe them.

The "credibility revolution" in econometrics was started by these four researchers. The movement was mainly concerned with how best to design empirical research in order to gain causal explanations using statistical evidence. The work done on research design not only provides insight into why various things may occur but also suggests an alternative to the current trend of collecting copious amounts of disparate data for training algorithms to sort out factors responsible for data regularities.

This alternative is a particularly viable way to answer questions relevant to business decisions when weighed against the cost of getting big-data infrastructure in place. For example: what types of non-monetary benefits are most effective in reducing employee turnover?

You can gain meaningful insights at lower cost by thinking experimentally and designing research accordingly.


Data Analysis Gets Costly

It is a common mistake for companies to assume that data collected and disseminated for financial or other compliance-reporting purposes can be easily expanded to generate data sets large enough to train AI-based algorithms for predictive inference.

Data collected for compliance or reporting purposes is typically gathered specifically for the department or line of business that needs it. Data collected for AI-caliber analysis, by contrast, requires a hierarchy of essential components for regular, enterprise-wide (or even extra-enterprise) data collection, processing, and dissemination. Not only must data be collected and structured consistently across different areas of an organization; data from outside the organization (for example, census figures, unemployment statistics, or other metrics from the federal government, individual states, or localities) must also be gathered, revised, and assembled to fit the company's data architecture.

In addition, data used for predictive inference must be designed specifically for that purpose, rather than for end-use in a dashboard or Excel spreadsheet. Even data-savvy companies can find it difficult to surmount these challenges efficiently and cheaply.

Even if a company succeeds in streamlining and scaling its data-collection processes to support regular estimation and prediction with big data, it may still be constrained by the quality of the methods originally used to assemble the data. A functioning data-production pipeline will not resolve data-quality issues such as biased sampling, poorly defined metrics, or incorrect and inappropriate collection methods.

Experimentation Without Big Data

Our credibility revolutionaries offer an alternative to these extensive, tough-to-implement data processes. Even if a business cannot get the data-collection infrastructure needed to sustain big-data analysis in place in the short run (or at all), it should not rule out its ability to gather useful evidence and reach credible conclusions. Indeed, a central reason this year's Nobelists won the prize is their work identifying how disruptive, random events can mimic the random assignment of patients to treatment and control groups in a clinical trial.

Idiosyncratic events (such as pandemics, wars, terrorist attacks, or natural catastrophes) can't be planned, but because they affect different groups of people differently, they can function like clinical trials. This approach relies on the assumption that, had the event not occurred, the quasi-treatment and quasi-control groups would have experienced similar outcomes.

A classic problem in social science research is the interplay between changes in crime rates and the number of police officers assigned to an area. Teasing out whether police presence affects crime is difficult because more officers are often assigned to an area precisely because crime there is elevated. A naive statistical analysis of police increases on crime rates would therefore find a positive correlation between the two variables.

That analysis obscures the causal story: it suggests that the increase in police caused an increase in crime, rather than the intended decrease. The finding not only misleads readers but also answers the wrong question, namely what happened after police were reassigned. The more relevant question for estimating a direct effect is what would have happened to crime rates had police not been reassigned.

Natural Experimentation

Two economists set out to answer that question by using a natural, disruptive event as a stand-in for the random assignment of additional police patrols. Jonathan Klick and Alex Tabarrok argued that changes in the Homeland Security Advisory System (HSAS) disrupted police presence in Washington, D.C.: when the threat level rose, more officers were put on the street for reasons unrelated to local crime rates, and crime fell. By taking advantage of these fluctuations in the terror-alert level, Klick and Tabarrok were able to demonstrate the direct impact of increased police patrols on lowering crime in the affected areas, breaking the chicken-and-egg circularity between police and crime rates.
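To make the logic concrete, here is a minimal sketch in Python of the kind of comparison such a natural experiment enables. The data, numbers, and column names are entirely hypothetical, not figures from Klick and Tabarrok's study.

```python
import pandas as pd

# Hypothetical daily data for one city: reported crimes and whether the
# national terror-alert level was elevated that day (the "natural" treatment).
days = pd.DataFrame({
    "high_alert": [0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    "crimes":     [95, 92, 81, 78, 97, 80, 94, 96, 79, 93],
})

# If alert changes are unrelated to local crime conditions, comparing average
# crime on high-alert days (when extra police are on the street) with ordinary
# days approximates the causal effect of additional policing.
print(days.groupby("high_alert")["crimes"].mean())
```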

David Card, one of this year's prize winners, provided another classic example of using a "natural experiment" to better understand economic outcomes. Card investigated wage and unemployment effects after the 1980 Mariel boatlift, which began with Fidel Castro's announcement on April 20, 1980, that any Cubans who wanted to leave the country could do so by boat from the Cuban port of Mariel. Card argued that the sudden influx of mostly low-skilled young men from the boatlift represented a natural experiment.

If labor supply suddenly increases without a corresponding increase in labor demand, economic theory predicts that wages will fall and unemployment will rise as new workers compete for the same pool of jobs.

These effects should have been most severe in jobs that employed low-skilled young men, including workers in the same demographic categories who were not Mariel Cubans.

Card compared wage and unemployment rates in Miami from 1979 to 1985 with those of comparable demographic groups of workers in a selection of other American cities to see whether the refugee influx had any labor-market effects. He found that wage and unemployment trends in Miami were only slightly different from those in the other cities, both before and after the Mariel refugees arrived. Card's work has had a lasting impact because it was one of the first studies to investigate the potential causes of effects using naturally occurring events in a quasi-experimental way.
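One simple way to operationalize this kind of comparison is a difference-in-differences regression. The sketch below uses made-up wage figures and an arbitrary comparison city purely for illustration; the coefficient on the treated:post interaction can be read as the boatlift's effect only under the assumption that the two cities would otherwise have followed parallel trends.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical city-year panel of average low-skill wages (illustrative only).
df = pd.DataFrame({
    "city": ["Miami"] * 4 + ["Atlanta"] * 4,
    "year": [1979, 1980, 1981, 1982] * 2,
    "wage": [5.1, 5.0, 5.0, 5.1, 5.2, 5.3, 5.2, 5.3],
})
df["treated"] = (df["city"] == "Miami").astype(int)  # exposed to the boatlift
df["post"] = (df["year"] >= 1980).astype(int)        # after April 1980

# Difference-in-differences: the treated:post coefficient estimates the
# boatlift's wage effect under the parallel-trends assumption.
model = smf.ols("wage ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])
```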

Good Experimental Protocols

Identifying sources of randomness that can roughly mimic random assignment in clinical studies represents an excellent opportunity for quantifying purportedly causal effects.

Natural experimental methods come with a price: they rely on the assumption that the natural process effectively randomized the effects being estimated across populations. That assumption stands in for the explicit randomization conducted before clinical trials, and there is no statistical method that can guarantee it holds.

In the policing example, the researcher must assume that changes in the HSAS threat level are not themselves associated with changes in metropolitan crime rates. In the Mariel case, the corresponding (and much-debated) assumption is that workers already in Miami before 1980 had no connection to the Mariel emigres that would have shaped their labor-market outcomes even without the boatlift.

What if, for example, the D.C. municipal police have close relationships with the U.S. Marshals, Department of Homeland Security personnel, or members of the armed services at the Pentagon? If such channels mean the natural assignment of groups to conditions is not really random, the proposed causal effect is undermined.

Building Your Own Experiments

By randomly experimenting with their own business processes, companies can sidestep the natural-randomization assumptions that quasi-experimental methods turn on.

To estimate the revenue impact of different marketing strategies for a new sale, for instance, a firm might randomly assign different advertising formats among known customers, between markets, or between store locations within markets. Rather than relying on random chance, the firm actively designs its own experiment by allocating specific treatment and control conditions.
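As a rough sketch of what such a designed experiment might look like in code, the example below randomly assigns store locations to a new advertising format or the usual one and then compares average revenue. The store names, sample size, and revenue figures are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical store locations, each randomly assigned to run the new
# advertising format ("treatment") or the usual one ("control").
stores = [f"store_{i:03d}" for i in range(200)]
assignment = rng.choice(["treatment", "control"], size=len(stores))

# Observed weekly revenue per store after the campaign (simulated here,
# with an artificial lift added to the treatment group for illustration).
revenue = rng.normal(loc=100_000, scale=15_000, size=len(stores))
revenue[assignment == "treatment"] += 4_000

treated = revenue[assignment == "treatment"]
control = revenue[assignment == "control"]

# Because assignment was random, the difference in means estimates the
# causal revenue effect of the new format.
effect = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
print(f"Estimated lift: {effect:,.0f} +/- {1.96 * se:,.0f} (95% CI)")
```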

This falls short of the gold standard of an actual clinical trial, because people in markets are not groups in laboratory settings where confounding factors can be closely controlled. Customers can be lost to other firms or markets, and the individuals assigned to treatment and control might never be known, such as when both groups comprise geographic areas or store locations.

While not quite a randomized controlled trial, small-scale experimentation involving random assignment can offer deeper, more reliable conclusions than natural experimental methods.

Don’t Rely on Data Alone

There is a price to pay, in dollars and labor-hours spent as well as in potential systemic failures, for investing in new data infrastructure or kludging together disparate data systems whose incompatible designs serve different corporate purposes. The potential benefits of better estimates and a causal explanation for why something occurred can motivate attempts to institute AI-based solutions, but those solutions can prove too costly, or simply impossible, to implement.

Designing experiments to identify the potential causes of effects can yield valuable insights at far lower cost. Knowing what causes what lets you see which processes are more effective and which are less.

Thinking in terms of experiments (such as randomly shutting off a marketing channel's spending, assigning different hours-tracking systems to a random subset of workplaces, or comparing how workers take sick leave across states with different leave policies) can both provide useful information to decision-makers and give them an explanation for why they made those specific choices.

Besides, AI hasn’t even won a Nobel prize (yet).
