Data science without causal inference is like a fish without water

Learning the foundational concepts of causal inference is crucial for data science because most questions of interest are causal in nature.

Whether we perform impact evaluations, A/B testing, quality control or clinical trials, causal inference is the method of choice.

Causal inference is one of the most complex approaches to data analysis, but it offers high rewards in the insights it provides. In essence, the theory and methods behind causal inference enable us to analyse causal relationships and thus fully unlock the value that data holds.

Being familiar with causal inference methods and techniques also equips us with problem-solving skills that are crucial for analysing data in a scientifically objective way.

The main problem of causal inference is missing data and how to handle it. Incomplete data is a typical problem in data science and is often associated with biased results, so familiarity with causal inference methods and techniques improves our ability to deal effectively with incomplete data.

Understanding why data are incomplete is crucial for handling missing data.
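One common way to see why causal inference is a missing data problem is the potential-outcomes framing, which is assumed here and not spelled out in the text. Each unit has one outcome with the intervention and one without, but only one of the two can ever be observed. The sketch below uses hypothetical toy values, and the field names y0 and y1 are illustrative only.

```python
# Hypothetical toy data: each unit has two potential outcomes,
# y0 (outcome without the intervention) and y1 (outcome with it).
# In a real study nobody ever sees both values for the same unit.
units = [
    {"id": 1, "treated": True, "y0": 4.0, "y1": 6.0},
    {"id": 2, "treated": False, "y0": 5.0, "y1": 5.5},
    {"id": 3, "treated": True, "y0": 3.0, "y1": 7.0},
    {"id": 4, "treated": False, "y0": 6.0, "y1": 6.0},
]


def observe(unit):
    """Return what an analyst actually sees: one outcome, the other missing."""
    if unit["treated"]:
        return {"id": unit["id"], "treated": True, "y0": None, "y1": unit["y1"]}
    return {"id": unit["id"], "treated": False, "y0": unit["y0"], "y1": None}


def missing_fraction(rows):
    """Share of potential-outcome cells that are unobserved."""
    cells = [row[key] for row in rows for key in ("y0", "y1")]
    return sum(cell is None for cell in cells) / len(cells)


observed = [observe(unit) for unit in units]
print(missing_fraction(observed))  # prints 0.5
```

Half of the potential outcomes are missing by construction, whatever the sample size, which is why a unit-level causal effect (y1 minus y0) cannot be read directly off the observed data.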

For a long time, causal inference was considered valid only within a randomised experimental framework. Thanks to developments in statistical methodology, we can now also perform causal inference with observational data, i.e., data that do not come from a randomised experiment.

In recent years, causal inference has become one of the most popular methods for analysing data. However, many still struggle with the complexities of its conceptual framework.

The conceptual framework of causal inference provides foundational knowledge about the causal reasoning required, as well as the use of modern statistical thinking to design studies and analyse data with causal effects in mind. Such foundational knowledge is critical for analysing causal relationships in a scientifically objective way.

Causal inference also helps us understand the impact that study design has on the trustworthiness of the insights obtained, as well as on our capacity to unlock the value that data holds.

Some examples of questions that causal inference can answer:

Not all causal questions can be answered.


For example, do black students achieve higher educational attainment than white or Hispanic students?

In this example, race is treated as the cause. However, we cannot manipulate such a cause: there is no intervention that could change a person's race. The results of such a study therefore cannot be called causal effects. They are associations, conditional on the set of covariates used in the comparative analysis.

Another example of a cause that cannot be manipulated is sex. We cannot give an individual a magic pill that changes their sex.

Within the randomised experimental framework we use an intervention to manipulate the units of one of the groups being compared. For example, we apply the intervention to the units of one group (usually called the treated group) and not to the units of the other, the control group. In the language of causal effect studies, the intervention is the known cause.
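A minimal Python sketch of this two-group comparison, under assumptions that are not in the text: outcomes are simulated, the true effect is a hypothetical 2.0, and a fair coin flip decides which units receive the intervention.

```python
import random
from statistics import mean


def run_randomised_experiment(n_units=2000, true_effect=2.0, seed=0):
    """Simulate a two-group randomised experiment and return the
    difference in mean outcomes (treated minus control)."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_units):
        # Outcome the unit would show without the intervention (simulated).
        baseline = rng.gauss(10.0, 1.0)
        if rng.random() < 0.5:
            # Coin flip says "treated": the intervention shifts the outcome.
            treated.append(baseline + true_effect)
        else:
            control.append(baseline)
    return mean(treated) - mean(control)


estimate = run_randomised_experiment()
print(f"estimated effect: {estimate:.2f}")
```

Because the coin flip ignores everything about the units, the two groups differ systematically only in the intervention, so the difference in means should land close to the assumed effect of 2.0.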

The known cause is a cause that can be manipulated. When we can define the known cause, we can also use causal inference methods and techniques to carry out causal effect studies with observational data. With observational data, however, we must make every effort to construct two comparable groups: groups of units that are approximately identical with respect to the important covariates and differ only in whether the intervention, i.e., the known cause, was applied.
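One widely used way to check whether two groups are comparable on a covariate is the standardised mean difference. The sketch below is an illustration rather than a method prescribed by the text. The age values are hypothetical, and the 0.1 cut-off is a common rule of thumb, not a fixed standard.

```python
from math import sqrt
from statistics import mean, variance


def standardised_mean_difference(treated_values, control_values):
    """Difference in covariate means, scaled by the pooled standard deviation.
    Assumes each group has at least two values and some spread."""
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    return (mean(treated_values) - mean(control_values)) / pooled_sd


# Hypothetical covariate (age) for a treated group and two candidate control groups.
treated_age = [30, 35, 40, 45]
similar_control_age = [31, 34, 41, 44]
older_control_age = [50, 55, 60, 65]

for label, control_age in [("similar", similar_control_age), ("older", older_control_age)]:
    smd = standardised_mean_difference(treated_age, control_age)
    verdict = "balanced" if abs(smd) < 0.1 else "imbalanced"  # rule-of-thumb cut-off
    print(f"{label} controls: SMD = {smd:.2f} ({verdict})")
```

In practice the check would be repeated for every important covariate, since the groups need to be comparable on all of them at once.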

Selection of covariates

Careful selection of covariates is of utmost importance if we are to restructure observational data so that they mimic the data structure of a randomised experiment. Such reconstruction is a complex task, but in essence it requires that we reconstruct the assignment mechanism of the observational data so that it mimics the assignment mechanism of a randomised experiment.

What is an assignment mechanism?

In a two-group randomised experimental design, units are assigned to either Group 1 or Group 2, commonly called the treated group and the control group. The mechanism that assigns units at random is called the assignment mechanism. With observational data such an assignment mechanism either does not exist or is broken, so it is important to reconstruct it in a way that mimics the assignment mechanism of a randomised experiment.
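To illustrate what a broken assignment mechanism does, the sketch below simulates observational data in which a single binary covariate drives both the chance of being treated and the outcome. It then compares a naive difference in means with a comparison made within covariate strata (subclassification). Subclassification is one simple technique chosen here for illustration and is not named in the text. All numbers, including the true effect of 2.0, are hypothetical.

```python
import random
from statistics import mean


def simulate(n_units=6000, true_effect=2.0, seed=1):
    """Return rows of (covariate, treated, outcome) from a non-random assignment."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_units):
        x = rng.random() < 0.5                  # binary covariate
        p_treat = 0.8 if x else 0.2             # assignment depends on x
        treated = rng.random() < p_treat
        outcome = 10.0 + (3.0 if x else 0.0) + rng.gauss(0.0, 1.0)
        if treated:
            outcome += true_effect
        rows.append((x, treated, outcome))
    return rows


def diff_in_means(rows):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [y for _, is_treated, y in rows if is_treated]
    control = [y for _, is_treated, y in rows if not is_treated]
    return mean(treated) - mean(control)


def stratified_effect(rows):
    """Average the within-stratum differences, weighted by stratum size."""
    estimate = 0.0
    for level in (False, True):
        stratum = [row for row in rows if row[0] == level]
        estimate += len(stratum) / len(rows) * diff_in_means(stratum)
    return estimate


rows = simulate()
print(f"naive estimate:      {diff_in_means(rows):.2f}")
print(f"stratified estimate: {stratified_effect(rows):.2f}")
```

Within each stratum every unit has the same chance of treatment, so each stratum behaves like a small randomised experiment. The naive comparison mixes the effect of the intervention with the effect of the covariate and overshoots the assumed effect. This works only because the one covariate that drives assignment is observed, which is an assumption of the simulation.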

The process of reconstructing the assignment mechanism can in many ways be considered an art. Yes, science requires art! However, this ‘art’ requires us to be thoroughly familiar with the necessary causal inference assumptions and with ways to satisfy them.

Causal inference without assumptions is mission impossible

Certain causal assumptions must be satisfied by the study design if we are to draw trustworthy conclusions from causal effect estimates. Justifying these assumptions is difficult. It requires creative thinking, modern statistical thinking and an understanding of the science of causal reasoning.

Understanding the assumptions and how to justify them is of great importance when designing causal inference studies, because the effectiveness of a causal design depends on our capacity to justify the required assumptions. The more effective the study design, the better we can justify them. Good study design matters so much that, in the words of Donald B. Rubin, “Sometimes the design effort can be so extensive that a description of it, with no analyses of any outcome data, can be itself publishable” (Rubin, 2008, For Objective Causal Inference, Design Trumps Analysis, The Annals of Applied Statistics).

To design causal inference studies effectively, it is important to become familiar with the conceptual foundations of causal inference. Causal inference is neither an algorithm nor an equation, but a methodological and analytical approach to analysing causal relationships that requires heavy use of ‘human-mind’ software.
