To describe, to predict or to explain?
As researchers, it is easy to get lost in the specifics of our research question and lose sight of the big picture. I have found that stepping back to consider the overarching aim of a study can greatly clarify decisions around study design, analysis and interpretation. At its core, most research falls into one of three broad categories: description, prediction, or explanation (i.e., causal inference). Yet, distinguishing among these objectives is not always straightforward. It took me way longer than I would like to admit to understand the differences between them. This post provides a summary table that highlights their key characteristics and differences. I hope it helps you develop a practical understanding of each.
Note: A fourth overarching research objective would fall under the “method” category. Such studies include, for example, the development of statistical methods and tools, method comparisons and simulation studies. This category is not addressed here, for simplicity and because it is usually not a big source of confusion; at least, not as big as the others.
Why does it matter?
Each broad category has its own “preferred” study design and data requirements. Identifying the overarching research objective early in the process therefore clarifies which design, data, analysis and interpretation a given research question calls for. Adequately labelling the study also helps the peer review process, as it makes it easier to check that the scientific objective, methods and conclusions are well aligned.
A simple classification of quantitative research objectives
The simple classification in Table 1 is strongly inspired by the work of Hernán, Hsu and Healy (2019), to which I have added nutrition and health examples. I also invite interested readers to read the blog post “Sorry, what was the question again?” by Darren Dahly.
Table 1: A simple classification of quantitative research objectives
Characteristic | To describe | To predict | To explain |
---|---|---|---|
Typical label | Descriptive study | Prediction research (diagnostic or prognostic study) | Causal inference |
Description | Describe a phenomenon | Generate a probability, a risk or a value of an outcome given known inputs, variables or features | Estimate the probability, risk or value of an outcome, had we modified something |
Goal | Know the extent of an issue or phenomenon of interest, surveillance | Decide on the ideal course of action or allocate resources given the probable future | Examine the effect of an action, determine if doing something works, identify the mechanism that explains an outcome |
Typical design | National survey, cross-sectional studies | Prospective studies, cross-sectional studies | Randomized experiments, prospective studies |
Common statistics | Mean, percentile, prevalence | Odds, risk or hazards, probability or mean | (Counterfactual) probability, risk ratio or difference, mean difference |
Key threats to validity | Representativeness of the target population (sampling), measurement error | Internal validity (e.g., model performance), external validity (e.g., performance in a different sample, ability to collect the variables) | Confounding and selection bias |
Nutrition and health research question | What is the prevalence of malnutrition among admitted hospital patients? | What is the 30-day risk of rehospitalization given age, dwelling and malnutrition status of admitted patients? | What would be the 30-day risk of rehospitalization, had we provided energy and protein supplements to admitted patients, vs., instead, had we not provided the supplements? |
Example interpretation of hypothetical findings | The prevalence of malnutrition among admitted patients is 60%. This is very high, confirming that malnutrition is a common issue among hospitalized patients. | The 30-day risk of rehospitalization is 80% greater for an older adult living in a long-term care home and with malnutrition, compared with an individual without these characteristics. Individuals at higher risk of rehospitalization should be referred to a registered dietitian and kinesiologist for counselling. | Had we provided energy and protein supplements to all admitted patients (vs. had we not), the 30-day risk of rehospitalization would have been 10% lower. Providing energy and protein supplements could be a potential strategy to reduce rehospitalizations. |
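One way to see how the three example questions in Table 1 differ is to write down the quantity each one targets. The notation below is an assumption introduced here for illustration and is not part of the original table: $M$ is malnutrition status, $Y$ is rehospitalization within 30 days, $X$ is the set of predictors (age, dwelling, malnutrition status), and $Y^{a}$ is the outcome that would have been observed had supplements been provided ($a=1$) or not ($a=0$).

$$
\begin{aligned}
\text{To describe:} \quad & \Pr[M = 1] \\
\text{To predict:} \quad & \Pr[Y = 1 \mid X = x] \\
\text{To explain:} \quad & \Pr[Y^{a=1} = 1] \ \text{vs.} \ \Pr[Y^{a=0} = 1]
\end{aligned}
$$

Only the last line involves outcomes under hypothetical interventions, which cannot both be observed for the same patient. This is where concerns such as confounding become relevant.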
Next, I review common mistakes that happen when the research objectives above are confused with one another.
Common mistakes
Unnecessary confounding limitation
The mistake: A descriptive study (e.g., national survey or cross-sectional study) includes a limitation that “causality cannot be inferred, given the cross-sectional nature of the data” or that “residual confounding remains a possibility”.
What is wrong: Confounding is a concept inherent to causal inference, that is, when the overarching research objective is to explain. Confounding is not applicable or not relevant when the purpose is to describe (Conroy & Murray, 2020). Indeed, the goal is to describe the world “as is”.
Explanation using example: Describing the prevalence of malnutrition among hospitalized patients does not require accounting for confounding, because the statistic of interest is “as is”: the (crude) prevalence of malnutrition among hospitalized patients.
This would be the same if we were to assess the prevalence of malnutrition within subgroups separately. For example, the prevalence of malnutrition may differ between sexes. In that case, researchers may stratify their data to obtain sex-specific prevalence. To interpret sex-specific prevalence of malnutrition, could a third variable be a confounding factor, say, age? Should we further account for it? While it may be relevant to generate age-standardized statistics in some settings, age is not a confounder per se of the relationship between sex and malnutrition; “confounding” should not be the reason why age-standardized prevalence is required. Again, the (crude) prevalence within each sex is often the statistic of interest, because the aim is to describe the phenomenon as it is.
To go one step further, we may find that malnutrition is more prevalent in females than in males. It might turn out that this observation is due to the fact that females live longer than males and, thus, that age is one of the primary reasons for that observation. Once we account for age, we may find that there is no longer a difference between females and males. Regardless, age would not be a confounder of this observation. Malnutrition is still more prevalent in females than in males.
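The distinction between crude and age-standardized prevalence can be made concrete with a small sketch. The counts below are entirely hypothetical and were chosen so that malnutrition depends only on age, while females are older on average; the function names are illustrative, not taken from any library.

```python
# Hypothetical counts, invented for illustration only (not real data).
# Keys: (sex, age group) -> (number of patients, number with malnutrition)
counts = {
    ("female", "<75"): (200, 80),
    ("female", "75+"): (800, 560),
    ("male", "<75"): (600, 240),
    ("male", "75+"): (400, 280),
}
AGE_GROUPS = ("<75", "75+")


def crude_prevalence(sex):
    """Prevalence "as is" within one sex, ignoring age."""
    total = sum(counts[(sex, age)][0] for age in AGE_GROUPS)
    cases = sum(counts[(sex, age)][1] for age in AGE_GROUPS)
    return cases / total


def age_standardized_prevalence(sex):
    """Prevalence one sex would have under the age distribution of all patients."""
    grand_total = sum(n for n, _cases in counts.values())
    result = 0.0
    for age in AGE_GROUPS:
        # Weight = share of all patients (both sexes) in this age group.
        n_age = sum(counts[(s, age)][0] for s in ("female", "male"))
        n, cases = counts[(sex, age)]
        result += (n_age / grand_total) * (cases / n)
    return result


for sex in ("female", "male"):
    print(
        f"{sex}: crude {crude_prevalence(sex):.0%}, "
        f"age-standardized {age_standardized_prevalence(sex):.0%}"
    )
```

With these made-up counts, the crude prevalence is 64% in females and 52% in males, while the age-standardized prevalence is 58% in both. Both sets of numbers are valid descriptions; they simply answer different descriptive questions, and the crude figures remain the answer to “how common is malnutrition among female and male patients as they are?”.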
A true limitation for a descriptive study could be that the sample used to estimate the prevalence of malnutrition was taken from a single department within the hospital (selection bias). The extent to which patients in that department reflect patients from all departments is the relevant information to put findings into context.
Confusing prediction and causal inference
The mistake: The risk of an outcome is predicted using a multivariable model. The authors then argue that variables in the model may be appropriate targets for intervention to modify the risk of the outcome.
What is wrong: Predicting an outcome given inputs is not the same as doing causal inference. In other words, correlation does not equal causation. However, this correlation-causation fallacy can be harder to identify because prediction models often include many, if not hundreds of, variables. In this context, it may be tempting to conclude that any single variable or feature in the model reflects an independent causal pathway. The apparent complexity and sophistication of machine learning algorithms may further contribute to this issue. However, the coefficients of several variables in a single model generally cannot all be interpreted as causal effects, an issue known as the Table 2 fallacy (Westreich & Greenland, 2013).
Explanation using example: Living in a long-term care home could be a major predictor of the risk of rehospitalization. However, we hope (!) that living in a long-term care home is not itself what causes rehospitalization. Rather, it is more plausible that these individuals have a much higher disease burden, which prevented them from living in private dwellings in the first place. The high disease burden is more likely the main factor driving the observation that older adults living in long-term care homes have a higher 30-day risk of rehospitalization. In other words, in this example, intervening on an individual’s living arrangement would not modify the risk of rehospitalization, because the “true” underlying cause - disease burden - has not been modified.
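The long-term care example can be illustrated with a small simulation. This is only a sketch under strong assumptions: all probabilities are made up, disease burden is reduced to a single yes/no variable, and rehospitalization is assumed to depend on disease burden only. The function names `simulate` and `risk` are hypothetical helpers.

```python
import random


def simulate(n, rng, set_ltc=None):
    """Simulate n patients; optionally force the living arrangement for everyone."""
    rows = []
    for _ in range(n):
        # Assumed: 30% of patients have a high disease burden.
        burden = rng.random() < 0.3
        if set_ltc is None:
            # Observed world: high burden makes long-term care (LTC) more likely.
            ltc = rng.random() < (0.8 if burden else 0.1)
        else:
            # Hypothetical intervention: living arrangement is set for everyone.
            ltc = set_ltc
        # Assumed: rehospitalization depends on disease burden only, not on LTC.
        rehosp = rng.random() < (0.5 if burden else 0.1)
        rows.append((ltc, rehosp))
    return rows


def risk(rows, ltc_value):
    """30-day rehospitalization risk among patients with the given LTC status."""
    outcomes = [rehosp for ltc, rehosp in rows if ltc == ltc_value]
    return sum(outcomes) / len(outcomes)


rng = random.Random(2024)

# Prediction view: compare patients as observed.
observed = simulate(100_000, rng)
observed_diff = risk(observed, True) - risk(observed, False)

# Causal view: compare "everyone in LTC" with "no one in LTC".
everyone_ltc = simulate(100_000, rng, set_ltc=True)
no_one_ltc = simulate(100_000, rng, set_ltc=False)
causal_diff = risk(everyone_ltc, True) - risk(no_one_ltc, False)

print(f"Observed risk difference: {observed_diff:.3f}")
print(f"Risk difference under intervention: {causal_diff:.3f}")
```

Under these assumed parameters, the observed risk difference is large (about 0.27 in expectation), so living arrangement is a useful predictor. The difference under intervention is close to zero, because changing the living arrangement leaves disease burden untouched.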
Of note, being older (key variable: age) could be a strong predictor of rehospitalization and also have a causal relationship with the outcome. Indeed, older adults tend to have higher health care needs. In a purely predictive study, we would not know which variables have a causal relationship with the outcome. Similarly, even if all variables included in a given prognostic prediction model have a causal relationship with the outcome (e.g., based on background knowledge and prior literature), the estimated coefficient of each variable will not necessarily correspond to its “true” causal effect (Westreich & Greenland, 2013).
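A simplified linear example may help show why coefficients from one model need not all be “true” causal effects. Everything below is an illustrative assumption rather than a result from the cited paper: $A$ is age, $M$ is malnutrition, $Y$ is the outcome, the relationships are linear with independent errors, and there is no other confounding. Suppose age affects the outcome both directly and through malnutrition:

$$
\begin{aligned}
M &= \alpha A + \varepsilon_M \\
Y &= \beta_A A + \beta_M M + \varepsilon_Y
\end{aligned}
$$

A model that includes both variables recovers

$$
E[Y \mid A, M] = \beta_A A + \beta_M M ,
$$

whereas substituting the first equation into the second gives the total effect of age:

$$
E[Y \mid A] = (\beta_A + \alpha \beta_M) A .
$$

In this sketch, the coefficient of $M$ can be read as the effect of malnutrition (age is adjusted for), but the coefficient of $A$ from the same model is only the part of the age effect that does not go through malnutrition, $\beta_A$, and not the total effect $\beta_A + \alpha \beta_M$.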
References
Conroy, S., & Murray, E. J. (2020). Let the question determine the methods: descriptive epidemiology done right. Br J Cancer, 123(9), 1351-1352. doi:10.1038/s41416-020-1019-z
Hernán, M. A., Hsu, J., & Healy, B. (2019). A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks. CHANCE, 32(1), 42-49. doi:10.1080/09332480.2019.1579578
Westreich, D., & Greenland, S. (2013). The table 2 fallacy: presenting and interpreting confounder and modifier coefficients. Am J Epidemiol, 177(4), 292-298. doi:10.1093/aje/kws412