Machine learning – finding causes or causing false findings?

David Gyllenberg

INVEST Blog 10/2021

Artificial intelligence techniques such as machine learning might make human life easier. While automation has been around for several decades, effective machine learning algorithms have entered our daily life during the last decade: streaming services suggest songs and videos that we may like, and online advertisements are tailored to us personally based on our clicks on the smartphone, just to mention a few examples. Given this rapid development, is there a limit to what we could achieve with these techniques if we feed the algorithms with enough data? For example, could machine learning disentangle the causes of complex diseases?

As a physician specializing in psychiatric disorders with onset in youth, I find it particularly compelling to identify new developmental risk factors for psychoses by combining large datasets and machine learning. This could have implications for how we conceptualize the disorders and could facilitate early targeted interventions.

Before being carried away by hype and buzzwords, it is worth taking a step back and considering what we are trying to achieve. A task could, for example, be to predict whether a specific person will be diagnosed with a psychiatric disorder. This kind of task is termed a ‘prediction problem’, and machine learning can be useful when dealing with such tasks. Another task is to examine the associations between risk factors and a diagnosis. Such tasks are termed ‘inference problems’, and the utility of machine learning for solving inference problems is far from straightforward.

If it is not straightforward to draw inferences from associations using machine learning, why not just stop there and pick a different approach? To pick the best approach, let us first distinguish between two scientific concepts: theory-driven research and data-driven research. Obviously, theory-driven research typically contains some exploratory aspects, and data-driven research relies on centuries of theory-driven research. Nonetheless, from an analytical point of view, it is crucial to distinguish between the two approaches in order to draw valid conclusions.

The workflow in theory-driven research typically contains the following steps: develop a theory, make a hypothesis based on the theory, design an experiment to test your hypothesis, collect data from the experiment and analyze the data by testing your hypothesis. In contrast, the workflow in data-driven research might start from already-collected data, followed by a number of analyses. Because many analyses are run on the same data without a hypothesis fixed in advance, data-driven research needs to follow rigorous protocols to avoid replicability problems. By replicability we usually mean that other research groups will produce similar results in other datasets, which is a cornerstone of science. Unfortunately, there have been concerns about a ‘replication crisis’ in science, according to a survey published in Nature (Baker, 2016). That said, is there a place for data-driven research when the goal is to draw inferences about risk factors? Yes. Sometimes available theory implies a complex relationship between risk factors and outcomes, but the theory gives very little guidance for specific hypotheses. Instead, the theory might give rise to tens, hundreds or even millions of hypotheses. To minimize the risk of missing relevant hypotheses, data-driven approaches are preferable despite their limitations.

An example of a research question where theory gives rise to a large number of hypotheses is the following: do risk factors interact with each other to increase the risk for psychosis? Theories imply that while several risk factors are associated with psychosis, it is the combination of risk factors that has the largest effect, an effect even larger than would be expected from adding up the effects of the single risk factors. The statistical term for this phenomenon is ‘interaction’, and a more everyday term would be ‘synergy’.
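
To make the idea of "more than the sum of the parts" concrete, here is a minimal sketch in Python. It assumes that interaction is measured on the additive risk scale, which is one common choice and not necessarily the one used in any particular study. The function names and all risk values are invented for illustration.

```python
def expected_additive_risk(baseline, risk_a_only, risk_b_only):
    """Risk with both factors present if their excess risks simply added up."""
    return baseline + (risk_a_only - baseline) + (risk_b_only - baseline)


def interaction_contrast(baseline, risk_a_only, risk_b_only, risk_both):
    """Observed joint risk minus the additive expectation.

    A value above zero suggests synergy on the additive scale.
    """
    return risk_both - expected_additive_risk(baseline, risk_a_only, risk_b_only)


# Hypothetical risks, invented for illustration only (not from any study).
baseline, a_only, b_only, both = 0.01, 0.02, 0.03, 0.10

expected = expected_additive_risk(baseline, a_only, b_only)
excess = interaction_contrast(baseline, a_only, b_only, both)
print(f"expected if additive: {expected:.2f}, observed: {both:.2f}, excess: {excess:.2f}")
```

With these made-up numbers, the two factors would be expected to give a risk of about 4% if their effects merely added up, whereas the invented observed risk is 10%. The excess of about 6 percentage points is what a statistician would call the interaction.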

A biological explanation could be that a person exposed to an environmental risk and another person with genetic vulnerability will both have a modest risk for a specific disease; however, when the person with genetic vulnerability is also exposed to the environmental risk, the genetic vulnerability is ‘activated’, which in turn leads to a very high risk for the disease. In short, it is a plausible theory. The problem is that such plausible theories do not pinpoint which specific combinations could be harmful; they only establish the general principle that combinations of factors can be harmful. In the end, a major goal of studying combinations of risk factors is to identify causal combinations: if the cause is reduced by an intervention, the risk of the disease will be reduced.

If this is the ultimate goal, it is not enough to state that combinations of risk factors are important. We need to know which specific combinations we should focus on. The number of possible combinations of factors increases rapidly with the number of variables: counting pairs alone, 10 factors produce 45 combinations, 20 factors produce 190 combinations and 30 factors produce 435 combinations, and the count grows exponentially once combinations of three or more factors are included. We therefore easily end up with a very large number of hypotheses.
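
The counts above are the numbers of possible pairs of factors, and they can be checked with a few lines of Python. The sketch below also counts combinations of any size (two or more factors), to show how quickly the number of candidate hypotheses grows when higher-order combinations are allowed. The function names are my own.

```python
from math import comb


def pair_count(n_factors):
    """Number of distinct pairs of risk factors."""
    return comb(n_factors, 2)


def all_combination_count(n_factors):
    """Number of combinations of two or more factors.

    All subsets (2**n) minus the empty set and the n single factors.
    """
    return 2 ** n_factors - n_factors - 1


for n in (10, 20, 30):
    print(n, "factors:", pair_count(n), "pairs,",
          all_combination_count(n), "combinations of any size")
```

The number of pairs grows with the square of the number of factors, while the number of combinations of any size roughly doubles with every added factor.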

Inspired by this background, we set out to ask three questions (Gyllenberg et al., 2020). Are interactions between risk factors for psychosis replicated in the literature? Can a machine learning algorithm identify interactions of risk factors correctly? Can we identify interactions between parental and sociodemographic risk factors for schizophrenia in a big dataset?

First, in a mini systematic review, we identified 30 interactions in the scientific literature spanning over 20 years, but only three of the interactions were reported in more than one paper. This underscores the importance of conducting studies that follow protocols giving the findings a chance of being replicated in future studies.

Second, we programmed a computer simulation that allowed us to test different kinds of interactions in different types of invented data. Using this simulation, we examined whether the analytic pipeline with a machine learning algorithm could identify the true programmed interaction, and whether it incorrectly identified other risk factors and interactions that were not present in our simulated dataset. Another way to put it: can we distinguish signal from noise? In the simulation, we showed that it is possible to correctly identify common interactions in large datasets with our analytic pipeline (publicly available online).
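
The logic of such a simulation can be illustrated with a toy example. The sketch below is not the published pipeline and does not use a machine learning algorithm. It is a deliberately simple stand-in that plants one interaction in invented data, tests every pair of factors with a z-test on the additive interaction contrast, and applies a Bonferroni correction to limit false positives. The sample size, prevalences, risks and function names are all assumptions made for illustration.

```python
import itertools
import random
from statistics import NormalDist


def simulate(n_people=20000, n_factors=10, seed=1):
    """Invented data: independent binary factors; only factors 0 and 1 interact."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_people):
        x = [rng.random() < 0.3 for _ in range(n_factors)]
        # Baseline risk of 5%, plus 20 percentage points only when both
        # factor 0 and factor 1 are present (the planted synergy).
        risk = 0.05 + (0.20 if x[0] and x[1] else 0.0)
        rows.append((x, rng.random() < risk))
    return rows


def interaction_z(rows, i, j):
    """z-score of the additive interaction contrast R11 - R10 - R01 + R00."""
    cells = [(a, b) for a in (False, True) for b in (False, True)]
    cases = {c: 0 for c in cells}
    totals = {c: 0 for c in cells}
    for x, y in rows:
        key = (x[i], x[j])
        totals[key] += 1
        cases[key] += y  # True counts as 1
    if min(totals.values()) == 0:
        return 0.0  # an empty cell: the contrast cannot be estimated
    risk = {c: cases[c] / totals[c] for c in cells}
    contrast = (risk[(True, True)] - risk[(True, False)]
                - risk[(False, True)] + risk[(False, False)])
    variance = sum(risk[c] * (1 - risk[c]) / totals[c] for c in cells)
    return contrast / variance ** 0.5 if variance > 0 else 0.0


def scan_pairs(rows, n_factors, alpha=0.05):
    """Return the pairs whose contrast passes a Bonferroni-corrected z-test."""
    pairs = list(itertools.combinations(range(n_factors), 2))
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * len(pairs)))
    return [(i, j) for i, j in pairs if abs(interaction_z(rows, i, j)) > z_crit]


rows = simulate()
flagged = scan_pairs(rows, n_factors=10)
print("flagged pairs:", flagged)
```

Because the truth is known in simulated data, one can check whether the planted pair (0, 1) is among the flagged pairs and whether any of the other 44 pairs were flagged by mistake. Shrinking the sample size or the size of the planted effect in this toy setup is a simple way to see when the signal becomes hard to separate from noise.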

Third, in our register-based study of approximately 1500 schizophrenia cases and 3000 matched controls, we did identify main risk factors that have been replicated in other studies, but we identified no interactions with our analytic pipeline. Since we unfortunately did not have information on all previously reported interactions, we cannot draw conclusions regarding their replicability. Nonetheless, applying data-driven techniques in future studies with more neurodevelopmental features and even larger sample sizes can be a fruitful way of identifying novel modifiable risk factors for psychoses.

In summary, machine learning techniques can have a place in identifying complex relationships between risk factors and psychiatric disorders. However, the algorithms do not provide any shortcuts to knowledge. The datasets need to be big, and it is important to apply rigorous control for false positive findings; otherwise, the risk of identifying false findings remains high.

The author

David Gyllenberg, MD, PhD, is an assistant professor (tenure track) in the INVEST flagship at the Research Centre for Child Psychiatry, University of Turku; a visiting researcher at the Finnish Institute for Health and Welfare; and works clinically at the Department of Adolescent Psychiatry, Helsinki University Hospital.


Baker, M. (2016). Is there a reproducibility crisis? A Nature survey lifts the lid on how researchers view the crisis rocking science and what they think will help. Nature, 533(7604), 452-5.

Gyllenberg, D., McKeague, I. W., Sourander, A., & Brown, A. S. (2020). Robust data-driven identification of risk factors and their interactions: A simulation and a study of parental and demographic risk factors for schizophrenia. International Journal of Methods in Psychiatric Research, e1834. doi:10.1002/mpr.1834

