Your study should not be like a mansion

Lately, I’ve been coming across a lot of proposed study designs that were like mansions. There I was, appreciating the well-proportioned main research questions and the generosity of the outcome measures, when a little door opened in the panelling, and it became evident there were whole wings beyond the part where I came in; wings with other measures, sometimes in wildly different styles, and objectives of their own, and additional treatments, and turrets and gargoyles and intervening variables. The wings were somehow connected to the hall I had entered by, making for one big, rambling complex. Yet they were somehow separable, in that you could live for years in one without needing to go into the others. Indeed, you could imagine them being parcelled off into entirely separate flats.

[Image: Schloss Ringberg, Bavaria]
Schloss Ringberg, Bavaria. Your study really should not be like this. If you want to see why, read about the lives of Friedrich Attenhuber and Duke Luitpold.

Your study should not be like a mansion. It should be more like a single room than a mansion. Your study should follow the principles of Bauhaus or Japanese minimalism. Clutter should be removed until rock-bottom simplicity has been achieved; then the design should be decluttered all over again. The ambition should be repeatedly refined and made narrower. There should ideally be a single objective. Outcomes should be measured with the best available measure, and no others. Control variables should be designed out of existence where possible. Mediators and moderators – do you need them? Why? You haven’t answered the first question yet. The analysis strategy should have the aching simplicity of Arvo Pärt’s Spiegel im Spiegel. Anything that can be put off to a future study should be, leaving this one as clear and austere as humanly possible.

I am aware that I always made my studies too complicated in the past, and I see the desire to do so almost without exception in the younger researchers I work with. I am wondering where the desire to over-complicate things comes from.

Part of it, I am sure, comes from the feeling that there is a potential upside to having more measures, and no cost. You’ve got the people there anyway; why not give them that extra personality questionnaire? Or stick in that extra measure of time perspective, or locus of control, or intolerance of uncertainty? The extra burden on them is small; and surely, if you have a superset of the things you first thought of, then you can find out all the things you first thought of, and maybe some more things as well.

We were taught to think this way by the twin miracles of multiple regression and the factorial experimental design. The first miracle meant, we thought, that we could put more predictors in our statistical model without undermining our ability to make estimates of the effects of the ones we already have. In fact, things might even get better. Our R² value would only go up with more ‘control’ variables, and our estimates would become more precise because we had soaked up more of the extraneous variance.
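To see the seduction concretely, here is a minimal sketch in Python (all numbers and variable names invented for illustration): R² can only go up as covariates are added, even covariates that are pure noise.

```python
# Minimal sketch: R^2 never decreases as predictors are added, even pure
# noise ones. All names and numbers here are illustrative, not from any study.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
y = 0.5 * x1 + rng.normal(size=n)   # true model: only x1 matters

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])     # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

X = x1.reshape(-1, 1)
for k in range(5):
    print(f"{k} junk covariates: R^2 = {r_squared(X, y):.4f}")
    X = np.column_stack([X, rng.normal(size=n)])  # append another noise column
```

The printed R² creeps upward with every junk column; nothing about the model has actually improved.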

The second miracle meant, in an experimental study, that we could cross-factor an additional treatment with the first, without affecting our ability to see the effects of the existing one. Let’s do the thing we planned, but have half the participants do it in an inflatable paddling pool, or wearing noise-cancelling headsets. Our ability to detect the original effect will still be there when we average across this treatment. And we will know about the effects on our outcome of being in a paddling pool, to boot!

The truth is, though, that nothing comes for free. Cross-factoring another experimental treatment can make it difficult to say anything very generalizable about the effects of the original treatment. We wanted to know whether, in the world, caffeine improves memory performance, and we discover that whether it helps or hinders depends on whether you are standing in a paddling pool or not. But in life, in the real-world conditions where one might use caffeine to boost memory, one has not, as a rule, been asked to stand in a paddling pool. What then is the take-home message?
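A toy simulation (everything invented for illustration) makes the problem concrete: suppose caffeine genuinely helps memory on dry land and genuinely hurts it in the pool. The averaged effect you recover is perfectly good arithmetic, and it answers no question anyone would ask in the world.

```python
# Hedged sketch: a cross-factored treatment that interacts with the original
# one. The effect sizes (+1 dry, -1 in the pool) are invented.
import numpy as np

rng = np.random.default_rng(1)
n = 4000
caffeine = rng.integers(0, 2, n)   # original treatment
pool = rng.integers(0, 2, n)       # cross-factored treatment
memory = caffeine * (1 - pool) - caffeine * pool + rng.normal(size=n)

marginal = memory[caffeine == 1].mean() - memory[caffeine == 0].mean()
dry = memory[(caffeine == 1) & (pool == 0)].mean() - memory[(caffeine == 0) & (pool == 0)].mean()
wet = memory[(caffeine == 1) & (pool == 1)].mean() - memory[(caffeine == 0) & (pool == 1)].mean()
print(f"averaged over pools: {marginal:+.2f}  dry land: {dry:+.2f}  in pool: {wet:+.2f}")
# Roughly +0.00, +1.00, -1.00: the marginal effect describes neither condition.
```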

As for the miracle of multiple regression, this is even more problematic. The idea that including some extra variable X2 in your regression leaves you still able to estimate the effect of X1 on Y in an unbiased way holds only in a subset of the possible cases, namely when X2 has an effect on Y but is not affected by X1, Y, or any of their unmeasured consequences. It is very hard to be sure that these conditions apply to your study. This fact is not widely appreciated, with the consequence that whole swathes of the social and behavioural sciences put far too many variables in their regressions, many of which should not be there (see here and here; I am looking at you, sociology, and you, epidemiology). Your thing does not become more true if you have controlled for more other things; it usually becomes more obscure. In fact, if you see your effect in a complex analysis with lots of additional covariates (especially if you see it only then), this increases the chances that it is in fact a statistical artefact (see here for a case study).
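If you want to convince yourself, here is a minimal collider sketch (variables and numbers invented): X1 has no effect on Y at all, but ‘controlling’ for X2, a common consequence of both, conjures an association out of nothing.

```python
# Hedged sketch of collider bias. Truth: Y is independent of X1. X2 is caused
# by both X1 and Y, so conditioning on it manufactures an association.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(size=n)
y = rng.normal(size=n)              # Y has nothing to do with X1
x2 = x1 + y + rng.normal(size=n)    # collider: downstream of both

def coef_on_x1(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]                   # coefficient on x1

print(f"Y ~ X1     : {coef_on_x1(x1.reshape(-1, 1), y):+.3f}")          # ~ +0.000
print(f"Y ~ X1 + X2: {coef_on_x1(np.column_stack([x1, x2]), y):+.3f}")  # ~ -0.500
```

Adding the ‘control’ variable turns a true null into a sizeable, highly ‘significant’ coefficient.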

Another exacerbating factor is psychology’s obsession with identifying mediators. It’s all very well to show how to change some outcome, but what’s the mechanism by which your intervention works? Does it work by changing self-esteem, or locus of control, or stress? Again, we were taught we could answer mechanism questions at no cost to the integrity of our study by throwing in some potential mediating variables and running a path analysis (where you run your regression model first without and then with the inclusion of the potential mediator, and compare results). But, again, with the exception of some special cases, doing this is bad. Not only does adding a mediator often lead to overestimation of the degree of mediation, it actually imperils your estimate of the thing you cared about in the first place, the average causal effect. There is a whole slew of papers on this topic (here, here and here), and they all come to the same conclusions. Don’t clutter your study with mediators in the first instance; they will probably confuse the picture. Identify your causal effect properly and simply. Answering further questions about mechanism will be hard and will probably require new studies – maybe whole careers – dedicated to just that. (Similar comments apply to moderators.)
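A toy simulation (scenario entirely invented) shows how the with-and-without comparison goes wrong. Here the treatment is randomised and its true effect on the outcome is 1.0; the candidate mediator has no effect at all, but shares an unmeasured cause with the outcome.

```python
# Hedged sketch: 'mediation' analysis with an unmeasured mediator-outcome
# confounder U. Truth: T raises Y by 1.0 directly; M does nothing to Y.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
t = rng.integers(0, 2, n).astype(float)   # randomised treatment
u = rng.normal(size=n)                    # unmeasured confounder of M and Y
m = 0.5 * t + u + rng.normal(size=n)      # candidate mediator
y = 1.0 * t + u + rng.normal(size=n)      # outcome: unaffected by M

def slopes(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.round(beta[1:], 2)

print("Y ~ T    :", slopes(t.reshape(-1, 1), y))        # ~ [1.0]  (correct)
print("Y ~ T + M:", slopes(np.column_stack([t, m]), y)) # ~ [0.75, 0.5]
# The second model deflates the effect of T and credits M with a 'mediating'
# role it does not have. The simple model answered the question correctly.
```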

What underlies the impulse to over-complicate is, at root, fear of being found insufficient. If I have only one predictor/manipulation and one outcome, how will I be judged? Can I still get published? Does it look too simple? What if the result is null? This is what I hate most about science’s artificial-scarcity-based, ‘significance’-biased, career-credentialing publication system. People feel they need a publishable unit, in a ‘good’ journal, which means they have to have a shiny result. They feel like they can increase their chances of getting one by putting more things in the study. This imperative trumps actual epistemic virtue.

So, complexity creeps in as a kind of bet-hedging against insecurity. Let’s add in that explicit measure of the outcome variable, as well as the implicit one. In fact, there are a couple of different explicit scales available: let’s have them both! That gives us lots of possibilities: the explicit measures might both work, but not the implicit one; or one of the explicit measures might look better than the other. There might even be an interaction: the treatment might affect the implicit measure in participants who score low on the explicit measures – wouldn’t that be cool? (Answer: no.) Even if the intervention does not work, we might get a different paper validating the different available measures against one another. But the problem is that you can’t make a study which is at the same time an excellent validation study of some different measures of a construct, and also a test of a causal theory in that domain. It looks like a capacious mansion, but it’s just a draughty old house none of whose wings is really suitable to live in.

If you put in more objectives, more measures, and more possible statistical models, you are more likely to get a statistically significant result, by hook or by crook. This does not make the study better. We are drowning in statistically significant results: every paper in psychology (and there are a lot of papers) contains many of them. It’s not clear what they all mean, given the amount of theoretical wiggle room and multiple testing that went into their construction. Their profusion leads to a chaotic overfitting of the world with rococo ‘theories’ whose epistemic lessons are unclear. We need fewer new significant results, and more simple and clear answers (even descriptive ones) to more straightforward questions. Your study could be the first step.
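The arithmetic behind this flood is unforgiving: with everything truly null, the chance of at least one ‘significant’ result at α = .05 across k independent tests is 1 - 0.95^k. A two-line sketch:

```python
# Probability of at least one false-positive 'finding' among k independent
# true-null tests at alpha = .05.
for k in (1, 3, 10, 20):
    print(f"{k:2d} tests: P(at least one p < .05) = {1 - 0.95**k:.2f}")
# 1 test: 0.05, 3 tests: 0.14, 10 tests: 0.40, 20 tests: 0.64
```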

Perhaps the main unappreciated virtue of simpler studies, though, is that they make the researcher’s life more pleasant and manageable. (Relatedly, an often overlooked benefit of open science is that it makes doing science so much more enjoyable for the researcher.) When you double the number of variables in a study, you increase the number of analyses you might conceivably run by at least a factor of eight, and perhaps more. Don’t tell me you will have the strength of character not to run them all, or that, having discovered that one of those analyses gets a cute little significance star, you will not fret about how to reframe the study around it. You will spend months trying out all the different analyses and not be able to make up your mind. This will be stressful. You will dither between the many possible framings of the study you could now write. Your partner will forget what you look like. Your friends’ children will no longer be toddlers and will have PhDs and children of their own. Under socialism, data analysis will be simpler than seems even imaginable under the existing forces and relations of production. Until then, consider voluntary downsizing of your mansion.

Note. Some studies have a lot of measures by design. I am talking about ‘general purpose’ panel and cohort studies like NHANES, Understanding Society, the SOEP, and the UK National Child Development Study. Rather than being designed to answer a specific question, these were envisaged as a resource for a whole family of questions, and their datasets are used by many different researchers. They have thousands of variables. They have been brilliant resources for the human sciences. On the other hand, using them is full of epistemic hazard. Given the profusion of variables and possible analyses, and the large sample sizes, you have to think about what null-hypothesis significance testing could possibly mean, and maybe try a different approach. You should create a Ulyssean pact before you enter their territories, for example by pre-registering a limited set of analyses even though the data already exist, and by pre-specifying smallest meaningful association strengths, rather than null hypotheses. Even in these studies, the designers are conscious of trying not to have too many alternative measures of the same thing. Still, it remains the case that a lot of what I say in this post does not really apply to the designers of those projects. Your study should not be like a mansion, unless it actually is a mansion.
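As a coda, here is a hedged sketch (all numbers invented) of what that pact might look like in practice: test against a pre-registered smallest meaningful association rather than against zero, using the standard Fisher z interval for a correlation.

```python
# Hedged sketch: in a huge panel dataset, a tiny correlation is wildly
# 'significant' against zero, yet demonstrably smaller than a pre-registered
# smallest meaningful association. All values are invented.
import math

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a Pearson r via the Fisher z transform."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

r_min = 0.10              # pre-registered smallest meaningful association
r_obs, n = 0.06, 20_000   # observed correlation in the big panel
lo, hi = fisher_ci(r_obs, n)
print(f"95% CI for r: [{lo:.3f}, {hi:.3f}]")
print("meaningful?", "no" if hi < r_min else "cannot rule it out")
```

The interval, roughly [0.046, 0.074], excludes zero comfortably, but it also sits entirely below the pre-specified threshold; the pact forces you to say so.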