Bayes Factor blues, and an unfashionable defence of p-values

Like many researchers, I have been trying to up my inferential game recently. This has involved, for many projects, abandoning the frequentist Null Hypothesis Significance Testing (NHST) framework, with its familiar p-values, in favour of information-theoretic model selection and, more recently, Bayesian inference.

Until last year, I had been holding out with NHST and p-values specifically for experimental projects, although I had mostly abandoned them for epidemiological ones. Why the difference? Well, in the first place, in an experiment or randomized controlled trial, but usually not in an epidemiological study, the null hypothesis of no effect is actually meaningful. It really is something you are genuinely interested in the truth of. Most possible experimental interventions are causally inert (you know, if you wear a blue hat you don’t run faster; if you drink grenadine syrup you don’t become better at speaking Hungarian; and if you sing the Habanera from Bizet’s Carmen in your fields, as far as I am aware, your beetroots don’t grow any faster). So, although the things we try out in experiments are not usually quite as daft as this, most interventions can reasonably be expected to have zero effect. This is because most things you could possibly do – indeed, most of the actions in the universe – will be causally inert with respect to the specific outcome of interest.

In an experiment, a primary thing we want to know is: does my intervention belong to the small class of things that have a substantial effect on this outcome, or, more likely, does it belong to the limitless class of things that make no real difference? We actually care about whether the null hypothesis is true. And the null hypothesis really is that the effect is zero, rather than just small – precisely zero in expectation – because assignment to experimental conditions is random. Because the causally inert class is very large whereas the set of things with some causal effect is very small, it makes sense for the null hypothesis of no effect to be our starting point. Only once we can exclude it – and here comes the p-value – do other questions, such as how big our effect is, what direction it goes in, whether it is bigger than those of other available interventions, and what mediates it, become focal.

So, I was pretty happy with NHST uniquely for the case of simple, fully designed experiments where a null of no effect was relevant and plausible, even if this approach was not optimal elsewhere.

However, in a few recent experimental projects (e.g. here), I turned to Bayesian models with the Bayes Factor taken as the central criterion for which hypothesis (null or non-null) the data support. (For those of you not familiar with the Bayes Factor, there are many good introductions on the web.) The Bayes Factor has a number of appealing features as a replacement for the p-value as a measure of evidential strength (discussed here, for example). First, it is (purportedly, see below) a continuous measure of evidential strength: as the evidence for your effect gets stronger, it gets bigger and bigger. Second, it can also provide evidence for the null hypothesis of no effect. That is, a Bayes Factor analysis can in principle tell you the difference between your data being inconclusive with respect to the experimental hypothesis, and your data positively supporting the null hypothesis. The p-value cannot do this: a non-significant p-value is, by itself, mute on whether the null is true, or the null is false but you don’t have enough data to be confident this is the case.

Finally, perhaps the most appealing feature of the Bayes Factor is that you can continuously monitor the strength of evidence as the data come in, without inflating your false positive rate. Thus, instead of wastefully testing to some arbitrary number of participants, you can let the data tell you when they decisively support one hypothesis or the other, or when more information is still needed.

All of these features were useful, and off I and my collaborators set. However, as I have learned the hard way, things get sketchier behind the facade. First, the Bayes Factor is terribly sensitive to the prior chosen for the effect of interest (even if you compute it via the widely used and simple Savage-Dickey density ratio). And often, there are multiple non-stupid choices of prior. With modest amounts of data, these choices can give you Bayes Factors not just of different magnitudes, but actually in different directions.
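To make the prior-sensitivity point concrete, here is a minimal sketch (not any particular package’s implementation) of the Savage-Dickey density ratio for the simplest possible case: a normal mean with known error variance, testing H0: μ = 0 against H1 with a zero-centred normal prior on μ. The prior widths and the simulated small true effect are illustrative assumptions; the point is that the same data can give Bayes Factors on opposite sides of 1 depending on the prior width.

```python
import numpy as np
from scipy.stats import norm

def bf01_savage_dickey(x, sigma, prior_sd):
    """BF01 for H0: mu = 0 vs H1: mu ~ N(0, prior_sd^2), with data
    x ~ N(mu, sigma^2) and sigma known (the conjugate normal case).
    Savage-Dickey: BF01 = posterior density at 0 / prior density at 0."""
    n = len(x)
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (n * np.mean(x) / sigma**2)
    return norm.pdf(0, post_mean, np.sqrt(post_var)) / norm.pdf(0, 0, prior_sd)

rng = np.random.default_rng(1)
x = rng.normal(0.2, 1.0, size=50)  # small true effect, modest n

# Same data, three defensible-looking priors: wider priors put less
# density at 0, which pushes BF01 upwards, i.e. towards the null.
for prior_sd in (0.1, 1.0, 10.0):
    print(prior_sd, bf01_savage_dickey(x, 1.0, prior_sd))
```

The direction of the effect is mechanical: a very diffuse prior under H1 spreads its mass thinly, so even a modest posterior peak near zero beats it, and the Bayes Factor drifts towards the null.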

And then, as I recently discovered, it gets worse than this. For modest samples and small effects (which, realistically, is the world we live in), Bayes Factors have some ugly habits. As the amount of data increases in the presence of a true experimental effect, the Bayes Factor can swing from inconclusive to rather decidedly supporting the null hypothesis, before growing out of this phase and deciding that the experimental hypothesis is in fact true. From the paper, it is a bit unclear whether this wayward swing is often numerically large enough to matter in practice. But, in principle, it would be easy to stop testing in this adolescent null phase, wrongly concluding that you have no experimental effect. If substantive, this behaviour would undermine one of the key attractions of using the Bayes Factor in the first place.

Disorderly behaviour of the Bayes Factor as sample size increases in the presence of a true effect. Shown is the log Bayes Factor for the null (i.e., higher means more support for the null). From this paper by Leendert Huisman.

What to do? Better statisticians than me will doubtless have views. I will however make a couple of points.

The first is that doing Bayesian inference and using Bayes Factors are very much not the same thing. Indeed, the Bayes Factor is not really an orthodox part of the Bayesian armamentarium. People who are Bayesians in their souls don’t discuss or use Bayes Factors at all; they may, for all I know, even regard them as degenerate or decadent. They estimate parameters and characterise their uncertainty around those estimates. The sudden popularity of Bayes Factors represents experimentalists doing NHST and wanting a Bayesian equivalent of the familiar ‘test of whether my treatment did anything’. As I mentioned at the start, in the context of the designed experiment, that seems like a reasonable thing to want.

Second, there are frequentist solutions to the shortcomings of reliance on the p-value. You can (and should) correct for multiple testing, and prespecify as few tests as possible. You can couple traditional significance tests on the null with equivalence testing. Equivalence testing asks the positive question of whether my effect is positively equivalent to zero, not just whether it is positively different from zero. The answers to the two questions are not mutually coupled: with an inconclusive amount of data, your effect is neither positively different from zero, nor positively equivalent to it. You just don’t know very well what it is. With NHST coupled to equivalence testing, you can hunt your effect down: has it gone to ground somewhere other than zero (yes/no); has it gone to ground at zero (yes/no); or is it still at large somewhere? Equivalence testing should get a whole lot easier now with the availability of the updated TOSTER R package by Aaron Caldwell, building on the work of Daniel Lakens.
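The logic of equivalence testing is simple enough to sketch in a few lines. Below is a bare-bones one-sample TOST (two one-sided tests) in Python; the TOSTER package does this and much more in R, and the ±0.2 equivalence bounds here are an illustrative assumption, not a recommendation. You test whether the mean is significantly above the lower bound and significantly below the upper bound; only if both tests reject do you declare practical equivalence to zero.

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, low, high):
    """Two one-sided tests (TOST) for equivalence of a mean to zero.
    Returns the TOST p-value: equivalence is declared at level alpha
    if this is below alpha (i.e. both one-sided tests reject)."""
    n = len(x)
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(n)
    t_low = (m - low) / se             # H0: mu <= low
    t_high = (m - high) / se           # H0: mu >= high
    p_low = stats.t.sf(t_low, df=n - 1)    # upper-tail test vs low
    p_high = stats.t.cdf(t_high, df=n - 1) # lower-tail test vs high
    return max(p_low, p_high)

rng = np.random.default_rng(0)
x = rng.normal(0.01, 1.0, size=400)    # tiny true effect
p_equiv = tost_one_sample(x, -0.2, 0.2)  # is it within +/-0.2 of zero?
p_null = stats.ttest_1samp(x, 0).pvalue  # is it different from zero?
print(p_equiv, p_null)
```

With little data, both p-values tend to be non-significant: the effect is neither positively different from zero nor positively equivalent to it, exactly the “still at large” case described above.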

Equivalence testing, though, does force us to confront the interesting question of what it means for the null hypothesis to be true enough. This is a bit weaker than it being truly true, but not quite the same as it being truly false, if you take my meaning. For example, if your intervention increases your outcome measure by 0.1%, the null hypothesis of zero is not literally true; you have somehow done something. But, dammit, your mastery of the causal levers does not seem very impressive, and the logical basis or practical justification for choosing that particular intervention is probably not supported. So, in equivalence testing, you have to decide (and you should do so in advance) what the range of practical equivalence to zero is in your case – i.e. what size of impact you are going to identify as too small to support your causal or practical claim.

So that seems to leave the one holdout advantage of the Bayes Factor approach, the fact that you can continuously monitor the strength of evidence as the data come in, and hence decide when you have enough, without causing grave problems of multiple testing. I won’t say much about this, except that there are NHST methods for peeking at data without inflating false positives. They involve correcting your p-values, and they are not unduly restrictive.
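One family of such methods is the group-sequential design, where you test at each interim look against a stricter per-look threshold (Pocock’s constant for five looks at an overall α of 0.05 is roughly 0.0158). The simulation below, with illustrative numbers, sketches the idea: under a true null, naive peeking at α = 0.05 inflates the false-positive rate well above the nominal level, while the stricter per-look threshold keeps it close to 0.05.

```python
import numpy as np
from scipy import stats

def peek_trial(rng, looks, per_look_alpha, n_per_look=20):
    """Run one null-effect 'experiment', testing after each batch of
    participants; return True if any interim test rejects."""
    x = np.empty(0)
    for _ in range(looks):
        x = np.concatenate([x, rng.normal(0.0, 1.0, n_per_look)])
        if stats.ttest_1samp(x, 0).pvalue < per_look_alpha:
            return True  # stop early and (falsely) declare an effect
    return False

rng = np.random.default_rng(42)
sims = 2000
# Naive peeking: test at alpha = 0.05 after every look.
naive = np.mean([peek_trial(rng, 5, 0.05) for _ in range(sims)])
# Pocock-style correction: stricter threshold at every look.
pocock = np.mean([peek_trial(rng, 5, 0.0158) for _ in range(sims)])
print(naive, pocock)  # naive rate is inflated; corrected rate is not
```

The corrected thresholds cost a little power at any single look, but they buy the right to stop whenever the evidence is sufficient, which is the same practical benefit the Bayes Factor monitoring approach promises.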

So, if your quest is still for simple tests of whether your experimental treatment did something or nothing, the Bayes Factor does have some downsides, and you still have other options.

The paradoxes of relational mobility

In some societies, people perceive that others can desert their existing social relationships fairly easily, in favour of alternative partners. In other societies, people feel their social relationships are more permanent fixtures, never able to be abandoned. Let’s call these high-relational-mobility societies and low-relational-mobility societies respectively. It seems intuitive that people’s trust of one another will be greater in low-relational-mobility societies than in high-relational-mobility societies. Why? Well, in those societies, you have the same interaction partners for a long time; you can know that they aren’t just going to walk away when they get a better offer; they are in it for the long term, and so their time horizon is indefinite. Seems like a recipe for trust.

Interestingly, the empirical relationship is the other way around: where relational mobility is high, people have higher trust. Moreover, as shown in a recent paper by Sakura Arai, Leda Cosmides and John Tooby, individuals who perceive that others could walk away from them at any moment are actually more trustworthy, and less punitive. How can we explain this apparently paradoxical relationship?

The answer resembles classical arguments for the invisible hand in economics. If I am in a market where buyers can switch vendors easily, there are many vendors, and it is easy for new vendors to enter, then I can be pretty confident that the price and service will be good. A vendor who took excessive profits, who offloaded costs onto the buyer, or who was generally obnoxious would instantly and en masse be deserted. In a competitive and free market with low entry costs, I can trust that pretty much any partner I meet should treat me ok, merely from the fact of their existence.

Two allegories. First: high-relational-mobility societies are like what Parisians believe restaurants in Paris to be: necessarily good, because there are so many restaurants in Paris, and Parisian diners are so discerning, that any restaurant that was not amazing and good value would already have ceased to exist. (By the way, from my own experience, I am extremely sceptical about this – not the cogency of the explanation, but the generalisation about Parisian restaurants that it is supposed to explain. I have however heard it from several independent sources.)

Second allegory: low-relational-mobility societies are more like the academic publishing market. We are stuck with a few massive actors (you know the ones), and our individual addictions to the status and prestige indicators they control mean that we, as a community, accept appalling behaviour – profit gouging, dubious editorial practices, frankly crap service to authors – rather than walking away.

In a world where the others in your social network have the option to fairly easily walk away, you have to treat those others pretty well (so that they won’t); and, they have to treat you pretty well (so that you won’t). Of course, if this meant that all relationships became transitory, ephemeral interactions, this might become pretty lonely and grim. That is not necessarily the case however: the experience of interpersonal intimacy is actually higher in high-relational-mobility societies. People value deep, durable and predictable relationships; relational mobility gives them the chance to cultivate those that suit them; and use the nuclear option to ensure a minimum acceptable threshold.

This also links to coercion and punishment. Social relationships inherently involve conflicts of interest. Thus, at some point in a social relationship, you always find yourself wanting someone to do something different from what they spontaneously want to do. At this point, one option is to punish them: to impose costs on them that they will find aversive. This might sometimes be effective in changing their behaviour, but it’s a horrible and humiliating way to treat someone.

If you know a person cannot walk away, punishment is a pretty effective tool, since it changes the relative payoffs of their different options in favour of the one you want them to choose; and they can’t do much about it. But, if you know that someone subjected to the humiliation of punishment could just exit, you’d be much better off not trying to punish them – why should they put up with it? You should compromise on your demands of them instead. Thus, counter-intuitively, a good outside option on both sides can in principle make social relationships more dependable, more mutually beneficial, and freer from interpersonal domination. (Though of course, if one party has exit options and the other doesn’t, that’s an asymmetry of power, and not likely to be healthy.)

People with higher relational mobility scores (that is, the perception that others are more relationally mobile) pay less to punish in an economic game, among participants from Japan and the US. From this paper.

This is a rich idea, foreshadowed and probed in Albert Hirschman’s classic book Exit, Voice and Loyalty. It seems to tie together lots of disparate applications. To link to one of my other areas of interest, it crops up in one of the arguments for Universal Basic Income. In a world where UBI gives every individual a minimal walk-away option from every job, the labour market should get better. Humiliating and unhealthy employment practices should be lessened, as employers who treated their employees this way would go the (alleged) way of bad Parisian restaurants. This leads to the counterintuitive prediction that people would work more, or at least more happily and productively, in a world where they were paid a bit for not working.

More generally, relational mobility could play some role in explaining the expanding moral circle, the observation that as societies have become richer and more urbanised, the unacceptability of humiliation and cruelty has deepened, and been extended to successively broader sets of others. Surely, if modern economic development has done one thing, it has increased relational mobility, though unevenly (more for the rich than the poor for example, more in the metropolis than the periphery). Perhaps this is the cultural consequence.

At least, modern economic development has increased relational mobility relative to (often authoritarian) agrarian and early modern societies. Some foraging societies were probably rather different. There’s a long-standing anthropological argument that one factor maintaining egalitarianism and freedom from domination in mobile hunter-gatherers is the ability of the dominated to simply melt away and go elsewhere. Much more difficult when you are tied to a particular plot of land or irrigation resource.

To pivot to an entirely different level of analysis, narcissists, famously, have highly conflictual interpersonal relationships that nevertheless persist for years (often at great cost to the partner). Narcissists are particularly prone to trying to control their partners through punishment. Although they frequently threaten to leave (presumably as a punishment), they seldom actually do. One thing that may be going on here is that narcissists have such an inflated sense of their own worth that they struggle to believe their partners could have outside options (there is some evidence consistent with this). Thus, they prefer to stay, and manipulate their partner into continuing to provide benefits to them, without feeling any imperative to treat that partner well in turn.

All in all, the topic of relational mobility, at all kinds of scales, seems like an important one that requires further unifying research and theory. Is higher relational mobility an unalloyed good? Does it come with particular psychological or political costs? How does it relate to the balance of kin-based and non-kin based relationships, which has been implicated in the social evolution of trust and of economic institutions? What role does it play in ‘modernity’ more generally?

Perhaps most pressingly for me, can the power of relational mobility be harnessed from the political left? The celebration of the positive power of consumer choice has come to be strongly associated with the neoliberal right. It’s easy to see through the smokescreen here: neoliberal dismantling and privatisation of public services was rhetorically justified by the progressive power of consumer choice. In many cases it actually ended up meaning the handover of a lot of public and household money to unaccountable capitalist oligopolies (chumocracies, indeed), without much practical empowerment of the citizen. Still, people on the social democratic left have an uneasy relationship with the idea that the citizens ought to be able to choose, including choosing to opt out. Maybe, though, as in the Universal Basic Income example above, there are instances where the left can make friends with the idea.