Deflating Bayes’ Theorem
It’s a trivial arithmetical fact that “x percent of y equals y percent of x”. But it can be useful to bear it in mind. For example, I find it harder to figure out “8% of 50” than “50% of 8” — the latter is obviously 4, because 50% of any quantity is simply one half of that quantity, and I can tell that half of 8 is 4 “without even thinking about it”.
The algebra of “x percent of y equals y percent of x” is completely straightforward:
$$\frac{x}{100} \times y = \frac{x \times y}{100} = \frac{y}{100} \times x$$
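As a quick sanity check of the identity above, here is a minimal Python sketch. The function name `percent_of` is just my own illustrative choice, and exact fractions are used so that rounding plays no part.

```python
from fractions import Fraction

def percent_of(x, y):
    """Return x percent of y as an exact fraction."""
    return Fraction(x, 100) * y

# "8% of 50" and "50% of 8" are the same quantity.
print(percent_of(8, 50))   # 4
print(percent_of(50, 8))   # 4
```

Swapping the two arguments never changes the result, because multiplication is commutative.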
Percentages are proportions, and in many situations we can figure out proportions of cardinal numbers (i.e. how many members are in a given set) even when we don’t know the cardinal numbers themselves. For example, I don’t know how many Uruguayans there are, but I do know that about half of them must be female, because Uruguayans are human, and about half of all humans are female.
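Proportions are expressed by fractions, and the algebra of fractions can be useful in many other less-than-obvious ways. For example, for any non-zero values of x, y and a, the following holds, because each side reduces to $z/a$:
$$\frac{z}{x} \times \frac{x}{a} = \frac{z}{y} \times \frac{y}{a}$$
If we divide both sides of this equation by $x/a$, we get:
$$\frac{z}{x} = \frac{\frac{z}{y} \times \frac{y}{a}}{\frac{x}{a}}$$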
Let us call this last equation Φ. Please note that Φ is true for all non-zero values of a, x and y, no matter what quantities they stand for. For example, suppose Zeke has spoken to his rival Xavier, but they have yet to meet in person. Zeke wonders: when they do meet, will he be the taller of the two? If so, how much taller? He knows where he stands height-wise with Yuri, because they shared a house when they were students. He has a photograph of Yuri standing next to Angela, and has seen Angela standing next to Xavier in Zoom meetings, and he has noted their height ratios. Even if he does not know any of their actual heights, this is enough information to estimate how much taller (or shorter) than Xavier he will turn out to be (choosing the obvious schema: a = Angela’s height, and so on).
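To make Zeke's reasoning concrete, here is a small Python sketch of equation Φ. The three height ratios are invented purely for illustration, and the function name `phi` is my own label rather than anything standard.

```python
from fractions import Fraction

def phi(z_over_y, y_over_a, x_over_a):
    """Equation Phi: recover z/x from the ratios z/y, y/a and x/a."""
    return z_over_y * y_over_a / x_over_a

# Hypothetical ratios noted by Zeke (no actual heights are needed).
zeke_to_yuri = Fraction(19, 20)      # z/y: Zeke is a little shorter than Yuri
yuri_to_angela = Fraction(9, 8)      # y/a: Yuri is taller than Angela
xavier_to_angela = Fraction(19, 16)  # x/a: Xavier is taller than Angela

zeke_to_xavier = phi(zeke_to_yuri, yuri_to_angela, xavier_to_angela)
print(zeke_to_xavier)  # 9/10
```

On these made-up figures, Zeke would turn out to be nine tenths of Xavier's height, a result reached without knowing anyone's height in centimetres.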
The usefulness of equation Φ isn’t limited to special quantities, such as those that are no greater than one, or to positive values only, or even to real numbers. But it really comes into its own when it is applied to sets which intersect. (In what follows, I use the words ‘set’ and ‘class’ interchangeably.)
Suppose that, within a larger “universal” set A (with cardinal number a), a set X (with cardinal number x) and a set Y (with cardinal number y) have intersection Z (with cardinal number z):
[Figure: Venn diagram showing the sets X and Y overlapping in Z, all inside the universal set A.]
As an aid to the imagination, suppose A represents the class of all animals on earth. (“A is for animals” — to make things a bit easier to follow, I’ll use memorable combinations of letter-names and sets, even if they get a bit silly.) If we were able to count each animal individually, we would arrive at the (very large) number a. And suppose X represents the class of animals that lay eggs. (Mnemonic: the letter X sounds like “eggs”.) If we could count these individually, we would again arrive at another large number x, although it would be a considerably smaller number than a. Now suppose Y represents the class of snakes. (Mnemonic: the letter Y looks like a forked tongue.) The number y would also be large, but again, it would be smaller than a. Some snakes give birth to live young, and other snakes lay eggs, so the sets X and Y intersect, making the set Z. (Mnemonic: the letter Z looks like the numeral 2, here standing for the set of animals that belong to both X and Y.) Each of the sets A, X, Y and Z is non-empty, so each of the numbers a, x, y and z is non-zero.
In real life, counting of the above sort is practically impossible, of course. But often we are able to estimate proportions of large numbers like these, just as a moment ago I was able to estimate what proportion of Uruguayans are female. And with equation Φ, proportions rather than cardinal numbers are all we need to know.
Equation Φ allows us to calculate $z/x$ (the proportion of egg-layers that are snakes) from other proportions that we already know, namely $z/y$ (the proportion of snakes that are egg-layers), $y/a$ (the proportion of animals that are snakes), and $x/a$ (the proportion of animals that are egg-layers).
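Here is a worked instance of that calculation. The three input proportions are placeholders I have made up for the sake of the arithmetic, not zoological estimates: suppose $z/y = 7/10$, $y/a = 1/1000$ and $x/a = 9/10$.

```latex
\frac{z}{x}
  = \frac{\frac{z}{y} \times \frac{y}{a}}{\frac{x}{a}}
  = \frac{\frac{7}{10} \times \frac{1}{1000}}{\frac{9}{10}}
  = \frac{7}{10000} \times \frac{10}{9}
  = \frac{7}{9000}
```

On these assumed figures, about 7 in every 9000 egg-layers would be snakes, and at no point did we need the cardinal numbers a, x, y or z themselves.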
There’s a field of study in which we routinely deal with proportions rather than with cardinal numbers, namely, statistics. In statistics, equation Φ is widely known as Bayes’ Theorem, although in the present form it hardly deserves the name “theorem”, which suggests something that calls for a “proof”; it is such a modest bit of algebra. A brief digression into statistical probability is all we need to re-write Φ in terms more familiar to students of statistics.
One of my main contentions is that we can (and should) understand all numerical estimates of “probability” in terms of proportions like the ones we’ve just been discussing. For example, consider the claim that “the probability of throwing heads when tossing a fair coin is one half”. All this means is that if we were to toss a fair coin an indefinitely large number of times, the proportion of heads that result would approach one half. Note that the actual number of tosses (i.e. the cardinal number of the set of tosses) is not specified — the more the merrier. What matters is the proportion as a limiting value.
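The idea of a proportion settling down as the number of tosses grows can be illustrated with a short simulation. This is only a sketch: it assumes that Python's pseudo-random generator is an adequate stand-in for a fair coin, and the function name is my own.

```python
import random

def relative_frequency_of_heads(n_tosses, rng):
    """Simulate n_tosses of a fair coin and return the proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(0)  # fixed seed so that runs are repeatable
for n in (10, 1_000, 100_000):
    # The printed proportion tends to sit closer to 0.5 as n grows.
    print(n, relative_frequency_of_heads(n, rng))
```

No particular number of tosses is privileged here. What the simulation displays is the proportion, which drifts towards one half as the reference class of tosses gets larger.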
In other words, numerical probabilities should be understood as relative frequencies among pluralities of things or events. So understood, the probability of a specific sort of occurrence (such as getting “heads” when tossing this coin) is the relative frequency of that sort of occurrence in a larger reference class of occurrences (such as tosses of this coin). Since the former is a plurality of things that are relevantly similar to one another in a specified way, I’ll call it the specific class.
Unhappily, most talk of probability has the superficial appearance of referring to a single or isolated event. We often speak of “the probability of throwing a heads” (the indefinite article superficially suggesting a singular event) in order to identify a specific sort of event, i.e. a plurality, the specific class. This can lead us astray. In fact I think this quirk of language is partially responsible for an unfortunate tendency to treat numerical probabilities as expressing something about beliefs: they supposedly “measure credibility”, or something of that sort.
Another quirk of language comes into play too: the reference class is often left implicit. I think these last two façons de parler are important sources of philosophical error, and I will return to them. For now, please note that it is salutary to explicitly identify the classes involved — and they always are involved, even if one of them “goes without saying”.
How does this work with our illustrative example? Let’s re-phrase one of the numerical proportions we’ve been discussing in terms of probabilities. Take the proportion of snakes that lay eggs, say. This is the ratio of the cardinal number of the intersection Z ( = the set of animals that are both snakes and lay eggs) to the cardinal number of the larger set Y ( = the set of animals that are snakes) in the Venn diagram above. In other words, it’s the fraction $z/y$. This quantity corresponds to “the probability of a snake being an egg-layer”. That might sound like a rather odd way of putting it, as we struggle to imagine artificial situations (such as that of Noah, wondering whether this or that snake is an egg-layer as they randomly slither aboard the Ark). However artificial such situations may sound, the containment of sets involved here is the very same as that of situations where we find it quite natural to talk of probability: “the chances of getting heads when tossing a coin”, “the probability of a child being female”, “the likelihood of an Irish person having red hair”, and so on. All such claims depend on a specific class (of coin landings with the heads face up, of births of female children, of Irish people having red hair, etc.) and a reference class (of coin-tosses, of human births, of Irish people chosen at random, etc.). Any numerical proportion so ascribed is a fraction whose numerator is the cardinal number of the relevant specific class, and whose denominator is the cardinal number of the relevant reference class.
I mentioned above that although a reference class is always present in the background of any claim about numerical probability, it’s often left implicit. This happens when the reference class is the universal class, given the context. (Quite often, several universal classes can work just as well for a given context.) For example, a claim about “the probability of an animal being a snake” might be re-phrased as an apparently simpler claim about “the probability of being a snake”, as long as the context makes it clear we are only talking about animals. In such a context, $y/a$ (the numerical proportion of animals that are snakes) corresponds to “the probability of an animal being a snake”, which using a customary notation can be written as P(Y). Likewise $x/a$ (the numerical proportion of animals that are egg-layers) corresponds to “the probability of being an egg-layer”, which can be written as P(X).
In general, a fraction like $s/r$ (expressing a ratio of the cardinal numbers of a specific class S and a reference class R) can be understood as “the proportion of members of class R that are also members of class S”. The numerator s cannot be greater than the denominator r, because S is a subset of R. In terms of probability, it expresses “the probability of members of R also being members of S”.
The reference class of “the probability of a snake being an egg-layer” is not the universal class A of animals in general, but rather Y, the set of snakes, so it has to be made explicit. One way of putting this is to say that “the probability of a snake being an egg-layer” is a conditional probability: the proportion in question expresses the probability of an animal’s being an egg-layer given the prior condition that it is a snake. I don’t much like this way of putting things, since it may suggest, wrongly, that some probabilities do not have a reference class at all.
Given our current schema of letters, and the customary notation for so-called conditional probability, $z/x$ (the numerical proportion of egg-layers that are also snakes) corresponds to “the probability of an egg-layer being a snake” or in other words “the probability of an animal’s being a snake given the prior condition that it is an egg-layer”. In customary notation, this is written P(Y|X). Likewise, $z/y$ (the numerical proportion of snakes that are also egg-layers) corresponds to “the probability of a snake being an egg-layer” or in other words “the probability of an animal’s being an egg-layer given the prior condition that it is a snake”. It is written P(X|Y).
So we have $x/a$ = P(X), $y/a$ = P(Y), $z/x$ = P(Y|X), and $z/y$ = P(X|Y). Substituting these into Φ:
$$\frac{z}{x} = \frac{\frac{z}{y} \times \frac{y}{a}}{\frac{x}{a}} \quad \text{becomes} \quad P(Y \mid X) = \frac{P(X \mid Y) \times P(Y)}{P(X)}$$
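The substitution can be checked by brute-force counting on a toy universe. In the sketch below the sets are arbitrary stand-ins of my own choosing (numbers rather than animals), and `proportion` is a hypothetical helper that divides the cardinal number of a specific class by that of its reference class.

```python
from fractions import Fraction

# Toy universe: the numbers 1..60 stand in for individual animals.
A = set(range(1, 61))
X = {n for n in A if n % 2 == 0}   # stand-in for the egg-layers
Y = {n for n in A if n <= 15}      # stand-in for the snakes
Z = X & Y                          # members of both X and Y

def proportion(specific, reference):
    """Proportion of members of `reference` that are also in `specific`."""
    return Fraction(len(specific & reference), len(reference))

p_x = proportion(X, A)           # x/a, written P(X)
p_y = proportion(Y, A)           # y/a, written P(Y)
p_x_given_y = proportion(X, Y)   # z/y, written P(X|Y)
p_y_given_x = proportion(Y, X)   # z/x, written P(Y|X)

print(p_y_given_x)               # 7/30, by direct counting
print(p_x_given_y * p_y / p_x)   # 7/30, via the formula
```

Counting directly and applying the formula give the same fraction, as they must, since the formula is only a rearrangement of the same four cardinal numbers.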
This is the version of Bayes’ Theorem that is most familiar to students of statistics. There’s no doubt that this formula is very useful to statisticians, and to anyone who has to think about numerical probabilities. But more has been claimed for it — much more than is warranted, or so I will argue in future posts.
Conclusion
What is the relevance of all this? I hope I have convinced you that there is nothing mysterious or magical about Bayes’ Theorem — it can be derived from some rather elementary algebra, which can be applied to any sort of quantity, real or complex, positive or negative, as well as to the cardinal numbers of sets, as long as three of the four relevant quantities involved are non-zero.
The idea that Bayes’ Theorem captures the essence of rationality, or anything like that, is inspired by bad, old-fashioned epistemology. According to the tradition of Plato and Descartes, knowledge is justified true belief, and the main epistemic challenge we face is to seek and achieve justification. In Descartes’ time, the ideal of justification was certainty — i.e. “total” justification. Nowadays, with the development of an arithmetical treatment of probability, the ideal instead is supposed to involve partial but numerically measurable justification, often referred to in terms of “degrees of belief” — i.e. supposed measures of credibility or assurance. According to this very common way of thinking, probability is understood epistemically, as “how much we are entitled to believe” that a claim is true, rather than non-epistemically, as I have characterized it, in terms of relative frequencies. Inquiry is typified as a decision-procedure in which “what we go on to believe” is decided by “what we believe already”.
The way of thinking that I oppose takes belief to be one-dimensional (as it must be if it is to be “measured” on any sort of numerical scale) and inquiry to be one-directional.
Quine remarked that modal logic was “conceived in sin” (the sin of confusing use and mention). I think the application of Bayes’ Theorem to measures of credibility is similarly conceived in sin: the sin of confusing the subject matter of beliefs (what they are about) with the manner in which beliefs are held (the “degree” of their strength or credibility). Almost invariably, its defenders choose as “typical examples of belief” those that occur in gaming. To put it in the starkest terms, they treat the belief that “one in six throws of a pair of dice result in doubles” as a belief held to a degree of one sixth that “any given throw will result in doubles”. As I mentioned above, our linguistic habit of using singular terms to refer to pluralities aids and abets us in this sin.
In a succession of posts to follow this one, I will defend an alternative understanding of knowledge and belief, which I call contextual reliabilism, in which justification — or at least justification as traditionally understood — plays a very minor role. It rejects numerical measures of how much anything ought to be believed, and indeed of how much anything actually is believed. In simple terms, my view is that at any given time a given agent either does believe something or does not believe it, and that what is or is not believed is either true or false. A belief might be more or less well-entrenched in an agent’s belief system, but that is a highly contextual matter which resists numerical treatment. I will argue that beliefs are not one-dimensional, as they would have to be to be measured on a scale, but “multi-dimensional”: each belief is embedded in the believer’s system as a node in a web (to use Quine’s famous metaphor). And inquiry is not a one-directional march forwards, but rather a chaotic dance between our theories and the various aspects of reality they purport to describe.