COVID-19 probabilistic thinking

Michael Tamillow
8 min read · Oct 20, 2020

Approaching COVID-19 from the perspective of actual science has been difficult, since in recent times the working definition of science looks like this:

“Science is what is done by Scientists, and Scientists are those who do Science.”

This definition has been universally agreed on by everyone who matters. So, let’s do some not science.

I am first going to present some numbers that are generally available at this time. We will take it as a given that our numbers are (reasonably) true. Then, the numbers will be magically transformed by mathematics and the power of probability! This will yield the truth behind the data, where we must first test against the possibility that coronavirus is not a cause of death. We will discuss the potential unreliability to the upside and the downside, and potential issues that create non-uniformity over time and subpopulations, yielding the need for serious analysis if Science is ever going to belong to the thinking mind again. A lot of this will use Fermi estimation techniques for the numbers, but the laws of probability are as close to fact as you will ever get. Math is the best known route to discover truth.

My numbers come from Google. Today is October 20, 2020, and I see 220,000 coronavirus deaths for the United States. What this actually means is that a person was diagnosed with COVID and that person died. The CDC has some data on positive tests vs. total reported tests. The ratio is what matters most, because it gives us a probability estimate. The numbers are positive tests: 6,873,739 and total tests: 79,611,982. The ratio is 0.08634. The probability of having coronavirus for a randomly tested person is thus:

P(C) = 6,873,739/79,611,982 = 0.08634.

If you were to go get a coronavirus test, you would have an 8.634% chance of testing positive. People may say there is self-selection bias, but with almost 80 million tests performed, massive self-selection seems unlikely. The prevailing sentiment of the pandemic has reinforced random testing. This should be considered a reliable estimate.

The next number, and a very important one, is the number of deaths in the United States each year. In 2017, there were 2,813,503. In 2018, there were 2,839,205. The year-over-year changes are also reported: 69,255 and 25,702. The year 2020 is two years on from 2018, so to keep it simple, let’s just add 94,957 (the sum of those two changes) to the 2018 number. This gives us an expected count of 2,934,162 people who should die this year in the US. This makes sense. The population of the US right now is 328,200,000 (Google it). If you divide the population size by a lifespan of 100 years, you get 3,282,000 deaths per year. You can imagine the theoretical world where everyone lives exactly 100 years and for every person who dies one is born. Because our true population is growing, the number of actual deaths is less than in the theoretical example. We know that the probability of dying is almost entirely correlated with age. But let’s imagine randomly sampling someone and asking what the probability of death is. But wait, that is over a whole year! We only want the seven months over which the epidemic has been happening. That number is going to be (7/12) times the yearly probability of death. So the probability of death (of any cause, during the epidemic) is:

Expected number of deaths: (7/12) * 2,934,162 = 1,711,595

P(D) = 1,711,595 / 328,200,000 = 0.005215
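For anyone who wants to check the arithmetic, here is a minimal Python sketch of the steps above. It uses only the figures quoted in this article; the variable names are mine, and the "add both yearly increases" step is the same rough shortcut used in the text, not a real forecast.

```python
# Figures quoted in the article (US, as of October 20, 2020).
deaths_2017 = 2_813_503
deaths_2018 = 2_839_205
yoy_changes = (69_255, 25_702)  # reported year-over-year increases
population = 328_200_000
epidemic_months = 7

# Sanity check: the second reported change matches 2018 minus 2017.
assert deaths_2018 - deaths_2017 == yoy_changes[1]

# Article's shortcut: add both reported increases to the 2018 count.
expected_deaths_2020 = deaths_2018 + sum(yoy_changes)

# Scale the yearly figure down to the seven-month window.
expected_deaths_window = expected_deaths_2020 * epidemic_months / 12

# Probability that a randomly chosen person dies (of any cause) in the window.
p_d = expected_deaths_window / population

print(f"expected deaths in 2020:   {expected_deaths_2020:,}")
print(f"expected deaths in window: {expected_deaths_window:,.1f}")
print(f"P(D) = {p_d:.6f}")
```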

There is a long understood and well-defined principle of probability called independence. Independence means that two events are not related. Whenever we have the joint probability of two events, we want to know how likely it is that the events would occur independently from one another. Independence is defined as P(A & B) = P(A)P(B), whereas dependent events are defined as P(A & B) = P(A)P(B|A), where the | sign means “given”, but you can literally think of it as “dependent upon”.

Talking about cause of death in general is a rather difficult thing. Let’s take a relatively straightforward death. Someone gets in a car accident. A lot of people get in car accidents and don’t die. So, the cause is not obviously “a car accident”. Which part of the car accident caused the death? Was it the speed of the cars? Was it the angle of collision? Could we label it as “caused by drunk driving”? A lot of the issues around this are obscure, so analyzing the data requires causal inference. We know that COVID deaths mean “dying with COVID”, so that means independence is merely the probability of dying and the probability of testing positive. That probability is:

P(C & D) = P(C)P(D) = 0.00045027

Therefore, if COVID were not a cause of death, we would expect a random person in the past 7 months to have a 0.045027% chance of both dying and testing positive. The number of such people we would expect in a population the size of the US is:

Expected indep. death with COVID: 328,200,000 * 0.00045027 = 147,779
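The independence calculation can be sketched in a few lines of Python. This is only an illustration with the article's own inputs; the variable names are mine. Because it keeps full precision instead of rounding each intermediate value, the final count can differ from the figure above by a person or so.

```python
# Inputs quoted in the article.
positive_tests = 6_873_739
total_tests = 79_611_982
population = 328_200_000
expected_deaths_window = 1_711_595  # rounded seven-month all-cause figure

# P(C): chance a randomly tested person is positive.
p_c = positive_tests / total_tests

# P(D): chance a random person dies of any cause in the window.
p_d = expected_deaths_window / population

# Under independence, P(C & D) = P(C) * P(D).
p_joint = p_c * p_d

# Expected number of people who both die and test positive,
# if a positive test had nothing to do with dying.
expected_joint = population * p_joint

print(f"P(C)     = {p_c:.5f}")
print(f"P(D)     = {p_d:.6f}")
print(f"P(C & D) = {p_joint:.8f}")
print(f"expected deaths with COVID under independence: {expected_joint:,.0f}")
```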

We are clearly in excess of that number. Using a standard score (z-score), we can find the statistical significance of this. First we should calculate the standard deviation, treating the count as binomial:

stdev: sqrt(328,200,000 * 0.00045027 * (1 - 0.00045027)) = 384

The standard score is the difference between 220,000 and 147,779 (= 72,221) divided by 384, or roughly 188. It is fair to say this is statistically significant (a z-score above 3 corresponds to better than 99% confidence, although modeling our inputs as random variables would vastly change the uncertainty captured by this distribution), and barring complexity issues we can say we believe coronavirus has effectively killed people. By this simple model, roughly 72,221 excess deaths are due to coronavirus, or 32.83% of the reported number. Let’s look at the probability of death given you have a positive test. Using Bayes’ theorem:
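Here is a small Python sketch of the significance check, assuming (as the text does) that the count of people who both die and test positive is binomial with n equal to the population and p equal to the independence probability. The variable names are mine, and the inputs are the article's rounded figures.

```python
import math

population = 328_200_000
p_joint = 0.00045027  # P(C & D) under independence, from the article
observed = 220_000    # reported deaths with a COVID diagnosis

# Mean and standard deviation of a binomial(n, p) count.
expected = population * p_joint
stdev = math.sqrt(population * p_joint * (1 - p_joint))

# Standard score: how many standard deviations the observed count
# sits above what independence alone would predict.
z = (observed - expected) / stdev

# Excess deaths, and their share of the reported number.
excess = observed - expected
share = excess / observed

print(f"expected = {expected:,.0f}, stdev = {stdev:.1f}")
print(f"z-score  = {z:.1f}")
print(f"excess   = {excess:,.0f} ({share:.2%} of reported)")
```

Note that this treats P(C) and P(D) as exactly known. Any uncertainty in those inputs would widen the distribution considerably, which is the caveat in the parenthesis above.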

P(D|C) = (P(C|D) * P(D)) / P(C)

We know P(C) = 0.08634 and P(D) = 0.005215 and we can get the probability of coronavirus given you are dead. For this estimate, we need to take the number of deaths by COVID, and divide by the total number of expected deaths during the same time period. We need to remember to add on the excess deaths to the number of expected deaths. In our case that won’t have a huge effect, but if we were faced with a real severe epidemic, then the result of not adding that back could be a probability in excess of 1.0, which is not possible. (220,000 / (1,711,595 + 72,221)) = 0.12333. Now we can get our answer to the question “if you, a random person, (randomly) test positive, what are the chances you will die”:

P(D|C) = (0.12333 * 0.005215) / 0.08634 = 0.007449
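The Bayes step can be reproduced with the sketch below. It plugs in the rounded values from the article; the variable names are mine and not from any official source.

```python
# Rounded inputs from the article.
p_c = 0.08634                       # P(C): positive test rate
p_d = 0.005215                      # P(D): all-cause death probability, 7 months
covid_deaths = 220_000              # reported deaths with a COVID diagnosis
expected_deaths_window = 1_711_595  # expected all-cause deaths, 7 months
excess_deaths = 72_221              # excess over the independence estimate

# P(C|D): share of all deaths in the window that carried a COVID diagnosis.
# Excess deaths are added to the denominator so the ratio cannot exceed 1.
p_c_given_d = covid_deaths / (expected_deaths_window + excess_deaths)

# Bayes' theorem: P(D|C) = P(C|D) * P(D) / P(C).
p_d_given_c = p_c_given_d * p_d / p_c

print(f"P(C|D) = {p_c_given_d:.5f}")
print(f"P(D|C) = {p_d_given_c:.6f}")
```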

So you had roughly a 0.7449% chance of dying if you were a random person diagnosed with COVID in the past 7 months. But wait! You had roughly a 0.5215% chance of dying if you were a random person at all, whether or not you had a COVID diagnosis. The difference between these is:

COVID excess death prob = 0.007449 - 0.005215 = 0.002234

Or a 0.2234% chance of dying in excess of the independent assumptions. This is consistent with the CDC’s infection fatality rates, so it is probably a good guess as to how they derived it. Proportionally, the chances of dying increase by:

Death rate increase = 0.002234 / 0.005215 = 0.4284

or 42.84%. Though this seems like a lot, we can make these probabilities equal in time by just extending the 7 month period out by 42.84%. So your chances of dying of COVID, if you test positive for having it in the 7 month period, are roughly the same as your chances of dying in a 10 month period without an epidemic. Personally, I find this neither shocking nor frightening.
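The "10 month" comparison follows from stretching the 7 month window by the relative increase. A quick sketch with the article's rounded numbers (variable names are mine), assuming the baseline probability of death scales linearly with the length of the window:

```python
p_d_given_c = 0.007449  # P(D|C) from the Bayes step
p_d = 0.005215          # baseline P(D) over the 7 month window
months = 7

# Absolute and relative excess over the baseline.
excess_prob = p_d_given_c - p_d
relative_increase = excess_prob / p_d

# Length of a no-epidemic window with the same probability of death,
# assuming the baseline probability grows linearly with time.
equivalent_months = months * (1 + relative_increase)

print(f"excess probability = {excess_prob:.6f}")
print(f"relative increase  = {relative_increase:.2%}")
print(f"equivalent window  = {equivalent_months:.1f} months")
```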

Two facts to consider in discussion:

  1. We are all eventually going to die (despite silly theories to the contrary)
  2. The probability of dying differs vastly across demographics (mainly age)

The first point should be vitally clear in its implications. The probability of death, on average over time and population, is independent of all causes. Even considering that we have “eliminated all causes of death”, we would almost certainly not have eliminated death. Good luck trying. You will neither be the first who has tried, nor the only one with the ridiculous belief. Death does not require a cause per se, since it occurs in everyone, irrespective of causes. It is literally hardcoded into our DNA more deeply than any other facet of life itself.

However, what we have seen in terms of reporting data has been the philosophy of “A life is a life”: all deaths accompanied by a positive test are reported as equals. This number, 220,000, would be under-reported if not all people who die are tested or diagnosed, and over-reported if dying people are diagnosed when they would not have tested positive. It is likely both of these are true at the same time, since data is typically incredibly unreliable. If I were to guess where the incentive lies, I would say over-reporting. The policy decisions were made unilaterally, before the data was collected, which might corrupt the data collection process and inflate the estimates used here. In particular, the probability of death in the 7 month period following the initial lockdowns should be expected to rise. However, this is the data we have, and for this reason we accept it as true. Furthermore, the fact that it fits the scale of a minor epidemic, which is something we experience with regularity every few years, moves me away from large-scale conspiratorial data corruption. It is much more clearly a lack of understanding, by nearly everyone, of the information contained within the data.

If we investigate the data even closer, we see through late March, April, and May, the numbers are much higher. This could mean there was actually greater effect of coronavirus positive tests and deaths early, and fewer now. Or it could mean the probability of a random person dying spiked during a mass firing and lockdown, which would be consistent with most rational expectations. This probability of a random person dying would need to be 42.84% greater due to lockdowns since P(D) = P(D|C) if these conditions are independent. If we are able to access detailed mortality data for the year of 2020, this should become more transparent and testable.

The independence assumption should constantly be rechecked with statistical significance tests over the recent rolling aggregates of positive testing rate and number of deaths. Furthermore, the independence hypothesis should be heavily tested over demographics. It is very likely that for many subpopulations, the data demonstrates that coronavirus and death are independent variables. All of this could be done with samples rather than trying to get population parameters, and may even be more accurate and unbiased in this respect. We hear a lot of “second wave” talk with no one seeming to understand that independence means people will certainly die and test positive but that does not mean they are dying of COVID. The general ignorance of this principle has perpetuated poor decision making based on a misunderstanding of data. Nearly every introductory statistics class has taught what the null hypothesis means, but these fail to adequately teach how mathematics, probability, and statistics are interconnected in application.
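As a rough illustration of what such a recheck could look like, here is a hypothetical helper (`independence_check` is my own name, not an existing library function) that wraps the same binomial z-score used earlier so it can be applied to any subpopulation or rolling window. I only have national figures, so the usage line reuses those; the per-demographic inputs would have to come from real data.

```python
import math
from typing import NamedTuple


class IndependenceCheck(NamedTuple):
    expected: float  # deaths with a positive test expected under independence
    z_score: float   # standard score of the observed count


def independence_check(population: int, positive_rate: float,
                       death_prob: float, observed: int) -> IndependenceCheck:
    """Compare observed deaths-with-positive-test against independence.

    Hypothetical sketch: treats the count as binomial(n=population,
    p=positive_rate * death_prob) and ignores uncertainty in the inputs.
    """
    p_joint = positive_rate * death_prob
    expected = population * p_joint
    stdev = math.sqrt(population * p_joint * (1 - p_joint))
    return IndependenceCheck(expected, (observed - expected) / stdev)


# Usage with the national figures from this article.
national = independence_check(328_200_000, 0.08634, 0.005215, 220_000)
print(f"expected = {national.expected:,.0f}, z = {national.z_score:.1f}")
```

A subgroup whose z-score stays small (say, under 3) would be one where the data does not reject independence.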

So, all causes of death should be checked for independence against probabilities of dying. When a new technology arises that raises the probability of death amongst a subgroup, or a chemical can be shown to increase the probability of cancer, and therefore death, we can begin to declare dependence. We often intuitively look for mechanisms from sparse evidence, and in this case the strength of our evidence is very important. When we take a large number of people with the same weak evidence and claim it is strong because of consistency in belief, we get groupthink. When people invest and sacrifice in beliefs that later turn out to have overwhelming evidence against them, and still refuse to modify them, we see belief perseverance. And when people are unable or unwilling to accept or comprehend the data and the mathematical analysis demonstrating its relevance in context, and imagine only the anecdotal situation so easily displayed in pictures, we are left with scope insensitivity. It appears that these three factors have driven a perfect storm, and confessing it was all for naught is an unacceptable outcome now, but clearly it was.
