There is a concern that too many scientific studies fail to replicate as often as expected, which suggests that a high proportion of them may be invalid. The blame is often put on confusion surrounding the ‘P value’, which is used to assess the effect of chance on scientific observations. A ‘P value’ is calculated by first assuming that the ‘true result’ is disappointing (e.g. that the outcome of giving a treatment and a placebo would be exactly the same, based on an ideally large number of patients). This disappointing true result is called a ‘null hypothesis’. A ‘P value’ of 0.025 means that if the ‘null hypothesis’ were true, there would be only a 2.5% chance of getting the observed difference between treatment and placebo, or an even greater difference, in an actual study based on a smaller number of patients. This clumsy concept does not tell us the probability of getting a ‘true’ difference in an idealized study, based on the result of a real study.

Because it is based on a random sampling model, a ‘P value’ implies that the probability of a treatment being truly better in a large idealized study is very near to ‘1 – P’, *provided* that it is calculated using a symmetrical (e.g. Gaussian) distribution, that the study is described accurately enough for someone else to repeat it in exactly the same way, that the study is performed with no hidden biases, and that there are no other study results that contradict it. It should also be borne in mind that ‘truly better’ in this context includes differences only just greater than ‘no difference’, so ‘truly better’ may not necessarily mean a big difference. However, if the above conditions of accuracy and so on are not met, then the probability of the treatment being truly better than placebo in an idealized study will be lower (i.e. it will range from an upper limit of ‘1 – P’ [e.g. 1 – 0.025 = 0.975] down to zero). This is so because the possible outcomes of a very large number of random samples are always equally probable, this being a special property of the random sampling process. I will explain.

Figure 1 represents a large population divided into two mutually exclusive subgroups. One contains people with ‘appendicitis’, numbering 80M + 20M = 100M; the other contains people with ‘no appendicitis’, numbering 120M + 180M = 300M. Now, say that a single computer file contains all the records of *only one* of these groups and we have to guess which group it holds. To help us, we are told that 80M/(80M + 20M) = 80% of those with appendicitis have right lower quadrant (RLQ) pain and that 120M/(120M + 180M) = 40% of those without appendicitis have RLQ pain, as shown in figure 1. To find out which group’s records are in the computer file, we could perform an ‘idealized’ study. This would involve selecting an individual patient’s record at random from the unknown group and looking to see whether that person had RLQ pain or not. If the person had RLQ pain, we could write ‘RLQ pain’ on a card and put it into a box. We could repeat this process an ideally large number (N) of times (e.g. thousands).

If we had been selecting from the group of people with appendicitis, then we would get the result in Box A, where 80N/100N = 80% of the cards had ‘RLQ pain’ written on them. However, if we had been selecting from people without appendicitis, we would get the result in Box B, with 120N/300N = 40% of the cards bearing ‘RLQ pain’. We would then be able to tell immediately from which group of people we had been selecting. Note that random sampling only ‘sees’ the *proportion* with RLQ pain in each group (i.e. either 80% or 40%). It is immaterial that the size of the group with appendicitis in figure 1 (100M) is different from that of the group without appendicitis (300M).
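This point — that random sampling ‘sees’ only the proportion in each group, never the group’s size — can be sketched in a short simulation. (This is a hypothetical illustration, not from the blog; the group sizes are scaled down from 100M and 300M to 100 and 300 records.)

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

def sample_rlq_fraction(group_size, rlq_fraction, n_draws):
    """Draw n_draws records at random (with replacement) from one
    group and return the observed fraction with 'RLQ pain'."""
    group = ([True] * round(group_size * rlq_fraction)
             + [False] * round(group_size * (1 - rlq_fraction)))
    return sum(random.choice(group) for _ in range(n_draws)) / n_draws

# Appendicitis group: 100 records, 80% with RLQ pain.
# No-appendicitis group: 300 records, 40% with RLQ pain.
frac_a = sample_rlq_fraction(100, 0.8, 100_000)
frac_b = sample_rlq_fraction(300, 0.4, 100_000)

# The observed fractions track 0.8 and 0.4, despite the threefold
# difference in group size.
print(round(frac_a, 2), round(frac_b, 2))
```

Tripling or halving `group_size` leaves the result unchanged; only `rlq_fraction` matters to the sampling process.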

The current confusion about ‘P values’ arises because this fact is overlooked and it is wrongly assumed that a difference in the sizes of the source populations affects the sampling process. A scientist would be interested in the possible long-term outcome of an idealized study (in this case the possible contents of the two boxes A and B), not in the various proportions in the unknown source population.

Making a large number ‘N’ of random selections would represent an idealized study. In practice we cannot do such idealized studies but have to make do with a smaller number of observations. For example, we would have to try to predict from which of the possible boxes of N cards, representing ideal study outcomes, a smaller sample had been selected. If we selected 24 cards at random from the box of cards drawn from the computer file containing details of the unknown population and found that, by chance, 15 had ‘RLQ pain’, we can work out the probability (from the binomial distribution, e.g. with n = 24, r = 15 and p = 0.8) of getting exactly 15/24 from each possible box, A and B. From Box A it would be 0.023554 and from Box B it would be 0.0141483. The proportions in boxes A and B are not affected by the numbers with and without appendicitis in the source population, so the two boxes were equally probable before the random selections were made. This allows us to work out the probability that the computer file contained the records of patients with appendicitis by dividing 0.023554 by (0.023554 + 0.0141483), giving 0.6247. The probability of the computer file containing the ‘no appendicitis’ group would thus be 1 – 0.6247 = 0.3753.
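These figures can be checked with a few lines of code — a sketch using only Python’s standard library:

```python
from math import comb

n, r = 24, 15  # 15 of 24 randomly selected cards showed 'RLQ pain'

def binom_pmf(n, r, p):
    """Probability of exactly r 'RLQ pain' cards in n draws when
    the true proportion in the box is p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

like_a = binom_pmf(n, r, 0.8)  # Box A (appendicitis): 0.023554
like_b = binom_pmf(n, r, 0.4)  # Box B (no appendicitis): 0.0141483

# The two boxes were equally probable before sampling, so the
# probability that the file holds the appendicitis records is:
post_a = like_a / (like_a + like_b)
print(round(post_a, 4))  # 0.6247; hence 1 - 0.6247 = 0.3753 for Box B
```

Dividing one likelihood by the sum of both is legitimate here precisely because the equal prior probabilities of the two boxes cancel out.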

It does not matter how many possible idealized study results we have to consider; they will always be equally probable, because each possible idealized random-selection study result is unaffected by differences in the sizes of the source populations. So, if a ‘P value’ is 0.025 based on a symmetrical (e.g. Gaussian) distribution, the probability of the treatment being better than placebo will be 1 – P = 0.975, or less if there are inaccuracies, biases, or other very similar studies that give contrary results, etc. These factors will have to be taken into account in most cases.
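As a minimal numerical sketch of the symmetrical case (a hypothetical trial, assuming a Gaussian likelihood centred on the observed difference and the accuracy conditions above):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Suppose the observed treatment-placebo difference sits 1.96
# standard errors above zero.
z = 1.96
p_value = 1 - norm_cdf(z)   # one-sided P value, about 0.025
prob_better = norm_cdf(z)   # = 1 - P, thanks to the symmetry
print(round(p_value, 3), round(prob_better, 3))  # 0.025 0.975
```

The second number is the upper limit: hidden biases or contrary studies would pull the probability of a true benefit below 0.975.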

*Featured image credit: Edited STATS1_P-VALUE originally by fickleandfreckled. CC BY 2.0 via Flickr.*

“This clumsy concept [P value] does not tell us the probability of getting a ‘true’ difference (…)”

“So, if a ‘P value’ is 0.025 based on a ‘normal’ or Gaussian distribution, the probability of a treatment being better than placebo will be 1 – P.”

A contradiction?

Thank you for your comment. The ‘clumsy concept’ is the ‘actual observation, or some other more extreme hypothetical observation that was not seen’. This does not immediately tell us the probability of a ‘true difference’ in an RCT. The P value itself is only a probability; it is not the ‘clumsy’ part of the concept. The inverse probability of the null hypothesis, or something more extreme, conditional on the observed result is only equal to P provided that the likelihood distribution is symmetrical (e.g. by fitting a Gaussian or other symmetrical distribution to the data). If a non-symmetrical (e.g. binomial) distribution is fitted, then there will be a fixed relationship between the ‘P value’ and the inverse probability of the null hypothesis or something more extreme, but they will not be exactly the same. So the ‘P value’ does not automatically tell us that the probability of a true difference during an RCT is exactly ‘1 – P’; this happens only under special conditions.
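This fixed-but-unequal relationship for a binomial likelihood can be shown numerically. (A hypothetical example, not from the post: n = 24 draws with r = 15 ‘successes’ and a null proportion p0 = 0.5 are chosen for illustration, and the inverse probability is computed under a uniform prior, using the fact that the Beta posterior’s CDF at p0 equals a binomial tail sum when its parameters are integers.)

```python
from math import comb

n, r, p0 = 24, 15, 0.5  # illustrative numbers only

# One-sided P value: chance of r or more successes in n trials
# if the null proportion p0 were true.
p_value = sum(comb(n, j) * p0**j * (1 - p0)**(n - j)
              for j in range(r, n + 1))

# Inverse probability of the null or something more extreme,
# P(p <= p0 | data), under a uniform prior: the posterior is
# Beta(r + 1, n - r + 1), whose CDF at p0 equals a binomial
# tail sum over n + 1 trials.
inv_prob = sum(comb(n + 1, j) * p0**j * (1 - p0)**(n + 1 - j)
               for j in range(r + 1, n + 2))

print(round(p_value, 4), round(inv_prob, 4))  # 0.1537 vs 0.1148
```

The two quantities move together (both are regularized incomplete beta functions at p0, with parameters differing by one) but are not equal, unlike in the symmetrical Gaussian case.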

Contrary to the above, it has been widely asserted that it is not possible to calculate the inverse probability of a parameter conditional on the observed data without making unverifiable assumptions about the prior probability distribution of the possible true results. ‘Frequentist’ statisticians were not prepared to make such assumptions and restricted themselves to the likelihoods of data conditional on hypothetical parameters; Bayesians, however, were prepared to make them in order to arrive at a posterior probability. What this blog points out is that in the special case of random sampling, it is possible to calculate the inverse probability of a parameter conditional on the data. A frequentist can incorporate any ‘prior’ data by doing a meta-analysis. A Bayesian can get the same result by using the data’s prior probability distribution (which is equal to the likelihood distribution of any prior data, since the prior probabilities of random sampling outcomes are uniform).