There is concern that too many scientific studies fail to replicate when they are repeated. This means that a high proportion of published findings is suspected of being invalid. The blame is often put on confusion surrounding the ‘P value’, which is used to assess the effect of chance on scientific observations. A ‘P value’ is calculated by first assuming that the ‘true result’ is disappointing (e.g. that the outcomes of giving a treatment and a placebo would be exactly the same in a study based on an ideally large number of patients). This disappointing true result is called a ‘null hypothesis’. A ‘P value’ of 0.025 means that if the ‘null hypothesis’ were true, there would be only a 2.5% chance of getting the observed difference between treatment and placebo, or an even greater difference, in an actual study based on a smaller number of patients. This clumsy concept does not tell us what we want to know: the probability of getting a ‘true’ difference in an idealized study, given the result of a real study.
Because it is based on a random sampling model, a ‘P value’ implies that the probability of a treatment being truly better in a large idealized study is very near to ‘1 – P’, provided that four conditions hold: the ‘P value’ is calculated using the ‘normal’ or Gaussian distribution; the study is described accurately enough for someone else to repeat it in exactly the same way; the study is performed with no hidden biases; and there are no other study results that contradict it. It should also be borne in mind that ‘truly better’ in this context includes any difference just greater than ‘no difference’, so ‘truly better’ does not necessarily mean a big difference. If the above conditions are not met, then the probability of the treatment being truly better than placebo in an idealized study will be lower (i.e. it will range from an upper limit of ‘1 – P’ [e.g. 1 – 0.025 = 0.975] down to zero). This is so because the possible outcomes of a very large number of random selections are always equally probable before any selections are made, this being a special property of the random sampling process. I will explain.
Figure 1 represents a large population made up of two mutually exclusive subgroups. One contains people with ‘appendicitis’, numbering 80M + 20M = 100M; the other contains people with ‘no appendicitis’, numbering 120M + 180M = 300M. Now, say that a single computer file contains all the records of only one of these groups and we have to guess which group it holds. To help us, we are told that 80M/(80M+20M) = 80% of those with appendicitis have right lower quadrant (RLQ) pain and that 120M/(120M+180M) = 40% of those without appendicitis have RLQ pain, as shown in figure 1. To find out which group’s records are in the computer file, we could perform an ‘idealized’ study. This would involve selecting an individual patient’s record at random from the unknown group and looking to see whether that person had RLQ pain or not. If the person had RLQ pain we could write ‘RLQ pain’ on a card, otherwise ‘no RLQ pain’, and put the card into a box. We could repeat this process an ideally large number (N) of times (e.g. thousands).
If we had been selecting from the group of people with appendicitis then we would get the result in Box A where 80N/100N = 80% of the cards had ‘RLQ pain’ written on them. However, if we had been selecting from people without appendicitis, we would get the result in Box B, with 120N/300N = 40% of the cards bearing ‘RLQ pain’. We would then be able to tell immediately from which group of people we had been selecting. Note that random sampling only ‘sees’ the proportion with RLQ pain in each group (i.e. either 80% or 40%). It is immaterial that the size of the group of people in figure 1 with appendicitis (100M) is different to the group without appendicitis (300M).
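The point that random sampling only ‘sees’ the proportion with RLQ pain can be illustrated with a short simulation. This is a minimal sketch, not part of the original argument: the function name `rlq_proportion`, the choice of 100,000 draws to stand in for N, and the seeds are all assumptions made for illustration. It also assumes that each selection is independent, which is reasonable when the source group is very large.

```python
import random

def rlq_proportion(n_draws, p_rlq, seed):
    """Fraction of n_draws random selections that turn up 'RLQ pain'.

    Hypothetical helper: each selection is modelled as an independent
    draw that shows RLQ pain with probability p_rlq. The size of the
    source group does not appear anywhere in the calculation.
    """
    rng = random.Random(seed)
    hits = sum(rng.random() < p_rlq for _ in range(n_draws))
    return hits / n_draws

n_draws = 100_000  # assumed stand-in for the 'ideally large' N

# Proportions with RLQ pain taken from figure 1.
box_a = rlq_proportion(n_draws, 80 / 100, seed=1)   # appendicitis (100M people)
box_b = rlq_proportion(n_draws, 120 / 300, seed=2)  # no appendicitis (300M people)

print(f"Box A: {box_a:.3f}  Box B: {box_b:.3f}")
```

The two results should land close to 0.8 and 0.4 respectively, whatever the sizes of the two groups, because only the proportions enter the sampling process.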
The current confusion about ‘P values’ arises because this fact is overlooked and it is wrongly assumed that a difference in the sizes of the source populations affects the sampling process. A scientist would be interested in the possible long-term outcome of an idealized study (in this case the possible contents of the two boxes A and B), not in the various proportions in the unknown source population.
Making a large number (N) of random selections would represent an idealized study. In practice we cannot do such idealized studies but have to make do with a smaller number of observations. That is, we would have to try to predict from which of the possible boxes of N cards, each representing an ideal study outcome, we had selected a smaller sample. If we selected 24 cards at random from the box of cards drawn from the computer file containing details of the unknown population and found that 15 of them had ‘RLQ pain’, we could work out the probability of getting exactly 15/24 from each possible box by using the binomial distribution (with n = 24, r = 15, and p = 0.8 for Box A or p = 0.4 for Box B). From Box A it would be 0.023554 and from Box B it would be 0.0141483. The proportions in boxes A and B are not affected by the numbers with and without appendicitis in the source population, so the two possible boxes are equally probable before the random selections are made. This allows us to work out the probability that the computer file contained the records of patients with appendicitis by dividing 0.023554 by (0.023554 + 0.0141483), which gives 0.6247. The probability of the computer file containing the ‘no appendicitis’ group would thus be 1 – 0.6247 = 0.3753.
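The arithmetic in this example can be checked in a few lines. The sketch below is only an illustration; the function name `binomial_probability` and the variable names are my own, and it builds in the same assumption as the text, namely that boxes A and B are equally probable before any cards are selected.

```python
from math import comb

def binomial_probability(r, n, p):
    """Probability of exactly r 'RLQ pain' cards in n random selections
    from a box in which a proportion p of the cards say 'RLQ pain'."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, r = 24, 15  # 15 of the 24 selected cards had 'RLQ pain'

from_box_a = binomial_probability(r, n, 0.8)  # box drawn from appendicitis group
from_box_b = binomial_probability(r, n, 0.4)  # box drawn from no-appendicitis group

# Assumption carried over from the text: both boxes are equally probable
# beforehand, so the prior probabilities cancel out of the ratio.
prob_appendicitis = from_box_a / (from_box_a + from_box_b)
prob_no_appendicitis = 1 - prob_appendicitis

print(f"From Box A: {from_box_a:.6f}")
print(f"From Box B: {from_box_b:.7f}")
print(f"Appendicitis file: {prob_appendicitis:.4f}")
print(f"No-appendicitis file: {prob_no_appendicitis:.4f}")
```

Changing `n` and `r` shows how quickly the balance between the two boxes shifts as more cards are selected.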
It does not matter how many possible idealized study results we have to consider; they will always be equally probable. This is because each possible result of an idealized random selection study is unaffected by differences in the sizes of the source populations. So, if a ‘P value’ is 0.025 based on a ‘normal’ or Gaussian distribution, the probability of a treatment being better than placebo will be 1 – P = 0.975, or less if there are inaccuracies, biases, or other very similar studies that give contrary results. These factors will have to be taken into account in most cases.
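As a rough illustration of the final step, the sketch below computes a one-sided ‘P value’ from the Gaussian distribution and the corresponding upper limit ‘1 – P’. The standardized difference of 1.96 is a hypothetical figure chosen because it gives a ‘P value’ close to 0.025; it does not come from any real study, and the function name `one_sided_p_value` is my own.

```python
from math import erfc, sqrt

def one_sided_p_value(z):
    """Upper-tail area of the standard Gaussian distribution beyond z."""
    return 0.5 * erfc(z / sqrt(2))

# Hypothetical standardized difference between treatment and placebo
# (observed difference divided by its standard error).
z = 1.96

p_value = one_sided_p_value(z)

# Upper limit only: the probability falls below this if there are
# inaccuracies, biases, or contrary results from similar studies.
upper_limit = 1 - p_value

print(f"P value: {p_value:.4f}")
print(f"Upper limit for 'truly better': {upper_limit:.4f}")
```

The second figure is a ceiling, not an estimate; the sketch does nothing to account for the conditions of accuracy and absence of bias described above.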
Featured image credit: Edited STATS1_P-VALUE originally by fickleandfreckled. CC BY 2.0 via Flickr.