A much needed lesson in statistics (Corona blog part 1)

Dec 3, 2020

Over the past 9 months, I have looked with some surprise and sometimes indignation at the 'scientific basis' that is used for the measures surrounding the SARS-CoV-2 virus. Not that the Dutch science institute RIVM is not doing its job. It is. The problem is that both politicians and the media seem to have no idea how to deal effectively with the numbers produced by RIVM.

Even the Outbreak Management Team (OMT), where you would hope to find broad expertise, seems to consist almost exclusively of doctors and virologists who appear unable to interpret the numbers correctly and make wise decisions based on them. The matter has been taken completely out of context, and the composition of the OMT is clearly at least partly to blame for this.

Who am I to make such a bold statement about this? As a designer of anti-spam systems, I am professionally involved in statistics on a daily basis. As a university-educated engineer I have worked for years with both production statistics and lifetime statistics in the automotive sector. I am an expert in the field of measurement systems and the automated processing of data. I have designed several measurement systems and methods that are still used virtually unchanged in mass-production factories today, more than 15 years later. You can say that I know something about statistics. You can also say that the rape of numbers that I have seen in recent months affects me personally.

I have hardly spoken out about this in public in recent months, but enough is enough and it is past time to make a contribution. If only to feel that I have taken the trouble to provide those around me with the much needed context around the forest of terms, numbers and seemingly half-hearted measures that have been poured out upon us in recent months. I plan to write a series of blog posts to provide this context, of which this is the first. Hopefully it brings perspective to a completely overwrought situation and brings peace to an environment of fear that is unbecoming of a healthy democracy.

The relativity of absolute numbers

Measurement data never stands on its own. Without a reference it is worthless. When a scientist sees a number, he wonders how that number came about. What was the measurement method? How reliable is the measuring instrument? How is the experiment set up? What were the assumptions surrounding the experiment? What outside influences played a role? Am I looking at the raw data or has the data been processed and if so how? How can the measured data be compared with previous measurements and measurements to come? All these questions should be asked for measurement data and good scientific work also answers these questions.

Recently, in the COVID-19 crisis, it has become customary to publicize measurement data. In itself that is a nice development, but if this data is picked up by the media without the usual reference, a basis for confusion and misinformation arises because the data has been taken out of context.

A good example is the daily publication of the number of positive PCR tests (often incorrectly referred to as 'new infections') as an absolute number. That number in itself says nothing at all if it is not accompanied by the number of tests performed. If I run twice as many tests today as yesterday, then in a stable situation I should expect to find twice as many positive test results today. Conversely, if I run half as many tests, I expect the number of positive tests to halve. The number of positive PCR tests as an absolute value is therefore directly dependent on the number of tests performed.
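The scaling argument above can be checked with a few lines of Python. This is only a minimal sketch: the function names and all figures are made up for illustration and are not RIVM data. It assumes a stable situation in which the share of positive tests stays the same from one day to the next.

```python
def expected_positives(tests_performed: int, positivity_rate: float) -> float:
    """Expected number of positive tests if the share of positives is constant."""
    return tests_performed * positivity_rate


def positivity(positives: float, tests_performed: int) -> float:
    """Share of positive tests: positives divided by tests performed."""
    return positives / tests_performed


# Hypothetical figures, for illustration only (assumption, not measured data).
rate = 0.10                # assumed constant share of positive tests
tests_yesterday = 20_000
tests_today = 40_000       # twice as many tests as yesterday

yesterday = expected_positives(tests_yesterday, rate)
today = expected_positives(tests_today, rate)

# The absolute number doubles although nothing about the situation changed...
print("positives yesterday:", yesterday)
print("positives today:    ", today)

# ...while the relative number stays the same.
print("positivity yesterday:", positivity(yesterday, tests_yesterday))
print("positivity today:    ", positivity(today, tests_today))
```

With these assumed inputs the absolute count doubles purely because the number of tests doubled, while the share of positive tests is unchanged.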

To indicate how biased this can be, let's take a look at the following graph:

The graph above clearly shows what is going on: since week 44 (first week of November) the relative number of infections has been steadily decreasing. There is no leveling off, the vaunted effect of the autumn holiday is nowhere to be seen and there is certainly no 'stagnation in the decline in infections' as both the [media](https://www.hartvannederland.nl/nieuws/2020/no-explanation-for-increase-number-of-new-corona-infections/) and some members of the OMT allege. In other words, the development is exactly what you would expect from a wave of infections that has passed its peak, and anyone who claims otherwise based on the absolute numbers is, in my view, completely fooling themselves scientifically. That a man like Kuipers [is surprised](https://www.nu.nl/284813/video/kuipers-verbaasd-over-hoge-coronacijfers-geen-versoepelingen-met-kerst.html?redirect=1) about 'the high corona numbers' says more about his lack of statistical insight than about the behavior of the virus! Our media have sunk so low that this man is given a stage for this story without being challenged. And in this way, based on completely meaningless numbers, many people may have to miss out on Christmas.

The message is clear: Without the reference of the number of tests performed, comparing the current measurement data with that of the past becomes an impossible task. Yet our media continues to bombard the public with absolute numbers. This shows a highly distorted picture and increases anxiety.

The danger of percentages

Sadly, even relative numbers do not tell us very much. Since the percentage of positive tests is defined as the number of positive tests divided by the number of tests performed, it seems a fairer representation of the situation. And although a percentage is already a big improvement over an absolute number, you can still seriously question such a result.

This is because the group on which the tests are performed was not randomly selected: only people with symptoms may register for a test. And some of that group, which is already suspect in principle, cannot get tested because too little test capacity is available. In statistics we refer to this as a 'biased sample'. Working this way is almost a mortal sin in statistics, and any statistician can tell you about the dangers of doing so.

If you want to see how the virus spreads, you have to choose the tested group at random. You will have to choose people with and without symptoms, from different zip code areas and from both densely and sparsely populated areas. Under the current test conditions you can say at most what percentage of people with flu-like symptoms have a positive PCR test.
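To make the effect of a biased sample concrete, here is a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: the population size, the share of people with symptoms, and the infection probabilities are invented, and the test is assumed to be perfect (every infected person tests positive, nobody else does). It compares testing only people with symptoms against testing a random sample of the whole population.

```python
import random


def simulate(population_size=100_000, p_symptoms=0.05,
             p_infected_given_symptoms=0.15,
             p_infected_given_no_symptoms=0.01,
             sample_size=5_000, seed=42):
    """Return (true prevalence, positive share in a symptomatic-only sample,
    positive share in a random sample). All parameters are hypothetical."""
    rng = random.Random(seed)

    # Build a population of (has_symptoms, infected) pairs.
    population = []
    for _ in range(population_size):
        has_symptoms = rng.random() < p_symptoms
        p_infected = (p_infected_given_symptoms if has_symptoms
                      else p_infected_given_no_symptoms)
        infected = rng.random() < p_infected
        population.append((has_symptoms, infected))

    def share_infected(group):
        return sum(infected for _, infected in group) / len(group)

    # Biased sample: only people with symptoms are tested.
    symptomatic = [person for person in population if person[0]]
    biased_sample = rng.sample(symptomatic,
                               min(sample_size, len(symptomatic)))

    # Random sample: drawn from everyone, with or without symptoms.
    random_sample = rng.sample(population, sample_size)

    return (share_infected(population),
            share_infected(biased_sample),
            share_infected(random_sample))


true_prevalence, biased_estimate, random_estimate = simulate()
print(f"true prevalence:         {true_prevalence:.3f}")
print(f"symptomatic-only sample: {biased_estimate:.3f}")
print(f"random sample:           {random_estimate:.3f}")
```

Under these assumptions the symptomatic-only sample reports a positive share many times higher than the true prevalence, while the random sample lands close to it. The biased figure only describes people with symptoms, not the population as a whole.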

In summary, the results of the PCR test, as it is currently being performed, cannot in any way be related to the behavior of the virus and its spread. The test is possibly a useful tool for early detection and isolation of infected people, and even that is questionable, as I will explain in my next blog.

The PCR results provide no information on the actual spread of the virus. It is therefore quite pointless to report this to the Dutch people via the media, unless the intention is to keep the population in fear. For the latter it is of course an excellent means.

Death is the only reliable number

There is therefore only one option left for performing responsible statistical analyses, which is to use the CBS 'excess mortality' data. Of course, this also includes deceased people who did not have COVID-19, but with sufficient COVID-19 related deaths these are statistically negligible. Moreover, this data is not affected by changing test policy, crowding in hospitals, age-related exclusion and other biases in the test sample.
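For readers unfamiliar with the term: excess mortality is the difference between the number of deaths observed in a period and the number you would expect in that period based on earlier years. The Python sketch below illustrates the idea only. It uses a deliberately naive baseline (the average of the same week in earlier years), which is my simplification and not necessarily how CBS computes its figures, and all numbers are invented.

```python
from statistics import mean


def excess_mortality(observed, baseline_years):
    """Weekly excess deaths: observed deaths minus the expected deaths,
    where 'expected' is the mean of the same week in earlier years.

    observed:       list of weekly death counts for the current year
    baseline_years: list of lists, one list of weekly counts per earlier year
    """
    expected = [mean(same_week) for same_week in zip(*baseline_years)]
    return [obs - exp for obs, exp in zip(observed, expected)]


# Hypothetical weekly death counts (assumption, not CBS data).
baseline_years = [
    [3000, 3100, 2900],
    [3100, 3000, 3000],
    [2900, 3200, 3100],
]
observed = [3600, 3900, 3300]

excess = excess_mortality(observed, baseline_years)
print(excess)
```

With these invented inputs the expected deaths per week are 3000, 3100 and 3000, so the excess is 600, 800 and 300 deaths respectively. Note that nothing in this calculation depends on how many people were tested or who was selected for testing.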

However bitter it may be, death does not discriminate and cannot be influenced by politics, making this data the most reliable available for statistical analysis. That is why in the coming blogs I will always use the mortality data in my analyses.

Our RIVM must be aware of this and will undoubtedly also use this data. Unfortunately, the data has a major disadvantage: changes in the course of the virus only become visible many weeks after they happened, so the data is of limited use as an instrument for anticipating the spread of the virus at an early stage.

Difficult choice

This puts our government in a difficult dilemma: use data that is available early but is statistically unsound, or base its policies on data that gives an accurate picture but arrives too late to make adjustments. It seems logical to solve the biggest problem of the PCR test and to no longer apply the tests only to the group with symptoms. As mentioned, choosing test candidates at random across different postcode areas would remove the 'bias' from the test sample and yield statistically useful results at an early stage. However, there is a very good reason why that is not being done. I will elaborate on that in my next blog.

The role of the media

I would also like to address our media here: it is your task to provide correct and verified reporting to Dutch society. You fail miserably in this respect. You must do sufficient fact-checking and, if you are unable to do so, obtain the necessary knowledge from independent experts. Evidently, you are unable to portray the situation in a balanced way and you are responsible for spreading fear and uncertainty in our society. You behave irresponsibly, and on this subject you are no better than any random amateur blog or gossip paper that you love to think you stand above! I advise you not to publish absolute figures anymore and even to be careful with relative numbers. Dive into the matter before coming up with the next nonsense article!

Conclusion

It is telling that there is not even one statistician in the OMT, while in my opinion a statistician should be a permanent member. If I could give one piece of advice to our government it would be this: remove 3 doctors from the regular OMT and replace them with a statistician, a mathematician and a psychologist to get a more balanced approach!