COVID-19 insights through data

What can data and statistics reveal, and not reveal, about the COVID-19 disease?

11th of April 2020

A series on understanding the spread of the SARS-CoV-2 virus and the death rates of the disease it causes. Data, statistics and modelling.

What did the World Health Organisation forget to tell us about testing?

The World Health Organisation never made it clear why testing is of utmost importance. There are two purposes for extensive testing:

1. to diagnose patients with COVID-19 by using PCR tests, which detect whether the virus is present in a patient.

2. to learn how infectious the SARS-CoV-2 virus is and how deadly the COVID-19 disease is, by using serological antibody tests, which detect the presence of antibodies to the virus in an individual.

Let us make it clear that there is no reliable data-based evidence that the current global lockdown measures are necessary. When trading off lives lost due to lockdowns against lives lost due to health services overwhelmed by COVID-19, we should count lost lives on both sides of the equation. We might save many lives by locking down countries and slowing transmission, but on the other side of the equation we might lose a far larger number of lives in the aftermath of lockdowns. To find the right balance and maximise the number of lives saved on both sides of the equation, we need to complete the databases on COVID-19 by collecting the necessary data.
For data to guide us towards the best balance in these societal trade-offs during the epidemic, we need to answer two questions: how infectious is the SARS-CoV-2 virus, and how deadly is the COVID-19 disease?


To answer these two questions we need to test a representative sample of a population for SARS-CoV-2 antibodies or use some other design that enables accurate estimation of infection and case-fatality rates.
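How large such a representative sample would have to be can be gauged with a standard back-of-the-envelope formula. The sketch below assumes simple random sampling, a perfectly accurate antibody test and a normal approximation. The function name `required_sample_size` and the planning values are illustrative assumptions, not part of any study design mentioned here.

```python
import math


def required_sample_size(expected_prevalence, margin_of_error, z=1.96):
    """Approximate sample size for estimating a proportion.

    Uses n = z^2 * p * (1 - p) / e^2 (normal approximation, simple random
    sampling, no correction for test errors or finite population size).
    """
    if not 0 < expected_prevalence < 1:
        raise ValueError("expected_prevalence must be between 0 and 1")
    if margin_of_error <= 0:
        raise ValueError("margin_of_error must be positive")
    p = expected_prevalence
    n = z ** 2 * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n)


# Hypothetical planning values: the prevalence is unknown, so take the
# worst case p = 0.5 and aim for +/- 2 percentage points at roughly 95%.
sample_size = required_sample_size(expected_prevalence=0.5, margin_of_error=0.02)
print(f"People to test: about {sample_size}")
```

Under these assumed inputs the formula asks for roughly 2,400 randomly selected people. A real survey design would also have to account for imperfect tests and for people who decline to take part.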

Why are data insights on the COVID-19 epidemic biased?

For now, efforts have been directed at diagnosing, treating and isolating patients. Only those with COVID-19 symptoms were tested for the presence of the virus, so we know nothing about the people who were infected but experienced no symptoms. For example, evidence from the cruise ship Diamond Princess, a specific population in which everyone was tested and which can therefore serve as a benchmark, shows that 46% of all infected people did not experience any symptoms, not even a month after testing.

In the past weeks, researchers have developed models using different types of data to predict the potential extent of the COVID-19 spread, with some predicting a mild spread of the disease and others predicting a frightening extent of this pandemic. The scientific fact is that we do not know which model shows the true situation, because the data on COVID-19 are too incomplete to answer the research questions of interest reliably.


When trying to understand case-fatality rates, the data from the Diamond Princess cruise ship are a good, though not excellent, basis for thinking about possible true case-fatality rates. The population on board was 3 711, of which 19% tested positive for the virus. Of those who tested positive, 1.5% required intensive care or a respirator and 1.4% died (the measured case-fatality rate). Researchers from the London School of Hygiene and Tropical Medicine estimated the case-fatality rate for the population represented by the Diamond Princess sample to be between 0.38% and 3.6% (95% confidence interval). They then used this information to account for age-related biases when estimating case-fatality rates with data from China, and obtained an estimate between 0.2% and 1.3% (95% confidence interval).
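To see why a small number of deaths leaves so much uncertainty, the measured case-fatality rate can be paired with a simple binomial confidence interval. The sketch below uses the Wilson score interval, which is a swapped-in textbook technique and not the method the researchers used, so it will not reproduce their intervals. The counts are rounded from the percentages quoted above, which is an assumption of this sketch, and the helper name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Counts reconstructed from the percentages in the text (assumption:
# rounding to whole people; the exact counts are not given here).
population = 3711
positive = round(0.19 * population)  # people who tested positive
deaths = round(0.014 * positive)     # deaths among those who tested positive

naive_cfr = deaths / positive
low, high = wilson_interval(deaths, positive)
print(f"Measured case-fatality rate: {naive_cfr:.2%}")
print(f"Unadjusted 95% Wilson interval: {low:.2%} to {high:.2%}")
```

The interval is wide relative to the point estimate because it rests on only a handful of deaths, and it says nothing about how well cruise passengers represent any wider population.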

Although the above estimates apply only to specific settings (the Diamond Princess data and non-randomly sampled data from China), the scientific approach of this research is sound and its insights are informative, albeit with limitations (uncertainty) that cannot be quantified. This non-quantifiable uncertainty can nevertheless be understood by understanding the data analysis procedures that were used.

Based on the available data, our guess at the moment is that COVID-19 is as lethal as seasonal flu or more so, but seems to spread faster. More about guesses here: why simple math is misleading and advanced modelling not helpful.

Can we compare data on COVID-19 to seasonal flu data?

We know that the data on COVID-19 are incomplete and the resulting statistics biased. What we do not know is how biased they are. By comparing the most comparable statistics we can learn what we know and what we do not know about this epidemic. However, let us bear in mind a couple of important facts: (i) the COVID-19 data are not representative of populations, while the seasonal flu data are; (ii) the COVID-19 data cover only a four-month period, while the estimates for seasonal flu are yearly. Apart from that, both diseases are caused by a virus, with similarities in symptoms and in the course of the disease. What we do not know is whether there are also similarities in infection rates and case-fatality rates (how many deaths occur among those infected).

Below are graphs of the WHO data on COVID-19 cases and deaths for the 87 most affected countries, excluding Africa. These data are compared with population-based estimates for seasonal flu. Please read the disclaimer below before judging the graphs, which compare countries on the following two statistics:

1. (COVID-19-related deaths / population of a country) * 100, compared to the estimates of seasonal-flu-related deaths for the USA and the World

2. (COVID-19 positive tested cases / population of a country) * 100, compared to the population-based seasonal flu infection rates in the USA
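The two statistics above are plain percentages of a country's population. A minimal sketch of the calculation follows; the country record and all of its numbers are made up for illustration and do not come from the WHO data, and the function name `percent_of_population` is an assumption of this sketch.

```python
def percent_of_population(count, population):
    """Return count as a percentage of population: (count / population) * 100."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100


# Hypothetical country record; every number here is invented for illustration.
country = {
    "name": "Exampleland",
    "population": 10_000_000,
    "confirmed_cases": 12_000,
    "deaths": 300,
}

deaths_pct = percent_of_population(country["deaths"], country["population"])
cases_pct = percent_of_population(country["confirmed_cases"], country["population"])

print(f"{country['name']}: deaths are {deaths_pct:.4f}% of the population")
print(f"{country['name']}: positive tested cases are {cases_pct:.4f}% of the population")
```

Both ratios depend on how many people a country tests, so the same calculation can give very different values for countries with the same true spread of the virus.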

Disclaimer: unquantifiable uncertainty due to biases when comparing countries, and beyond

These countries are labelled as the most affected based on the data collected by the World Health Organisation (WHO) on COVID-19. Countries are using different testing protocols in terms of whom they test and how many people they test. The number of people tested is not proportional to the size of the population, which biases comparisons across countries (some countries are testing more than others). The amount of testing likely depends on resources and on the availability of tests. There are not enough data available to account for these biases.

It may be safe to assume that the EU countries are following the same testing protocols, so comparisons across the EU countries come with less uncertainty.

The number of deaths related to COVID-19 is likely under-reported; for example, not every person who died during that period was tested for the presence of the virus.

Seasonal flu, on the other hand, is a well studied and well understood topic, and its infection rate estimates are reliable. For the virus that causes COVID-19, no study has yet been done to estimate this statistic reliably.

A testing strategy that includes only those who experience symptoms of the disease, and excludes those who had no symptoms or only mild symptoms, does not give an accurate view of the state of this epidemic. The graphs below demonstrate a consequence of deriving information from such incomplete data.

Incomplete data produce biased insights. For example, measured infection rates for COVID-19 are only a fraction of the seasonal flu infection rates, whereas the number of COVID-19 deaths relative to the population of a country is, for some countries, much higher than for seasonal flu. Keep in mind that the seasonal flu ratio is a yearly estimate that is representative of a population, while the COVID-19 ratio covers only the last four months. If these data were considered complete, we would conclude that the virus is not very infectious but far more deadly than the flu. However, this is not true. These insights are biased because the available data are incomplete.
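The direction of this bias can be illustrated with simple arithmetic. The sketch below is a toy calculation, not a model of any real country: the population, the true infection rate, the true fatality rate and the share of symptomatic people who get tested are all hypothetical. Only the 46% share of symptom-free infections is taken from the Diamond Princess figure quoted earlier, and the sketch additionally assumes that every death is counted.

```python
def measured_vs_true(population, true_infection_rate, asymptomatic_share,
                     true_fatality_rate, symptomatic_tested_share):
    """Compare true rates with those measured when only some symptomatic
    infections are tested (toy assumption: every death is still counted)."""
    infected = population * true_infection_rate
    deaths = infected * true_fatality_rate
    # Only symptomatic people are tested, and only a share of them.
    confirmed = infected * (1 - asymptomatic_share) * symptomatic_tested_share
    return {
        "true_infection_rate": true_infection_rate,
        "measured_infection_rate": confirmed / population,
        "true_fatality_rate": true_fatality_rate,
        "measured_fatality_rate": deaths / confirmed,
    }


# Hypothetical inputs, chosen only to show the direction of the bias.
result = measured_vs_true(
    population=1_000_000,
    true_infection_rate=0.10,      # hypothetical
    asymptomatic_share=0.46,       # Diamond Princess figure from the text
    true_fatality_rate=0.005,      # hypothetical
    symptomatic_tested_share=0.20, # hypothetical
)

for name, value in result.items():
    print(f"{name}: {value:.2%}")
```

With these made-up inputs the measured infection rate comes out at roughly 1% against a true 10%, while the measured fatality rate is more than 4% against a true 0.5%. The virus looks less infectious and more deadly than it is in this toy setting, purely because of who gets tested.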

Based on all existing research on COVID-19, the most affected population group is the elderly, and a high number of deaths is also related to health-care systems being overwhelmed. However, the age structure does not differ much across the EU countries, although Italy is indeed the oldest European nation.

Another important fact to account for when putting the information in the graphs below into perspective is the severity of lockdown policies across countries. The well-known countries with the mildest (partial) lockdown policies are Sweden, the Netherlands, Switzerland and Singapore, with the latter being the most advanced in using data, science and technology to guide policies for controlling this epidemic.

Below is the list of all the countries included in the dataset from which the two graphs are derived. If you cannot find a country of your choice on the graphs, it is because its values are too small to be among the 38 highest values in the COVID-19 data. This does not mean that other countries are less affected. They might simply have fewer resources to test for the disease. As a result, the number of infected cases is lower and, consequently, so is the number of deaths related to COVID-19. More data are needed to evaluate this claim.

Albania, Algeria, Argentina, Armenia, Australia, Austria, Azerbaijan, Bangladesh, Belarus, Belgium, Bhutan, Bosnia and Herzegovina, Brazil, Brunei, Bulgaria, Cambodia, Canada, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czech, Denmark, Ecuador, Egypt, Estonia, Finland, France, Georgia, Germany, Greece, Hungary, Iceland, India, Indonesia, Iran, Iraq, Ireland, Israel, Italy, Japan, Kazakhstan, Korea, Kyrgyzstan, Latvia, Lebanon, Liechtenstein, Lithuania, Luxembourg, Malaysia, Maldives, Malta, Mexico, Moldova, Monaco, Mongolia, Montenegro, Nepal, Netherlands, New Zealand, North Macedonia, Norway, Panama, Peru, Philippines, Poland, Portugal, Romania, Russia, San Marino, Serbia, Singapore, Slovakia, Slovenia, Spain, Sri Lanka, Sweden, Switzerland, Thailand, Turkey, Ukraine, United Kingdom, USA, Uzbekistan, Vietnam.

‘Total’ represents all the cases in the world.

Use the above information to understand the uncertainties in the information shown by the two graphs below.


Warning: The graphs do not reflect the true state of the COVID-19 epidemic, but only the part for which data are available. More on the missing COVID-19 data can be found here: Data as a guide to balance societal trade-offs in COVID-19 epidemic.

Please read the above disclaimer to learn about the data used and the unquantifiable uncertainty. For questions and comments, get in touch via


Statistics for Understanding – Statistics for Reliable Solutions – Statistics for Helping

Tarastats Statistical Consultancy | Fredrikinkatu 61A, 00100 Helsinki, Finland

Tarastats Statistical Consultancy © 2020