40. Estimation of Corona cases with Python and Pandas

By Bernd Klein. Last modified: 26 Apr 2023.

On this page ➤

What is the real number of infected peoply by the corona virus. What we learn in the news is just the number of positively tested people. Unfortunately, it is technically not possible to test everybody with possible corona virus symptoms. For example we learned that on the 18th of March we had 12327 known cases in Germany. An extremely important question is "How many unknown cases do we have?".

Infection rate

This is what this page is about. I try to estimate this number or better to calculate some possible boundaries for this number.

I started doing this project for myself. I didn't intend to publish it. Everyday the new data on the corona virus cases rains down on me by radio and TV. Day by day and in many cases only comparisons with the previous day. So I started collecting the data on a day to day basis.

But I am sure that I made you curious. You want to know now this "dark" number, the number of total cases, known plus unknown? What we hear about the severity of the disease, i.e. the death rate is also just a calculation on the known cases. So we have the division of the number of people who died due to the virus and the number of the known cases. The death rate given and based on known cases varies between 2.5 % and 5 %. The higher the number of unknown cases is the smaller the real death rate will be. I assumed that the "real" death rate might be 0.7, like a seasonal flu. Be aware: This is just an assumption, not covered by any scientific data, but anyway quite likely. I further assume that the avarage duration of an infection is about 18 days, the total number of infections in Germany on the 18th of March is about 280000 cases compared to 12327 cases. This means about 23 times higher than the known figure!

If you like to follow my calculations based on Python and Pandas, you can continue reading here:

import pandas as pd

xf = pd.ExcelFile("/data1/coronacases.xlsx")
df = xf.parse("Sheet1", 
         skiprows=2, 
         index_col=0,
         names=["dates", "cases", "active", "deaths"])

df[["cases", "active", "deaths"]].plot()

OUTPUT:

data image

<matplotlib.axes._subplots.AxesSubplot at 0x7f75331394d0>

The number of cases we here on the media corresponds to the accumulated day by day newly infected people. The number of "active" cases is the number without those who are healed or unfortunately died. The following number shows the ratio of those. As long as nobody had died, the death ratio was 0 %.

discharged = df["cases"] - df["active"]

if "healed" in df.columns:
    df.drop("healed", axis=1, inplace=True)
if "healed" not in df.columns:
    df.insert(loc=len(df.columns),
              column="healed",
              value=(discharged -df["deaths"]))

healed_ratio = df["healed"] * 100 / discharged
healed_ratio
death_ratio = df["deaths"] * 100 / discharged
death_ratio

outcome = {"death_ratio": death_ratio, "healed_ratio": healed_ratio}
outcome_df = pd.DataFrame(outcome)
outcome_df
outcome_df.plot()

OUTPUT:

data image

<matplotlib.axes._subplots.AxesSubplot at 0x7f752e6a4c90>

In the following code, we do some curse fitting. We fit the value to the function

$$f = a \cdot e^{b \cdot x} + c$$

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def growth_func(x, a, b, c):
  return a * np.exp(b * x) + c
  
Y = df.cases.values
X = np.arange(0, len(Y))
popt, pcov = curve_fit(growth_func, X, Y)

def growth(x):
    return growth_func(x, *popt)

plt.figure()
plt.plot(X, Y, 'ko', label="Original Data")
plt.plot(X, growth(X), 'r-', label="Fitted Curve")
plt.legend()
plt.show()

#for i in range(1, 30):
#    print(df.cases.values[i], growth(i))

data image

In the following code, I make some crucial assumption. Both by rule of thumb and not properly based on scientific data, because it is not available. The average duration of an infection 'days_infected_before_outcome' might be less or higher. Furthermore I made a deliberate decision to assume the real death rate, based on known and unknown cases, is 0.7. This number corresponds to the seasonal flu.

days_infected_before_outcome = 14
assumed_real_death_rate = 2.9

def create_inverse_growth_func(a, b, c):
    def inverse(x):
        return np.log((x - c) / a) / b
    return inverse

inverse_growth = create_inverse_growth_func(*popt)

# number of cases 'days_infected_before_outcome':
cases_days_infection_before = df["deaths"][-1] * 100 / assumed_real_death_rate 
shift_days = inverse_growth(cases_days_infection_before)
shift_days -= (len(df) - days_infected_before_outcome)
print(shift_days)

OUTPUT:

5.765157139549551

plt.figure()
X = np.arange(0, 65)
plt.plot(X, 
         growth_func(X + days_infected_before_outcome, *popt), 
         'r-', 
         label="Fitted Curve")
plt.legend()
plt.show()

data image

print(len(df["cases"]))
x = np.arange(0, len(df["cases"]))
print(len(x))
if "real" in df.columns:
    df.drop("real", axis=1, inplace=True)
df.insert(loc=len(df.columns),
          column="real",
          value=growth(x+shift_days).astype(np.int))
df

OUTPUT:

	cases	active	deaths	healed	real
dates
2020-02-17	16	9	0	7	3659
2020-02-18	16	7	0	9	3659
2020-02-19	16	7	0	9	3659
2020-02-20	16	3	0	13	3659
2020-02-21	16	2	0	14	3659
2020-02-22	16	2	0	14	3659
2020-02-23	16	2	0	14	3659
2020-02-24	16	2	0	14	3659
2020-02-25	18	3	0	15	3659
2020-02-26	26	11	0	15	3659
2020-02-27	18	32	0	-14	3659
2020-02-28	74	58	0	16	3659
2020-02-29	79	63	0	16	3659
2020-03-01	130	114	0	16	3659
2020-03-02	165	149	0	16	3659
2020-03-03	203	187	0	16	3659
2020-03-04	262	246	0	16	3659
2020-03-05	545	528	0	17	3659
2020-03-06	670	652	0	18	3659
2020-03-07	800	782	0	18	3659
2020-03-08	1040	1022	0	18	3659
2020-03-09	1224	1204	2	18	3659
2020-03-10	1565	1545	2	18	3659
2020-03-11	1966	1938	3	25	3659
2020-03-12	2745	2714	6	25	3659
2020-03-13	3675	3621	8	46	3660
2020-03-14	4599	4544	8	47	3662
2020-03-15	5813	5754	13	46	3666
2020-03-16	7272	7188	17	67	3679
2020-03-17	9367	9274	26	67	3713
2020-03-18	12327	12194	28	105	3805
2020-03-19	15320	15161	44	115	4057
2020-03-20	19848	19668	59	121	4741
2020-03-21	22364	22071	84	209	6600
2020-03-22	24873	24513	94	266	11653
2020-03-23	29056	28480	123	453	25388
2020-03-24	32986	29586	157	3243	62724
2020-03-25	37323	33570	206	3547	164213
2020-03-26	43938	37998	267	5673	440091
2020-03-27	50871	43862	351	6658	1190003
2020-03-28	57695	48781	433	8481	3228476
2020-03-29	62095	52359	525	9211	8769619
2020-03-30	66711	52566	645	13500	23832009
2020-03-31	70985	54479	682	15824	64775829
2020-04-01	77981	58350	931	18700	176072669
2020-04-02	84794	61247	1107	22440	478608844
2020-04-03	91159	65309	1275	24575	1300987426
2020-04-04	96108	68262	1446	26400	3536444161
2020-04-05	100123	69839	1584	28700	9613045529
2020-04-06	103375	72865	1819	28691	26130960462

50 50

df[["cases", "real"]].plot()

OUTPUT:

data image

<matplotlib.axes._subplots.AxesSubplot at 0x7f752d00e890>

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

See: Live Python courses overview

Upcoming online Courses

Python for Engineers and Scientists

Data Analysis With Python

Enrol here