Synthea COVID-19 Module Analysis

This notebook provides and analysis of data generated by Synthea's COVID-19 module. Analysis is run on the CSV output from Synthea.

Code in this notebook depends on Pandas, NumPy, matplotlib and seaborn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime
In [2]:
%matplotlib inline
In [3]:
%config InlineBackend.figure_format = 'retina'
In [4]:
%load_ext autoreload
In [5]:
%autoreload 1

This jupyter notebook evolved over the creation of the Synthea COVID-19 Module. Initially, all python code was in the notebook. Eventually, this became unwieldy and code was moved into the imported file below.

In [6]:
%aimport analysis

Read in all of the data

In [7]:
conditions = pd.read_csv("conditions.csv")
patients = pd.read_csv("patients.csv")
observations = pd.read_csv("observations.csv")
care_plans = pd.read_csv("careplans.csv")
encounters = pd.read_csv("encounters.csv")
devices = pd.read_csv("devices.csv")
supplies = pd.read_csv('supplies.csv')
procedures = pd.read_csv("procedures.csv")
medications = pd.read_csv("medications.csv")

Grab the IDs of patients that have been diagnosed with COVID-19

In [8]:
covid_patient_ids = conditions[conditions.CODE == 840539006].PATIENT.unique()

This grabs every patient with a negative SARS-CoV-2 test. This will include patients who tested negative up front as well as patients that tested negative after leaving the hospital

In [9]:
negative_covid_patient_ids = observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()

Grabs IDs for all patients that died in the simulation. This will be more than just COVID-19 deaths.

In [10]:
deceased_patients = patients[patients.DEATHDATE.notna()].Id

Grabs IDs for patients that have completed the care plan for isolation at home.

In [11]:
completed_isolation_patients = care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT

Survivors are the union of those who have completed isolation at home or have a negative SARS-CoV-2 test.

In [12]:
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)

Grab IDs for patients with admission due to COVID-19

In [13]:
inpatient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT

The number of inpatient survivors

In [14]:
np.intersect1d(inpatient_ids, survivor_ids).shape
Out[14]:
(1522,)

The number of inpatient non-survivors

In [15]:
np.intersect1d(inpatient_ids, deceased_patients).shape
Out[15]:
(349,)
In [16]:
inpatient_ids.shape
Out[16]:
(1867,)

Health outcomes experienced by COVID-19 patients

The following table shows different health outcomes that were experienced by COVID-19 patients during the course of the disease.

In [17]:
analysis.outcome_table(inpatient_ids, survivor_ids, deceased_patients, conditions)
Out[17]:
outcome total percent of inpatient survivors percent survivors non survivors percent non survivors
0 Sepsis 684 0.365580 339 0.222733 345 0.988539
1 Respiratory Failure 861 0.460182 520 0.341656 341 0.977077
2 ARDS 234 0.125067 8 0.005256 226 0.647564
3 Heart Failure 145 0.077499 18 0.011827 127 0.363897
4 Septic Shock 168 0.089792 0 0.000000 168 0.481375
5 Coagulopathy 127 0.067878 11 0.007227 116 0.332378
6 Acute Cardiac Injury 137 0.073223 4 0.002628 133 0.381089
7 Acute Kidney Injury 118 0.063068 2 0.001314 116 0.332378

Lab values for COVID-19 patients

The following code presents lab values taken for COVID-19 patients. Values are separated into survivors and non survivors.

The first block of code selects lab values of interest from all observations in the simulation.

In [18]:
lab_obs = observations[(observations.CODE == '48065-7') | (observations.CODE == '26881-3') | 
                          (observations.CODE == '2276-4') | (observations.CODE == '89579-7') |
                          (observations.CODE == '2532-0') | (observations.CODE == '731-0') |
                          (observations.CODE == '14804-9')
                      ]

Select COVID-19 conditions out of all conditions in the simulation

In [19]:
covid_conditions = conditions[conditions.CODE == 840539006]

Merge the COVID-19 conditions with the patients

In [20]:
covid_patients = covid_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')

Add an attribute to the DataFrame indicating whether this is a survivor or not.

In [21]:
covid_patients['survivor'] = covid_patients.PATIENT.isin(survivor_ids)

Reduce the columns on the DataFrame to ones needed

In [22]:
covid_patients = covid_patients[['START', 'PATIENT', 'survivor', 'CODE']]

Calculate attributes needed to support the plot. Also coerce all lab values into a numeric data type.

In [23]:
covid_patients_obs = covid_patients.merge(lab_obs, on='PATIENT')
covid_patients_obs['START'] = pd.to_datetime(covid_patients_obs.START)
covid_patients_obs['DATE'] = pd.to_datetime(covid_patients_obs.DATE)
covid_patients_obs['lab_days'] = covid_patients_obs.DATE - covid_patients_obs.START
covid_patients_obs['days'] = covid_patients_obs.lab_days / np.timedelta64(1, 'D')
covid_patients_obs['VALUE'] = pd.to_numeric(covid_patients_obs['VALUE'], errors='coerce')
In [24]:
loinc_to_display = {'CODE_y = 48065-7': 'D-dimer', 'CODE_y = 2276-4': 'Serum Ferritin',
                    'CODE_y = 89579-7': 'High Sensitivity Cardiac Troponin I',
                    'CODE_y = 26881-3': 'IL-6', 'CODE_y = 731-0': 'Lymphocytes',
                    'CODE_y = 14804-9': 'Lactate dehydrogenase'}
catplt = sns.catplot(x="days", y="VALUE", hue="survivor", kind="box", col='CODE_y', 
            col_wrap=2, sharey=False, sharex=False, data=covid_patients_obs, palette=["C1", "C0"])

for axis in catplt.fig.axes:
    axis.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
    axis.xaxis.set_major_locator(ticker.MultipleLocator(base=4))
    axis.set_title(loinc_to_display[axis.title.get_text()])
        
plt.show()
In [25]:
loinc_to_display = {'CODE_y = 48065-7': 'D-dimer', 'CODE_y = 2276-4': 'Serum Ferritin',
                    'CODE_y = 89579-7': 'High Sensitivity Cardiac Troponin I',
                    'CODE_y = 26881-3': 'IL-6', 'CODE_y = 731-0': 'Lymphocytes',
                    'CODE_y = 14804-9': 'Lactate dehydrogenase'}
catplt = sns.catplot(x="days", y="VALUE", hue="survivor", kind="point", col='CODE_y', 
            col_wrap=2, sharey=False, sharex=False, data=covid_patients_obs, palette=["C1", "C0"])

for axis in catplt.fig.axes:
    axis.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
    axis.xaxis.set_major_locator(ticker.MultipleLocator(base=4))
    axis.set_title(loinc_to_display[axis.title.get_text()])

plt.show()

Set up a new DataFrame with boolean columns representing various outcomes, like admit, recovery or death

In [26]:
cp = covid_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
isolation_ids = care_plans[(care_plans.CODE == 736376001) & (care_plans.REASONCODE == 840539006)].PATIENT
cp['isolation'] = cp.Id.isin(isolation_ids)
cp['admit'] = cp.Id.isin(inpatient_ids)
cp['recovered'] = cp.Id.isin(survivor_ids)
cp['death'] = cp.DEATHDATE.notna()
icu_ids = encounters[encounters.CODE == 305351004].PATIENT
cp['icu_admit'] = cp.Id.isin(icu_ids)
vent_ids = procedures[procedures.CODE == 26763009].PATIENT
cp['ventilated'] = cp.Id.isin(vent_ids)

Outcomes for all COVID-19 Patients

This code builds a new DataFrame for the purposes of display. The DataFrame contains the percentages of patients that experience a particular outcome. Percentages are then provided for only hospitalized patients, ICU admitted patients and ventilated patients.

In [27]:
hospitalized = (cp.admit == True)
icu = (cp.icu_admit == True)
vent = (cp.ventilated == True)
covid_count = cp.Id.size
row_filters = {'Home Isolation': (cp.isolation == True), 'Hospital Admission': hospitalized, 'ICU Admission': icu,
 'Ventilated': vent, 'Recovered': (cp.recovered == True), 'Death': (cp.death == True)}

table_rows = []
for category, row_filter in row_filters.items():
    row = {'Outcome': category}
    row['All Patients'] = cp[row_filter].Id.size / covid_count
    row['Hospitalized'] = cp[row_filter & hospitalized].Id.size / hospitalized.value_counts()[True]
    row['ICU Admitted'] = cp[row_filter & icu].Id.size / icu.value_counts()[True]
    row['Required Ventilation'] = cp[row_filter & vent].Id.size / vent.value_counts()[True]
    table_rows.append(row)
    
pd.DataFrame.from_records(table_rows)
Out[27]:
Outcome All Patients Hospitalized ICU Admitted Required Ventilation
0 Home Isolation 0.795238 0.032673 0.026667 0.024390
1 Hospital Admission 0.211678 1.000000 1.000000 1.000000
2 ICU Admission 0.042517 0.200857 1.000000 1.000000
3 Ventilated 0.032540 0.153723 0.765333 1.000000
4 Recovered 0.960658 0.815212 0.357333 0.163763
5 Death 0.040476 0.186931 0.645333 0.836237

Outcomes for ICU Admitted Patients

Essentially a sub table from above, looking only at ICU patients.

In [28]:
icu_only = cp[cp.icu_admit == True]

vent = (icu_only.ventilated == True)
covid_count = icu_only.Id.size
row_filters = {'Ventilated': vent, 'Recovered': (icu_only.recovered == True), 'Death': (icu_only.death == True)}

table_rows = []
for category, row_filter in row_filters.items():
    row = {'Outcome': category}
    row['ICU Admitted'] = icu_only[row_filter].Id.size / covid_count
    row['Required Ventilation'] = icu_only[row_filter & vent].Id.size / vent.value_counts()[True]
    table_rows.append(row)
    
pd.DataFrame.from_records(table_rows)
Out[28]:
Outcome ICU Admitted Required Ventilation
0 Ventilated 0.765333 1.000000
1 Recovered 0.357333 0.163763
2 Death 0.645333 0.836237

Start to build a DataFrame that we can use to look at other conditions in relation to COVID-19

In [29]:
covid_info = cp[['PATIENT', 'recovered', 'death', 'START', 'DEATHDATE', 'BIRTHDATE', 'GENDER', 'admit', 'icu_admit']]
In [30]:
covid_info = covid_info.rename(columns={'START': 'covid_start'})

Grab all of the conditions starting after January 20, 2020. This is a hack to get only conditions that are related to COVID-19. We will end up merging these with the COVID patients.

In [31]:
covid_related_conditions = conditions[pd.to_datetime(conditions.START) > pd.to_datetime('2020-01-20')]

This DataFrame will contain all conditions for COVID-19 patients, where START can be compared to covid_start to see how long after the COVID-19 diagnosis something happened.

In [32]:
covid_patient_conditions = covid_info.merge(covid_related_conditions, on='PATIENT')

Symptoms for all COVID-19 Patients

Generates a DataFrame with percentages of co-occurring conditions

In [33]:
analysis.symptom_table(covid_patient_conditions)
Out[33]:
Symptoms All Patients Percentage All Patients Count Survivor Percentage Survivor Count Non Survivor Percentage Non Survivor Count
0 Conjunctival Congestion 0.009298 82 0.008144 69 0.039216 14
1 Nasal Congestion 0.048418 427 0.049221 417 0.028011 10
2 Headache 0.139925 1234 0.139164 1179 0.154062 55
3 Cough 0.682504 6019 0.682956 5786 0.663866 237
4 Sore Throat 0.142420 1256 0.142705 1209 0.134454 48
5 Sputum Production 0.325207 2868 0.324245 2747 0.341737 122
6 Fatigue 0.385871 3403 0.384915 3261 0.411765 147
7 Hemoptysis 0.010772 95 0.010269 87 0.022409 8
8 Shortness of Breath 0.201383 1776 0.195703 1658 0.338936 121
9 Nausea 0.054768 483 0.054060 458 0.070028 25
10 Diarrhea 0.037759 333 0.036001 305 0.081232 29
11 Muscle Pain 0.148656 1311 0.147899 1253 0.168067 60
12 Joint Pain 0.148656 1311 0.147899 1253 0.168067 60
13 Chills 0.113278 999 0.113196 959 0.114846 41
14 Loss of Taste 0.515478 4546 0.515463 4367 0.518207 185

Symptoms for ICU admitted COVID-19 Patients

DataFrame limited to individuals admitted to the ICU

In [34]:
analysis.symptom_table(covid_patient_conditions, True)
Out[34]:
Symptoms All Patients Percentage All Patients Count Survivor Percentage Survivor Count Non Survivor Percentage Non Survivor Count
0 Conjunctival Congestion 0.037333 14 0.029851 4 0.041322 10
1 Nasal Congestion 0.026667 10 0.014925 2 0.033058 8
2 Headache 0.168000 63 0.171642 23 0.165289 40
3 Cough 0.704000 264 0.783582 105 0.661157 160
4 Sore Throat 0.138667 52 0.134328 18 0.140496 34
5 Sputum Production 0.360000 135 0.350746 47 0.363636 88
6 Fatigue 0.381333 143 0.380597 51 0.384298 93
7 Hemoptysis 0.021333 8 0.014925 2 0.024793 6
8 Shortness of Breath 0.376000 141 0.462687 62 0.330579 80
9 Nausea 0.074667 28 0.067164 9 0.078512 19
10 Diarrhea 0.053333 20 0.014925 2 0.074380 18
11 Muscle Pain 0.173333 65 0.171642 23 0.173554 42
12 Joint Pain 0.173333 65 0.171642 23 0.173554 42
13 Chills 0.133333 50 0.149254 20 0.123967 30
14 Loss of Taste 0.512000 192 0.529851 71 0.500000 121

Create a DataFrame with columns that show a condition's start and end in days relative to COVID-19 diagnosis. Also create a column that calculates the number of days between COVID-19 diagnosis and a person's death.

In [35]:
covid_patient_conditions['start_days'] = (pd.to_datetime(covid_patient_conditions.START) - pd.to_datetime(covid_patient_conditions.covid_start)) / np.timedelta64(1, 'D')
covid_patient_conditions['end_days'] = (pd.to_datetime(covid_patient_conditions.STOP) - pd.to_datetime(covid_patient_conditions.covid_start)) / np.timedelta64(1, 'D')
covid_patient_conditions['death_days'] = (pd.to_datetime(covid_patient_conditions.DEATHDATE) - pd.to_datetime(covid_patient_conditions.covid_start)) / np.timedelta64(1, 'D')

Timelines for hospitalized patients

These plots show the progression of COVID-19 related complications in hospitalized patients. The bars represent the average start and end time for the particular item. Arrows at the bottom show the time or start time of a particular item.

In [36]:
averages = analysis.select_condition_averages(covid_patient_conditions, ((covid_patient_conditions.recovered == True) &
                                                               (covid_patient_conditions.admit == True)))
analysis.survivor_timeline_plot(encounters, devices, averages, covid_patient_conditions, covid_info)
In [37]:
averages = analysis.select_condition_averages(covid_patient_conditions, ((covid_patient_conditions.death == True) &
                                                               (covid_patient_conditions.admit == True)))
analysis.non_survivor_timeline_plot(encounters, devices, averages, covid_patient_conditions, covid_info)

Timelines for ICU only patients

These plots refine the timelines to only patients with ICU admissions

In [38]:
averages = analysis.select_condition_averages(covid_patient_conditions, ((covid_patient_conditions.recovered == True) &
                                                               (covid_patient_conditions.icu_admit == True)))
analysis.survivor_timeline_plot(encounters, devices, averages, covid_patient_conditions, covid_info, True)
In [39]:
averages = analysis.select_condition_averages(covid_patient_conditions, ((covid_patient_conditions.death == True) &
                                                               (covid_patient_conditions.icu_admit == True)))
analysis.non_survivor_timeline_plot(encounters, devices, averages, covid_patient_conditions, covid_info)

Add an age column to the DataFrame for rows where the patient has died

In [40]:
covid_info.loc[covid_info.death == True, 'age'] = (pd.to_datetime(covid_info.DEATHDATE) - pd.to_datetime(covid_info.BIRTHDATE)) / np.timedelta64(1, 'Y')

Populate ages for survivors based on the current date

In [41]:
covid_info.loc[covid_info.recovered == True, 'age'] = (datetime.datetime.now() - pd.to_datetime(covid_info.BIRTHDATE)) / np.timedelta64(1, 'Y')

Create an age_range column that places individuals into 10 year age ranges, such as 0 - 10, 10 - 20, etc.

In [42]:
bins = list(range(0, 120, 10))
covid_info['age_range'] = pd.cut(covid_info.age, bins=bins)

Mortality by Age and Sex

A plot of deaths grouped by age range and gender.

In [43]:
chart = sns.catplot(x="age_range", kind="count", hue="GENDER", data=covid_info[covid_info.death==True]);
for axes in chart.axes.flat:
    axes.set_xticklabels(axes.get_xticklabels(), rotation=90)

A table view of the same information from above

In [44]:
covid_info[covid_info.death==True].groupby(['age_range', 'GENDER']).count()[['PATIENT']]
Out[44]:
PATIENT
age_range GENDER
(0, 10] F NaN
M NaN
(10, 20] F NaN
M NaN
(20, 30] F 2.0
M 8.0
(30, 40] F 4.0
M 7.0
(40, 50] F 4.0
M 6.0
(50, 60] F 19.0
M 28.0
(60, 70] F 25.0
M 36.0
(70, 80] F 48.0
M 50.0
(80, 90] F 21.0
M 44.0
(90, 100] F 13.0
M 13.0
(100, 110] F 17.0
M 11.0

Another table view of the mortality data, this time just grouped by age range

In [45]:
covid_info[covid_info.death==True].groupby(['age_range']).count()[['PATIENT']]
Out[45]:
PATIENT
age_range
(0, 10] 0
(10, 20] 0
(20, 30] 10
(30, 40] 11
(40, 50] 10
(50, 60] 47
(60, 70] 61
(70, 80] 98
(80, 90] 65
(90, 100] 26
(100, 110] 28

Build a DataFrame that shows the total count of a supply used on a given day

In [46]:
grouped_supplies = supplies.groupby(['DESCRIPTION', 'DATE']).sum()

Supply Usage

Small multiples plot of supply usage over time.

In [47]:
gs = grouped_supplies.reset_index()
gs['DATE'] = pd.to_datetime(gs.DATE)
g = sns.FacetGrid(gs, col="DESCRIPTION", col_wrap=3, sharey=False, height=3, aspect=2)
g = g.map(sns.lineplot, "DATE", "QUANTITY", marker=".")
for axes in g.axes.flat:
    title = axes.get_title()
    if 'glove' in title:
        axes.set_title('Gloves')
    else:    
        axes.set_title(title.replace("DESCRIPTION = ", "").replace(" (physical object)", ""))
    for tick in axes.get_xticklabels():
        tick.set_rotation(90)

A table showing total supplies used over the entire simulation

In [48]:
supplies.groupby(['DESCRIPTION']).sum()[['QUANTITY']]
Out[48]:
QUANTITY
DESCRIPTION
Alcohol disinfectant (substance) 25337
Antiseptic towelette (physical object) 200687
Basic endotracheal tube single-use (physical object) 287
Carbon dioxide breath analyzer (physical object) 287
Disposable air-purifying respirator (physical object) 46015
Endotracheal tube holder (physical object) 287
Endotracheal tube stylet single-use (physical object) 287
Face shield (physical object) 45154
Human plasma blood product (product) 9
Isolation gown reusable (physical object) 4946
Isolation gown single-use (physical object) 266552
Laryngoscope blade single-use (physical object) 287
Lubricant (physical object) 287
Nasogastric tube device (physical object) 287
Nitrile examination/treatment glove non-powdered sterile (physical object) 593604
Operating room gown single-use (physical object) 287
Protective glasses device (physical object) 861
Suction system (physical object) 287
Surgical cap single-use (physical object) 287
Syringe device (physical object) 574
Viral filter (physical object) 287

Build a DataFrame that has cumulative case counts over time

In [49]:
case_counts = conditions[conditions.CODE == 840539006].groupby('START').count()[['PATIENT']]
case_counts['total'] = case_counts['PATIENT'].cumsum()
case_counts = case_counts.rename(columns={'PATIENT': 'daily'})
case_counts = case_counts.reset_index()
case_counts['START'] = pd.to_datetime(case_counts.START)

Cumulative Case Count

Show total cases over time

In [50]:
axes = sns.lineplot(x='START', y='total', data=case_counts)
plt.xticks(rotation=90)
plt.show()

Medication Dispenses

This table shows medications dispensed to patients with COVID-19 since January 20, 2020.

In [51]:
covid_meds = medications[pd.to_datetime(medications.START) > pd.to_datetime('2020-01-20')]
covid_meds = covid_info.merge(covid_meds, on='PATIENT')
In [52]:
covid_meds.groupby(['DESCRIPTION']).sum()[['DISPENSES']].sort_values('DISPENSES', ascending=False).head(10)
Out[52]:
DISPENSES
DESCRIPTION
Acetaminophen 500 MG Oral Tablet 3693
0.4 ML Enoxaparin sodium 100 MG/ML Prefilled Syringe 3610
1 ML Enoxaparin sodium 150 MG/ML Prefilled Syringe 2110
NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT Metered Dose Inhaler 1819
1 ML Epoetin Alfa 4000 UNT/ML Injection [Epogen] 794
Hydrochlorothiazide 25 MG Oral Tablet 719
insulin human isophane 70 UNT/ML / Regular Insulin Human 30 UNT/ML Injectable Suspension [Humulin] 623
Simvastatin 10 MG Oral Tablet 591
amLODIPine 5 MG / Hydrochlorothiazide 12.5 MG / Olmesartan medoxomil 20 MG Oral Tablet 508
Atenolol 50 MG / Chlorthalidone 25 MG Oral Tablet 484

Hospital Day Stats

For patients with COVID-19, calculate the average hospital length of stay as well as total hospital days for all COVID-19 patients. Provide the same information for ICU patients

In [53]:
covid_hosp = analysis.create_covid_hosp(covid_info, encounters, {'admit': True}).reset_index()
covid_icu = analysis.create_covid_icu(covid_info, encounters)
pd.DataFrame.from_records([
    {
        'type': 'inpatient',
        'patients': covid_hosp.PATIENT.nunique(),
        'average stay': ((covid_hosp.STOP - covid_hosp.START) / np.timedelta64(1, 'D')).mean(),
        'total days': ((covid_hosp.STOP - covid_hosp.START) / np.timedelta64(1, 'D')).sum(),
    },
    {
        'type': 'ICU',
        'patients': covid_icu.PATIENT.nunique(),
        'average stay': ((pd.to_datetime(covid_icu.STOP) - pd.to_datetime(covid_icu.START, utc=True)) / np.timedelta64(1, 'D')).mean(),
        'total days': ((pd.to_datetime(covid_icu.STOP) - pd.to_datetime(covid_icu.START, utc=True)) / np.timedelta64(1, 'D')).sum(),
    }
])
Out[53]:
type patients average stay total days
0 inpatient 1867 15.145745 28277.105556
1 ICU 375 6.088204 2283.076389
In [54]:
device_codes = [448907002, 449071006, 36965003]
grouped_dev = devices[devices.CODE.isin(device_codes)].groupby(['DESCRIPTION', 'START']).count()
grouped_dev = grouped_dev.reset_index()
grouped_dev['START'] = pd.to_datetime(grouped_dev.START)

Device Usage

Show the number of devices used to treat COVID-19 over time.

In [55]:
g = sns.FacetGrid(grouped_dev.reset_index(), col="DESCRIPTION", col_wrap=3, sharey=False, height=3, aspect=2)
g = g.map(sns.lineplot, "START", "PATIENT", marker=".")
for axes in g.axes.flat:
    title = axes.get_title()
    axes.set_title(title.replace("DESCRIPTION = ", "").replace(" (physical object)", ""))
    for tick in axes.get_xticklabels():
        tick.set_rotation(90)
In [ ]: