MIMIC-III and eICU Data Processing using Google BigQuery

December 14, 2021

In this article, I will take you through how Google BigQuery can be used to perform typical data processing steps preceding any machine learning model training and fitting. As an example, we will use the medical use case of training a model to predict cardiac arrest on ICU patients.

The project partner: The use case in this case study stems from seed-stage startup Transformative.ai who hosted an Omdena Challenge as part of Omdena´s AI Incubator for impact startups.

Background on the cardiac prediction data challenge

SCA (sudden cardiac arrest) is a medical emergency in which the heart suddenly stops beating, killing the patient within minutes. Survival rates for SCA are <25% within hospitals. SCA can be prevented if the underlying cause is identified and treated. The purpose of the project was to expand the cardiac arrest prediction algorithm to pulseless electrical activity and asystole, providing an all-cause cardiac arrest prediction algorithm for more than 90% of patients. More details on this challenge are available here.

Data needed for building cardiac arrest prediction models were sourced from 2 sources, MIMIC-III, and eICU.

MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital.
eICU is a collaborative research database and is populated with data from a combination of many critical care units throughout the continental United States. The data in the collaborative database covers patients who were admitted to critical care units in 2014 and 2015.

Access to both of these databases can be obtained via the physionet website. Once access is received, the data is available for querying in the bigquery datasets as shown in figure 1 below.

Figure 1: mimic-iii and eICU datasets in google big query (after access is received).

Figure 2 below covers typical data processing steps before fitting a machine learning model to the data. These steps involve extracting, cleaning the data, imputation of missing values, shaping the data, and performing feature engineering. Most of the time the data processing is done in python using pandas, NumPy, etc. Here we perform all data processing steps using Google BigQuery SQL. I will cover each of these steps in detail below.

Figure 2: Data processing steps.

For the ease of understanding the queries covered in this article, dataset and table names are provided for reference in tables 1 and 2 below.

Dataset	Description
mimiciii_clinical	Mimic_iii clinical dataset.
eicu_crd	eICU dataset.
mimic_iii_staging	Staging dataset created for data processing for mimic_iii data.
eICU_staging	Staging dataset created for data processing for mimic_iii data.

Table 1: Dataset names used in the article.

Table	Description
feature_lookup	Reference table listing item codes and vital names to be extracted. Also, contain if the item code reading needs to be standardised with respect to the unit of measurement.
variable_range	Reference table containing valid range for each vital.
vital_std	Intermediate table containing data after the vital readings are converted into the same unit of measurement.
vital_outlier	Intermediate table containing data after the outliers are handled for the vital readings.
vital_aggs	Intermediate table containing aggregated vital readings.
vital_aggs_pivot	Intermediate table containing pivoted vital readings.
vital_imputed	Intermediate table containing imputed vital readings.

Table 2: Table names used in the article.

1. Extracting Data

The first step of any data processing step is to extract data from the source system. This can be achieved easily using SQL since both MIMIC-III and eICU data are hosted in Google BigQuery. One of the key features that the cardiac arrest prediction model would need would be the patient’s vitals i.e. systolic blood pressure, diastolic blood pressure, temperature, heart rate, etc.

Let’s first extract vitals from both the source systems using SQL. For simplicity, only heart rate vital is extracted by the below query, and results are presented in Figures 3 and 4 below. An ICU stay is unique and is identified by an ICU stay id in the data source.

-- heart rate for every ICU Stay for a patient from MIMIC-III
select
 ce.icustay_id -- icu stay id
,ce.charttime -- date/time when reading was taken
,di.label -- vital name
,ce.value as vital_reading -- vital reading
from
`physionet-data.mimiciii_clinical.chartevents` ce
join `physionet-data.mimiciii_clinical.d_items` di
ON ce.itemid = di.itemid
where lower(di.LABEL) = 'heart rate'
limit 10;

Figure 3: Heart rate vital extracted from the mimic-iii data source.

-- heart rate for every ICU Stay for a patient from eICU
select
 d.patientunitstayid -- icu stay id
,d.nursingchartoffset -- date/time offset when reading was taken
,lower(d.nursingchartcelltypevallabel) as vital_name -- vital name
,d.nursingchartvalue as vital_reading -- vital reading
from `physionet-data.eicu_crd.nursecharting` d
where lower(d.nursingchartcelltypevallabel) = 'heart rate'
and lower(d.nursingchartcelltypevalname) = 'heart rate'
limit 10;

Figure 4: Heart rate vital extracted from eICU data source.

In order to extract all the required vitals using the same query used above but without writing each vital name explicitly, we build a simple reference table called “feature_lookup”, which contains a list of all vitals we want to extract from sources. For the scope of this article, we focus on 7 key vitals – heart rate, systolic & diastolic blood pressure, temperature, Glasgow coma scale, respiratory rate, and oxygen saturation. This feature_lookup reference table can be built manually using a .csv file or google sheet and uploaded into the big query dataset mimic_iii_staging.

A peek into the feature lookup CSV file is shown below in figure 5 for the mimic-iii data source. The feature_lookup table contains the item_codes for each vital label in the mimic-iii database as well as whether the vital label may be needed to convert into a single unit of measurement. A similar one can be built for eICU data too.

Figure 5: Sample feature lookup reference table for mimic-iii.

Now we are ready to extract all the 7 vitals as specified in the feature_lookup. We rewrite our above SQL query using the feature_lookup table and remove the hard-coded where clause included in previous queries.

-- all vitals for every ICU Stay for a patient from MIMIC-III
select
ce.icustay_id -- icu stay id
,ce.charttime -- date/time when reading was taken
,di.label -- vital name
,ce.value as vital_reading -- vital reading
from
`physionet-data.mimiciii_clinical.chartevents` ce
join `physionet-data.mimiciii_clinical.d_items` di
ON ce.itemid = di.itemid
join `mimic_iii_staging.feature_lookup` fl
on di.itemid = fl.item_code
where lower(di.LABEL) = 'heart rate'
limit 10;

Figure 6: Vitals data from the mimic-iii data source.

-- all vitals for every ICU Stay for a patient from eICU
select
d.patientunitstayid -- icu stay id
,d.nursingchartoffset -- date/time offset when reading was taken
,lower(d.nursingchartcelltypevallabel) as vital_name -- vital name
,d.nursingchartvalue as vital_reading -- vital reading
from `physionet-data.eicu_crd.nursecharting` d
join `eICU_staging.feature_lookup` f
on d.nursingchartcelltypevallabel = f.string_field_0
and d.nursingchartcelltypevalname = f.string_field_1
where lower(d.nursingchartcelltypevallabel) = 'heart rate'
and lower(d.nursingchartcelltypevalname) = 'heart rate'
limit 10;

Figure 7: Vitals data from eICU data source.

Unlike mimic-iii data, the eICU data do not contain chart time but instead contain an offset which is an offset in minutes from the ICU admit time. We can handle this by creating a dummy chart time to make it consistent with mimic-iii data. For brevity purposes, I have covered queries for mimic-iii data only, but the queries for eICU would be very similar to the mimic-iii queries.

2. Cleaning Data

Next, we will focus on cleaning the data. This step involves standardising the vitals in a single unit of measurement and handling outliers.

2.1 Standardisation Of Unit

Some of the vitals in the source data were captured in different units, for example, capturing temperature is stored in Fahrenheit vs in Celsius.

As seen in section 1, feature_lookup table already specifies which vitals and measurement units need to be standardized and we leverage that in the below query for mimic-iii. Using a SQL case statement, if the vital needs to be converted we apply the necessary formula for converting the value from one unit to another. Once the vitals are standardized, we store the output results in the intermediate table vital_std, so the downstream processing can refer to this table.

-- standardisation of vitals in mimic-iii data

create or replace table mimic_iii_staging.vital_std

as

SELECT

ce.icustay_id -- icu stay id

,ce.charttime -- date/time when reading was taken

,fl.level_0 as vital_name -- vital name

,ce.value as vital_reading -- vital reading

,case

when conversion_unit = 'celsius' and SAFE_CAST(REGEXP_REPLACE(ce.value,'[^0-9.]','') AS FLOAT64) > 79 then (SAFE_CAST(REGEXP_REPLACE(ce.value,'[^0-9.]','') AS FLOAT64) - 32) * 5./9

when conversion_unit = 'percent' and SAFE_CAST(REGEXP_REPLACE(ce.value,'[^0-9.]','') AS FLOAT64) <= 1 then SAFE_CAST(REGEXP_REPLACE(ce.value,'[^0-9.]','') AS FLOAT64) * 100

else SAFE_CAST(REGEXP_REPLACE(ce.value,'[^0-9.]','') AS FLOAT64) end as std_vital_reading -- standard vital reading

FROM

`physionet-data.mimiciii_clinical.chartevents` ce

join `mimic_iii_staging.feature_lookup` fl

on ce.itemid = fl.item_code;

Figure 8: Table vital_std after standardising all vitals into consistent units of measurement.

2.2 Outlier Handling

Once the measurements are standardised, next comes the outlier handling. Outliers can have a disproportionate effect on the machine learning models and can skew the results. For ICU vitals, we define the outlier ranges for each vitals in the variable_range table in the mimic_iii_staging and eICU_staging dataset.

The variable_range table data is sourced from the MIMIC-Extract paper and the outlier handling methodology described in the paper is used in the query below. The variable_range table in the big query is first created by uploading the variable_range.csv file available on the git repo from the MIMIC-Extract paper. The variable_range table snapshot is shown in figure 9 below.

Figure 9: Table variable_range used as a reference for outlier range.

We now join the earlier created vital_std table with the variable_range table to handle outliers and store data in the new table vital_outlier. If the vital value is outside the outlier boundary then the vital value is replaced with null. If the vital is inside the outlier boundary, then a valid high or valid low value is applied.

The outlier query is shown below. The output of the outlier handling is stored in the table vital_outlier. A similar query is executed for the eICU data by joining the nursecharting table with the variable_range reference table.

--outlier handling for mimic-iii data
create or replace table mimic_iii_staging.vital_outlier
as
SELECT
 vs.icustay_id
, vs.charttime
, vs.vital_name --vital name
, case
when vs.std_vital_reading < OUTLIER_LOW or vs.std_vital_reading > OUTLIER_HIGH then Null
when vs.std_vital_reading < VALID_LOW then VALID_LOW
when vs.std_vital_reading > VALID_HIGH then VALID_HIGH
else vs.std_vital_reading
end as outlier_handled_vital_reading --outlier corrected vital reading
, vr.*
FROM `mimic_iii_staging.vital_std` vs
left join `mimic_iii_staging.variable_range` vr
on lower(vs.vital_name) = lower(vr.LEVEL2);

Figure 10: Table vital_outlier after outliers are handled for mimic-iii data.

3. Aggregating and Pivoting

The vital data further needs to be aggregated as the vital readings are captured at different hours and times for each patient in the ICU and may also differ in their frequency. We aggregate the data by taking a mean vital reading for each hour for each patient.

For e.g., if there are 3 heart rate readings between 2:00 pm and 3:00 pm, then a single mean heart rate across those 3 readings would be stored against 3 pm vital reading charttime for an ICU stay. Again, big query SQL makes it easy to aggregate the data. The output of the aggregation step is shown in figure 11.

--aggregate vital readings by hour for each icu stay for mimic-iii data
create or replace table `mimic_iii_staging.vital_aggs`
as
with icu_vital_data
as
(
select
o.icustay_id
,DATETIME_TRUNC(i.intime, HOUR) as icu_intime -- round it to the nearest hour
,o.charttime - DATETIME_TRUNC(i.intime, HOUR) as diff_chart_intime -- difference between charttime and icu admitted time
,o.vital_name
,o.outlier_handled_vital_reading
from
`mimic_iii_staging.vital_outlier` o
left join `physionet-data.mimiciii_clinical.icustays` i
on o.icustay_id = i.ICUSTAY_ID
)
select
 icustay_id
,icu_intime
,EXTRACT(day from diff_chart_intime) * 24 + EXTRACT(HOUR from diff_chart_intime) + case when EXTRACT(MINUTE from diff_chart_intime) >=1 orEXTRACT(SECOND from diff_chart_intime) >=1 then 1 else 0 end as hour_from_intime -- number of hours from icu admitted time
,replace(lower(vital_name),' ','_') as feature_name
,avg(outlier_handled_vital_reading) as feature_mean_value
from icu_vital_data
group by icustay_id, icu_intime, hour_from_intime, feature_name

Figure 11: Table vital_aggs after vitals are aggregated by the hour for each ICU stay for mimic-iii data.

After aggregation, we now need to pivot the data, so each vital becomes a column and row is an instance of reading at a particular hour for an ICU stay. This will make the data for machine learning algorithms to train on.

We can pivot the data using the PIVOT functionality in google big query. For using the pivot, we first create a label string containing all the vital names we want to convert into columns. Then using the label string, we use the pivot to create a column for each vital feature. The output of this process is stored in table vital_aggs_pivot and can be seen in figure 12 below.

--pivot aggregate vital readings by hour for each icu stay for mimic-iii data
DECLARE labels STRING;
SET labels = (
SELECT
CONCAT('("', STRING_AGG(DISTINCT feature_name, '", "'), '")'),
from `mimic_iii_staging.vital_aggs`
);
EXECUTE IMMEDIATE format("""
create or replace table `mimic_iii_staging.vital_aggs_pivot`
as
select * from
(
select
icustay_id,
icu_intime,
hour_from_intime,
feature_name,
feature_mean_value
from `mimic_iii_staging.vital_aggs`
)
PIVOT
(
avg(feature_mean_value) as p
for feature_name in %s
)
ORDER BY hour_from_intime ASC
""", labels);

Figure 12: Table vital_aggs_pivot after vitals are pivoted for each ICU stay for mimic-iii data.

4. Imputation

The last data processing step before feature engineering is imputing the missing vital readings for patients. For imputation, the imputation process is inspired by the paper predicting circulatory failure. The imputation method is very simple.

If the first ICU reading is missing, we impute it with a normal value for that vital, which is also sourced from the same paper. Then the missing vital reading is imputed using the last reading available for that vital and then if it is still missing the mean for that vital reading is used to impute the missing value. This can be achieved using SQL as shown in the below queries, which are executed in sequence. Big query SQL provides analytic functions which allow it to compute values over a group of rows and returns a single result for each row.

The below query uses the last_value() to get the value of a particular vital over a window which acts as a partition on the data. For this data, the data is partitioned by an ICU stay and ordered by the time when the reading was recorded. A similar analytic function is used for getting the mean of a vital for an ICU stay. The queries create a final table vital_imputed which contains the imputed value for missing vital readings as shown in figure 13c.

--impute if first icu reading is missing for mimic-iii data
create or replace table `mimic_iii_staging.vital_impute_first_reading`
as
-- fill first missing reading with normal vital values
select
icustay_id,
icu_intime,
hour_from_intime,
case when  p_diastolic_blood_pressure is null or p_diastolic_blood_pressure = 0 then 75 else p_diastolic_blood_pressure end as p_diastolic_blood_pressure,
case when p_heart_rate is null or p_heart_rate= 0 then 70  else p_heart_rate end as p_heart_rate,
case when p_oxygen_saturation is null or p_oxygen_saturation= 0 then 98 else p_oxygen_saturation end as p_oxygen_saturation,
case when p_temperature is null or p_temperature= 0 then 37 else p_temperature end as p_temperature,
case when p_systolic_blood_pressure is null or p_systolic_blood_pressure= 0 then 125 else p_systolic_blood_pressure end as p_systolic_blood_pressure,
case when p_respiratory_rate is null or p_respiratory_rate= 0 then  12 else p_respiratory_rate end as p_respiratory_rate,
case when p_glascow_coma_scale_total is null or p_glascow_coma_scale_total= 0 then  15 else p_glascow_coma_scale_total end as p_glascow_coma_scale_total
from
mimic_iii_staging.vital_aggs_pivot a
where a.hour_from_intime =
(
select min(b.hour_from_intime) from mimic_iii_staging.vital_aggs_pivot b
where a.icustay_id = b.icustay_id
)
union all
-- other records
select
icustay_id,
icu_intime,
hour_from_intime,
p_diastolic_blood_pressure,
p_heart_rate,
p_oxygen_saturation,
p_temperature,
p_systolic_blood_pressure,
p_respiratory_rate,
p_glascow_coma_scale_total
from
mimic_iii_staging.vital_aggs_pivot a
where a.hour_from_intime not in
(
select min(b.hour_from_intime) from mimic_iii_staging.vital_aggs_pivot b
where a.icustay_id = b.icustay_id
)
--impute and fill reading if the last hour has a reading for mimic-iii data
create or replace table `mimic_iii_staging.vital_impute_last_reading`
as
select
icustay_id,
icu_intime,
hour_from_intime,
case when p_diastolic_blood_pressure is null or p_diastolic_blood_pressure = 0 then
LAST_VALUE(p_diastolic_blood_pressure IGNORE NULLS)
over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
else p_diastolic_blood_pressure end as i_diastolic_blood_pressure,
case when p_heart_rate is null or p_heart_rate = 0 then
LAST_VALUE(p_heart_rate IGNORE NULLS)
over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
else p_heart_rate end as i_heart_rate,
case when p_oxygen_saturation is null or p_oxygen_saturation = 0 then
LAST_VALUE(p_oxygen_saturation IGNORE NULLS)
over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
else p_oxygen_saturation end as i_oxygen_saturation,
case when p_systolic_blood_pressure is null or p_systolic_blood_pressure = 0  then
LAST_VALUE(p_systolic_blood_pressure IGNORE NULLS)
over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
else p_systolic_blood_pressure end as i_systolic_blood_pressure,
case when p_respiratory_rate is null or p_respiratory_rate = 0  then
LAST_VALUE(p_respiratory_rate IGNORE NULLS)
over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
else p_respiratory_rate end as i_respiratory_rate,
case when p_temperature is null or p_temperature = 0 then
LAST_VALUE(p_temperature IGNORE NULLS)
over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
else p_temperature end as i_temperature,
case when p_glascow_coma_scale_total is null or p_glascow_coma_scale_total = 0 then
LAST_VALUE(p_glascow_coma_scale_total IGNORE NULLS)
over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
else p_glascow_coma_scale_total end as i_gcs
FROM `mimic_iii_staging.vital_impute_first_reading`
--impute using mean of the vital for each icu stay for mimic-iii data
create or replace table `mimic_iii_staging.vital_imputed`
as
select
icustay_id,
icu_intime,
hour_from_intime,
case when i_diastolic_blood_pressure is null or i_diastolic_blood_pressure = 0 then
avg(i_diastolic_blood_pressure) over (partition by icustay_id)
else i_diastolic_blood_pressure end as i_diastolic_blood_pressure,
case when i_heart_rate is null or i_heart_rate = 0 then
avg(i_heart_rate) over (partition by icustay_id)
else i_heart_rate end as i_heart_rate,
case when i_oxygen_saturation is null or i_oxygen_saturation = 0 then
avg(i_oxygen_saturation )
over (partition by icustay_id )
else i_oxygen_saturation end as i_oxygen_saturation,
case when i_systolic_blood_pressure is null or i_systolic_blood_pressure = 0  then
avg(i_systolic_blood_pressure )
over (partition by icustay_id )
else i_systolic_blood_pressure end as i_systolic_blood_pressure,
case when i_respiratory_rate is null or i_respiratory_rate = 0  then
avg(i_respiratory_rate )
over (partition by icustay_id )
else i_respiratory_rate end as i_respiratory_rate,
case when i_temperature is null or i_temperature = 0 then
avg(i_temperature )
over (partition by icustay_id )
else i_temperature end as i_temperature,
case when i_gcs is null or i_gcs = 0 then
avg(i_gcs )
over (partition by icustay_id )
else i_gcs end as i_gcs
FROM
`mimic_iii_staging.vital_impute_last_reading`;

Figure 13a: Table vital_impute_first_reading with first ICU vital reading replaced by normal values for the vital.

Figure 13b: Table vital_impute_last_reading with missing values replaced by the last reading of that vital.

Figure 13c: Table vital_imputed with missing values imputed.

5. Feature Engineering For Time Series

For deep learning models like Recurrent Neural Networks (RNN), this step is not needed as the models are capable of finding relationships between different time step data. For machine learning models like decision trees, logistic regression, etc., it is useful to create time-series features like minimum, maximum, mean for each vitals for different time series windows of 4,6,8 hours. We leverage the SQL window functions for the same. A sample time-series feature for diastolic blood pressure is shown in the query below and the last reading of the patient for an ICU stay is filtered. This query can be extended to other 6 vital features. This query will output one single row for each ICU stay as shown in figure 14.

-- time-series features for diastolic blood pressure vital using window functions
with agg_time_series_features
as
(
select
icustay_id,
icu_intime,
hour_from_intime,
i_diastolic_blood_pressure,
max(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) max_4_hr_diastolic_bp,
max(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) max_6_hr_diastolic_bp,
max(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 8 PRECEDING AND CURRENT ROW) max_8_hr_diastolic_bp,
min(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) min_4_hr_diastolic_bp,
min(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) min_6_hr_diastolic_bp,
min(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 8 PRECEDING AND CURRENT ROW) min_8_hr_diastolic_bp,
avg(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) avg_4_hr_diastolic_bp,
avg(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) avg_6_hr_diastolic_bp,
avg(i_diastolic_blood_pressure) over (partition by icustay_id order by hour_from_intime ASC ROWS BETWEEN 8 PRECEDING AND CURRENT ROW) avg_8_hr_diastolic_bp,
rank()over (partition by icustay_id order by hour_from_intime desc) as row_number_reverse
from `mimic_iii_staging.vital_imputed` a
)
select * from agg_time_series_features
where row_number_reverse = 1

Figure 14: Time-series features for machine learning

All the above queries can be extended to include lab values like sodium, potassium, etc. for a patient and passed on to machine learning models as features. The above queries can also be automated from python scripts using the big query API for python.

After the feature engineering step, the data is ready to be used for the training of machine learning models.

Thank you to Omdena for providing this opportunity and a big shout-out to Sijuade Oguntayo for his support and inspiration throughout the project.

This article is written by Sanjana Tule.

Related: Time-Series Classification Tutorial: Combining Static and Sequential Feature Modeling using RNNs