
Establishment of a Chinese critical care database from electronic healthcare records in a tertiary care medical center

Critical care, or intensive care, is the medical specialty that provides emergency care to patients with life-threatening complications and injuries. The specialty is
characterized by the generation of large amounts of high-granularity data in routine practice. Currently, these data are archived in hospital information systems primarily to support
routine clinical practice. However, data scientists have noted that in-depth mining of such big data may provide insights into the pathophysiology of underlying diseases and into healthcare
practices. Several openly accessible critical care databases have been established, and they have supported hundreds of publications in scientific journals. However,
such work is still in its infancy in China. China is a large country with a huge patient population, which contributes to the generation of large healthcare databases in hospitals. In this data
descriptor article, we report the establishment of an openly accessible critical care database generated from the hospital information system.
Critically ill patients managed in the intensive care unit (ICU) are usually monitored closely for organ dysfunction and are treated intensively with a variety of supportive modalities1,2.
Vital signs, laboratory tests, and medical treatments are recorded at a higher frequency for these patients than for those treated in the general ward. Such daily intensive management produces a huge amount
of information, including medical orders, imaging studies, laboratory findings, and waveform signals. The data generation mechanisms may reflect key factors related to the healthcare system,
the pathophysiology of the underlying disease, and patients’ preferences and culture3. Thus, in-depth mining of such large databases, through risk factor analysis, predictive analytics,
and causal inference4,5,6, can provide more insight into clinical research questions. The knowledge obtained from data mining, once translated
into clinical practice, may improve clinical outcomes7,8.
In the current critical care research community, most published scientific reports do not make their original raw data freely accessible, partly because of confidentiality issues. The
unwillingness to share data makes it difficult to reproduce the reported results. Furthermore, the exploration of such a large database by a single research group could be biased and
limited. Thus, strenuous efforts have been made to encourage the scientific community to share raw data, efforts that are also supported by the open data campaign9,10. Several openly
accessible critical care databases have been established, mainly reflecting the healthcare systems of western countries11,12,13. China is a large country with a huge patient population. For
example, an estimated 3 million incident sepsis cases occurred in China in 2017, accounting for nearly 10% of global incident cases14. Chinese hospitals also have hospital information
systems that are distinct from those of western countries. However, hospital information systems in Chinese hospitals are mainly used for clinical practice and are far less developed for
research purposes. Data sharing is still in its infancy in the Chinese critical care community, which significantly impairs the transparency of scientific work and international
collaboration. To the best of our knowledge, two critical care databases have been established in China, focusing on pediatric critically ill patients and on patients with
infections15,16. Here, we report the establishment of a large critical care database comprising high-granularity data generated from the information system of a tertiary care university
hospital. Details of the database are reported in this paper to encourage new research through secondary analysis of the database.
The study was conducted at Zhejiang Provincial People’s Hospital, Zhejiang, China, from January 2012 to May 2022. All patients admitted to an ICU of the hospital were eligible. The hospital has
two ICUs: a comprehensive central ICU and an emergency ICU (EICU). No exclusion criteria were applied when enrolling subjects, because
patients excluded by one study might be eligible for another. Thus, we included all records in the information system related to ICU stays. The study was approved by
the ethics committee of Zhejiang Provincial People’s Hospital (approval number: QT2022185). Informed consent was waived by the institutional review board because of the
retrospective design of the study. The study was conducted in accordance with the Declaration of Helsinki.
The database is distributed as comma-separated values (CSV) files that can be imported into any relational database system. Each file contains a single table; the tables are explained in
the subsequent sections. Each individual subject is identified by a serial number (patient_SN) composed of digits and letters, such as “3c74cf74c36241b7082ec35e458279dc”. Each
hospital stay is denoted by a Hospital_ID, for example “9432117” or “336688072433”. Individual ICU stays can be identified from the HospitalTransfer table, which records
intrahospital transfer events for each subject. All tables use Hospital_ID to identify an individual hospital stay, and the HospitalTransfer table can be used to determine the ICU stays linked
to the same patient and/or hospitalization.
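The linkage described above can be sketched with a small, self-contained example. This is only an illustration under stated assumptions, not the authors' workflow: it uses Python's standard-library sqlite3 module as the relational database, and only patient_SN, Hospital_ID, and the HospitalTransfer table name come from the text. The "Baseline" table name, the sex, department, and transfer_order columns, and all row values are hypothetical stand-ins for the real files.

```python
import csv
import io
import sqlite3

# Toy stand-ins for two CSV files. Only patient_SN and Hospital_ID come from
# the text; every other column name and value here is hypothetical.
BASELINE_CSV = """patient_SN,Hospital_ID,sex
3c74cf74c36241b7082ec35e458279dc,9432117,Male
3c74cf74c36241b7082ec35e458279dc,336688072433,Male
"""
TRANSFER_CSV = """Hospital_ID,department,transfer_order
9432117,Emergency,1
9432117,ICU,2
336688072433,ICU,1
336688072433,General ward,2
"""


def load_csv(conn, table, text):
    """Create a table from CSV text, storing every column as TEXT."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)


conn = sqlite3.connect(":memory:")
load_csv(conn, "Baseline", BASELINE_CSV)
load_csv(conn, "HospitalTransfer", TRANSFER_CSV)

# Join on Hospital_ID and keep only rows with the (hypothetical) ICU label.
icu_stays = conn.execute(
    """
    SELECT b.patient_SN, b.Hospital_ID, t.transfer_order
    FROM Baseline AS b
    JOIN HospitalTransfer AS t ON t.Hospital_ID = b.Hospital_ID
    WHERE t.department = 'ICU'
    ORDER BY b.Hospital_ID
    """
).fetchall()

for row in icu_stays:
    print(row)
```

Identifiers are stored as text in this sketch because the example Hospital_ID values differ in length, and treating them as strings avoids any numeric reinterpretation on import.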
We recommend the R package tidyverse for the management of the relational database because of its capability to streamline the workflow from data management to statistical analysis and to
the training of machine learning models17. For large files, we recommend the data.table package to process the tabular data.
All tables are deidentified in accordance with the Health Insurance Portability and Accountability Act (HIPAA). All protected information is removed, including addresses, date of birth, date of
hospital admission, date of discharge, date of medical order, personal identifiers (e.g., name, phone number, social security number, and hospital number), and exact age on admission (age is discretized into
bins). When the dataset was created, patients were randomly assigned unique identifiers (patient_SN and hospital_ID), and the original hospital identifiers were not retained. As a result, the
identifiers in the database cannot be linked back to the original, identifiable data. All doctor, nurse, and pharmacist identifiers have also been removed to protect the privacy of contributing
providers.
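As a rough illustration of the two deidentification steps described above (random identifiers and age binning), a minimal sketch follows. It is not the authors' procedure: the bin edges, the original_patient_id field, the sample records, and the use of uuid4 to generate 32-character hexadecimal identifiers are all assumptions made for the example.

```python
import uuid

# Hypothetical bin edges; the text says only that age is discretized into bins.
AGE_BIN_EDGES = [18, 40, 60, 80]


def age_to_bin(age):
    """Map an exact age to a coarse label such as '40-59' or '80+'."""
    if age < AGE_BIN_EDGES[0]:
        return f"<{AGE_BIN_EDGES[0]}"
    for lower, upper in zip(AGE_BIN_EDGES, AGE_BIN_EDGES[1:]):
        if lower <= age < upper:
            return f"{lower}-{upper - 1}"
    return f"{AGE_BIN_EDGES[-1]}+"


def deidentify(records):
    """Replace original identifiers with random ones and coarsen age.

    The mapping from original to random identifier exists only inside this
    function and is discarded on return, so the output cannot be linked back.
    """
    mapping = {}
    cleaned = []
    for record in records:
        original_id = record["original_patient_id"]  # hypothetical field name
        if original_id not in mapping:
            mapping[original_id] = uuid.uuid4().hex  # 32 hex characters
        cleaned.append(
            {"patient_SN": mapping[original_id], "age_bin": age_to_bin(record["age"])}
        )
    return cleaned


# Hypothetical input records, not taken from the database.
raw = [
    {"original_patient_id": "A001", "age": 67},
    {"original_patient_id": "A001", "age": 68},
    {"original_patient_id": "B002", "age": 85},
]
deidentified = deidentify(raw)
for row in deidentified:
    print(row)
```

The same patient keeps one random identifier across records, which preserves the ability to link repeated admissions without exposing the original identifier.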
The database comprises 8180 unique hospital admissions for 7638 individual patients from January 2012 to May 2022 and is available at the PhysioNet repository18. Table 1 shows the baseline
demographics of hospital admissions. Of the 8180 admissions, 2965 were of female patients and 5215 were of male patients. The median length of hospital stay was 17 days (Q1 to Q3: 10 to 28). Male patients had a slightly
longer hospital stay.
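The summary statistics above can be reproduced in form with a short sketch. The admission counts are taken from the text; the length-of-stay values are hypothetical and chosen only to show how a median with first and third quartiles is computed. Quantile conventions differ between software packages, so the "inclusive" method used here is an assumption and may not match the one behind Table 1.

```python
import statistics

# Admission counts by sex as reported in the text.
female_admissions, male_admissions = 2965, 5215
total_admissions = female_admissions + male_admissions

# Hypothetical lengths of stay in days; not taken from the database.
length_of_stay = [3, 8, 10, 12, 15, 17, 19, 24, 28, 35, 60]

median = statistics.median(length_of_stay)
# n=4 returns the three quartile cut points; the middle one is the median.
q1, _, q3 = statistics.quantiles(length_of_stay, n=4, method="inclusive")

print(f"Admissions: {total_admissions}")
print(f"Length of stay: {median} days (Q1 to Q3: {q1} to {q3})")
```

The counts by sex sum to the reported number of hospital admissions, which is a quick consistency check worth repeating after importing the tables.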
The number of hospital admissions for ICU patients increased markedly after 2018 because bed numbers were expanded that year in both the comprehensive ICU and the emergency ICU
(Fig. 1). The distributions of hospital length of stay are shown in Fig. 2, restricting to patients with a length of stay (LOS)