About the Center for Big Data in Health Sciences (CBD-HS)
Revolutionizing public health through Big Data.
The Center for Big Data in Health Sciences is a coalition of faculty and staff from across the Texas Medical Center, including the School of Public Health, School of Biomedical Informatics, MD Anderson Cancer Center, McGovern Medical School and more, who are working together to solve public health problems with one of science’s most underused resources: Big Data.
- Build a national/international-level Big Data research program for biomedical and health sciences by developing and promoting the use of state-of-the-art Big Data analytic approaches and technologies
- Build a data-driven research platform to bridge the gap between the computational/quantitative scientists and biomedical/health investigators
- Support development of data science education programs to train the next generation of health data scientists
- Engage and develop partnerships with industries to promote individual health and community well-being by improving diagnosis, treatment, and prevention of diseases and injuries using Big Data
We are seeking CBD-HS members with expertise in the following areas:
- Statistical methodologies for Big Data analytics
- Bioinformatics data analysis and modeling: Omics data analysis and integration
- Biomathematical modeling and computational biology
- Big Data analytics software development
- Data mining and machine learning
- Expertise and experience in novel data types: text documents, audio, video, EMR, EHR, mHealth, imaging data, EEG, sensor-based data, wearable device data, GPS data, location-based data, social media data, network data, etc.
- High Performance Computing: parallel computing, cloud computing, high performance computing algorithms, numerical optimization algorithms
- Any clinical, biomedical and health science investigators who are interested in using Big Data for their research and practice
Current research initiatives
If you are interested in learning more about any of the research initiatives we are working on, or would like to get involved, please contact Kevin Banks.
GEO Big Data Project
- Develop scalable Big Data analytic pipelines to analyze a large number of time course gene expression data sets from the GEO data repository
- Develop a web-based collaboration platform to share the large number of analysis results with genetic and biomedical collaborators, in order to extract scientific insights and disseminate findings from the scalable analytic pipelines via publications
EHR Collaboration Working Group
Promoting collaborations between statisticians/data scientists and biomedical/clinical/epidemiological investigators to use EHR/EMR and medical insurance claim data to develop predictive models for disease risks, evaluate the effects of clinical treatments, and provide treatment recommendations based on real-world evidence
EHR Methodology Research Working Group
Developing novel statistical methods and predictive models for EHR and medical insurance claim data in order to address clinical and public health questions
UK Biobank Research Working Group
Developing novel predictive models and statistical methods to integrate heterogeneous types of data from the UK Biobank study to address epidemiological and public health questions
How can we help you?
Contributions to the UTHealth community: Collaboration/consulting service and support
We provide collaboration support and consulting services to biomedical and health science investigators:
- Design research projects and tools/strategies for Big Data collection
- Develop database or data warehouse for Big Data management
- Big Data harmonization and integration
- Big Data visualization
- Big Data analytics
- Big Data modeling and predictions
- A Big Data research platform for data identification, management, integration, visualization, analytics, modeling and prediction will be developed to support Big Data research at UTHealth.
We will actively develop collaborations and partnerships with related industries, including local companies and national and international corporations that may own Big Data and need analytic support. This will benefit our Center's faculty in their research and will also give our students more opportunities for summer internships and jobs.
CBD-HS available resources
Data Resources, Cerner Health Facts
The Cerner Health Facts database covers all of the health care records for 85 systems with 750 facilities in the United States from 2000 to 2018. The patient-level data in Cerner includes longitudinal encounters with detailed records of diagnoses, medications, clinical events, procedures and lab procedures. It represents a total of 69 million unique patients across the United States. Of the 69 million patients, 52% are female and 42% are male (6% are gender-unidentified). The racial makeup of the 69 million patients is 49.5% Caucasian, 11.8% African American, 2.9% Hispanic, 1.8% Asian and Native American, less than 1% Pacific Islander, Middle Eastern Indian, and 16.4% racial status unidentified. By marital status, 33% of patients are married, 22.6% single, 3.3% divorced, and 3% widowed; the marital status of the remainder is unidentified. The mean patient age is 46.8 years, with a range of 0 to 90 years. In total, the database includes 487 million unique encounters with 939 million diagnoses, coded using the International Classification of Diseases (ICD-9). The database has 674 million medication records, 118 million procedure records, 5.3 billion clinical event records and 4.2 billion lab procedure records.
The Department of Biostatistics and Data Science hosts state-of-the-art high performance computing equipment. This includes two recently acquired HPE servers, each with 36 cores (72 threads), 768 GB of memory, and 2 x NVIDIA V100 16GB GPUs. The two servers are connected to an HPE 3PAR storage node with 192 TB of capacity by 2 x 10Gbps fiber, and are clustered into a Hadoop/HBase/Spark system for big data analysis. The department also has 3 other servers, shown in the figure below. [PHOTO]
A team of highly skilled technical staff provides support for computing, data management, and networking. The Department has a programmer analyst/system administrator to install, maintain, and manage all hardware, software, and networks, and a database manager to assist with database creation, manipulation and retrieval. The School of Public Health also provides additional assistance through the IT Department with computer services, network services, telecom, administrative support, and help desk.
Texas Advanced Computing Center (TACC)
The Texas Advanced Computing Center (TACC) is a service available to UT researchers that helps them use powerful advanced computing technologies. TACC designs and deploys the world's most powerful advanced computing technologies and innovative software solutions. TACC's environment includes a comprehensive cyberinfrastructure ecosystem of leading-edge resources in high performance computing, visualization, data analysis, storage, archive, cloud, data-driven computing, connectivity, tools, APIs, algorithms, consulting, and software. TACC provides systems and software support to researchers and has worked on over 3000 projects by more than 1000 researchers at over 350 institutions nationally and worldwide that address scientific concepts to improve the quality of life. TACC has a number of HPC clusters and storage systems. “Stampede” has 6400 computing nodes, 102,656 cores, 205 terabytes of memory and a peak performance of 10 petaflops (PF); it ranked #10 in the world on the Top500 Supercomputers list of November 2015. “Lonestar”, to which UT System institution investigators have exclusive access, has 1901 computing nodes, 22,256 cores and 302 TF theoretical peak performance. “Corral” is a collection of storage and data management resources primarily located at TACC, with 5 petabytes of storage installed in the UT data centers at TACC and in Arlington, and an additional petabyte of unreplicated storage for low-latency applications.
Additional databases and centers