jhajagos/PreparedSource2OHDSI


Scaling EHR (Electronic Health Record) data mapping to the OHDSI CDM

The goal of this project is to scale the mapping of clinical data to the OHDSI CDM (Common Data Model), the data standard for analytics on EHR data. Scalable compute is provided by an Apache Spark (>= 3.0) environment. Data is written in the Apache Parquet format and can be either queried directly or staged into a relational SQL database.
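Generated Parquet output can be queried from a Spark session without any database load. A minimal sketch of that direct-query path, assuming an existing `SparkSession` and a hypothetical output directory layout (the helper names and paths here are illustrations, not this repository's API):

```python
def query_cdm_parquet(spark, parquet_dir, sql_text, view_name="cdm_table"):
    """Register a generated Parquet directory as a temporary view and query it.

    `spark` is an existing pyspark.sql.SparkSession; the directory layout
    is an assumption about where the mapper wrote its output.
    """
    spark.read.parquet(parquet_dir).createOrReplaceTempView(view_name)
    return spark.sql(sql_text)


def count_by_sql(view_name, column):
    """Build a simple count-by-column query for a registered view."""
    return f"SELECT {column}, COUNT(*) AS n FROM {view_name} GROUP BY {column}"
```

For example, `query_cdm_parquet(spark, "/stage/cdm/5.4/person", count_by_sql("person", "gender_concept_id"), view_name="person").show()` would summarize a generated `person` table in place.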

The mapping from source to OHDSI consists of the following steps:

  1. Stage CSV files extracted from the EHR in a location accessible to the Spark cluster
  2. Map the staged CSV data to the PSF (Prepared Source Format) (see the Synthea example)
  3. Stage the OHDSI Vocabulary/Concept (TSV) files as Parquet files
  4. Map PSF to the OHDSI CDM in Parquet format (versions 5.4 and 5.3.1 are currently supported)
  5. Register the generated Parquet files in a database catalog (Delta tables), or insert them into a relational database.
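The steps above can be sketched in PySpark. Everything here is a hedged illustration, assuming an existing `SparkSession`; the paths, directory layout, and function names are hypothetical and not the repository's actual scripts:

```python
def stage_delimited_as_parquet(spark, src_path, dest_path, sep=","):
    """Steps 1 and 3: rewrite a delimited extract as Parquet.

    `spark` is an existing pyspark.sql.SparkSession.
    """
    # Pass sep="\t" for the OHDSI Vocabulary/Concept TSV files.
    df = spark.read.option("header", True).option("sep", sep).csv(src_path)
    df.write.mode("overwrite").parquet(dest_path)
    return dest_path


def cdm_table_path(base_dir, cdm_version, table):
    """Step 4: output location for one generated CDM table.

    The base_dir/cdm/<version>/<table> layout is an assumption.
    """
    return f"{base_dir}/cdm/{cdm_version}/{table}"


def register_delta_table(spark, table, path):
    """Step 5: register a generated table in the catalog as a Delta table.

    Requires a Spark environment with Delta Lake enabled.
    """
    spark.sql(f"CREATE TABLE IF NOT EXISTS {table} USING DELTA LOCATION '{path}'")
```

A run might then look like `stage_delimited_as_parquet(spark, "/stage/csv/encounters.csv", "/stage/parquet/encounters")` followed by `register_delta_table(spark, "person", cdm_table_path("/stage", "5.4", "person"))` once the CDM mapping has been produced.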

The mapping scripts write Parquet files in an OHDSI-"compatible" format. The generated Parquet files include additional fields that are not part of the OHDSI CDM; these extra columns allow tracking of the original data provenance.
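Because of these extra columns, loading the output into a strict CDM database generally means dropping the non-CDM fields first. A minimal sketch of that split; the column set below is an abbreviated subset of the CDM 5.4 `person` table, and any provenance column names are assumptions:

```python
# Abbreviated subset of the OHDSI CDM 5.4 `person` table columns.
CDM_PERSON_COLUMNS = {
    "person_id", "gender_concept_id", "year_of_birth", "month_of_birth",
    "day_of_birth", "birth_datetime", "race_concept_id",
    "ethnicity_concept_id", "person_source_value", "gender_source_value",
}


def split_cdm_and_provenance(columns, cdm_columns):
    """Partition a generated table's columns into strict-CDM fields and
    the extra provenance fields appended by the mapper."""
    cdm = [c for c in columns if c in cdm_columns]
    extra = [c for c in columns if c not in cdm_columns]
    return cdm, extra
```

With a Spark DataFrame `df`, `cdm, extra = split_cdm_and_provenance(df.columns, CDM_PERSON_COLUMNS)` and then `df.select(cdm)` would yield a strict-CDM projection suitable for insertion into a relational CDM database.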

A Docker build file is included to map Synthea data to PSF and to OHDSI; see: README.md.
