
import eCRF data (.xls, .xslx, SQLite, ...) #104

Closed
rkrenn opened this issue Mar 4, 2020 · 1 comment · Fixed by #171, phoenixctms/bulk-processor#9, phoenixctms/install-debian#12 or phoenixctms/config-default#6


rkrenn commented Mar 4, 2020

The bulk-processor supports exporting eCRF data in .xlsx and .csv format. It can also export to an SQLite ".db" file containing eCRF data, including metadata, in a verbose, denormalized schema ("vertical" format). eCRF data can also be exported in "horizontal" formats (single row per subject).

The request is to implement a multi-purpose importer that processes these various file formats to populate eCRFs.
This way clinical data can easily be transferred from instance to instance. It should also be possible to import files generated externally (e.g. lab data).

It is planned to trigger imports either from the command line or from the UI (the trial's "Jobs" tab).

@rkrenn rkrenn added the feature label Mar 4, 2020
@rkrenn rkrenn changed the title from SQLite importer to import eCRF data (.xls, .xslx, SQLite, ...) Apr 17, 2021

rkrenn commented May 11, 2021

Overview

The eCRF data importer can be launched from the command line ( phoenixctms/install-debian#12 )

phoenix@phoenix:~$ sudo -u ctsms /ctsms/ecrfdataimport.sh \
  --task=import_ecrf_data_horizontal \
  --file="/path/to/my/file.xlsx" \
  -id <trial-id>

or from the trial's "Jobs" tab.

[screenshot: "import eCRF data" job in the trial's "Jobs" tab]

Currently it supports loading the intuitive "horizontal" .csv/.xls/.xlsx formats ("single row per subject"). It works for partial data (e.g. populating only certain eCRF fields such as lab data) and is safe for repeated execution: it looks up the subject by alias or subject list attributes, and creates and registers a subject if it does not exist yet.
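The lookup-or-create behaviour described above can be sketched as follows. This is a minimal in-memory illustration; `Registry`, `find_by_alias` and `create_and_register` are invented names, not the real bulk-processor API:

```python
# Hypothetical sketch of the importer's idempotent subject handling:
# look up a subject by alias; create and register it only if missing.

class Registry:
    def __init__(self):
        self._subjects = {}  # alias -> subject record

    def find_by_alias(self, alias):
        return self._subjects.get(alias)

    def create_and_register(self, alias, attrs):
        subject = {"alias": alias, **attrs}
        self._subjects[alias] = subject
        return subject

    def lookup_or_create(self, alias, attrs):
        # Safe for repeated execution: a second run finds the
        # existing subject instead of creating a duplicate.
        subject = self.find_by_alias(alias)
        if subject is None:
            subject = self.create_and_register(alias, attrs)
        return subject

registry = Registry()
a = registry.lookup_or_create("SUBJ-001", {"lab_value": 4.2})
b = registry.lookup_or_create("SUBJ-001", {"lab_value": 4.2})
print(a is b)  # True: a repeated import reuses the same subject
```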

It is implemented as a BulkProcessor project ( phoenixctms/bulk-processor#9 ), i.e. a multithreaded Perl program connecting to the PhoenixCTMS web app via the REST API:

  • a single reader process fetching spreadsheet rows from the input file
  • multiple worker threads, each processing rows by saving eCRF values one-by-one or section-by-section
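The reader/worker split above can be sketched in miniature. This is a hedged illustration using Python threads and a queue in place of the actual Perl BulkProcessor implementation; all names are illustrative:

```python
import queue
import threading

# Rows as a single reader would fetch them from the input spreadsheet.
rows = [{"subject": f"SUBJ-{i:03d}", "value": i} for i in range(10)]

q = queue.Queue(maxsize=4)
saved = []
saved_lock = threading.Lock()
WORKERS = 3

def reader():
    # Single reader: feeds rows into the queue, then one sentinel per worker.
    for row in rows:
        q.put(row)
    for _ in range(WORKERS):
        q.put(None)

def worker():
    # Worker: drains rows and "saves" them (stands in for the REST API call).
    while True:
        row = q.get()
        if row is None:
            break
        with saved_lock:
            saved.append(row["subject"])

threads = [threading.Thread(target=reader)]
threads += [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(saved))  # 10 -- every row processed exactly once
```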

Disabled eCRFs/fields are skipped, as are locked/signed eCRFs. By default, errors (such as values exceeding configured range limits) are logged to the job output but do not abort the import. The importer mimics data entry by users and is protected from mutual interference in the same way (optimistic locking).
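The optimistic-locking protection mentioned above works roughly like the following sketch; the `Record` class and its `version` field are illustrative, not the actual schema:

```python
# Optimistic locking in miniature: each record carries a version; a save
# only succeeds if the version the writer originally read is still current.

class StaleVersionError(Exception):
    pass

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def save(self, new_value, read_version):
        if read_version != self.version:
            raise StaleVersionError("record changed since it was read")
        self.value = new_value
        self.version += 1

rec = Record("initial")
v = rec.version            # importer reads the record at version 0
rec.save("user edit", v)   # a concurrent user saves first -> version 1
try:
    rec.save("import", v)  # importer's save is rejected, not silently lost
except StaleVersionError:
    print("conflict detected")
```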

The importer supports configuration options similar to the eCRF exporter's, which can be specified in settings.yml files ( phoenixctms/config-default#6 ). For example, they allow tuning how column names are derived and abbreviated (unless fields have explicit "External ID"s), so eCRF data export and import by default work with the same dictionary of column names.

File Formats

Both the eCRF exporter and importer agree on the .csv file specification below:

  • field delimiter: ; (semicolon)
  • line delimiter: \r\n (CR+LF)
  • quote symbol: " (doublequotes)
  • escaped quote symbol: "" (2x double quotes)
  • unsaved values are denoted by an empty string without quotes
  • utf-8 character encoding
  • empty (but persisted) values (such as empty INTEGER, DECIMAL, DATE, ... ) are denoted by an empty string ""
  • timestamps/dates and time of day values require YYYY-MM-DD HH:MM:SS format (same as exported value)
    • the HH:MM:SS part is ignored for dates
    • time-of-day values require a 1970-01-01 date part
  • if a row's first element starts with "#", the row is treated as a file comment and skipped entirely
  • empty rows are skipped
  • the first non-empty row is considered the "header row", which must contain the column names matching eCRF fields; unknown (unmatched) columns are ignored
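
The .csv conventions above can be exercised with Python's standard csv module. A small sketch with made-up column names and values:

```python
import csv
import io

# A file following the conventions: semicolon delimiter, CRLF line
# endings, double-quote quoting, "" as escaped quote, "#" comment rows.
text = (
    '# file comment, whole row skipped\r\n'
    'subject;visit;lab_value\r\n'
    'SUBJ-001;V1;"quoted ""value"""\r\n'
    '\r\n'
    'SUBJ-002;V1;\r\n'
)

rows = []
for row in csv.reader(io.StringIO(text), delimiter=';', quotechar='"'):
    if not row:                 # skip empty rows
        continue
    if row[0].startswith('#'):  # skip comment rows
        continue
    rows.append(row)

header, data = rows[0], rows[1:]
print(header)   # ['subject', 'visit', 'lab_value']
print(data[1])  # ['SUBJ-002', 'V1', ''] -- empty persisted value
```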

Support for Excel date values and explicit cell formats is still pending in both importer and exporter. Be aware that Excel loves coercing strings to numbers, effectively cutting off leading zeroes, when using Excel file types!

Validation

Importing an exported .csv creates a perfect clone of the dataset, so .csv is recommended over Excel formats for exchanging data. This allows for a proof of correctness:

  1. create a trial X with a simple eCRF containing all field types, each with optional, disabled and series attributes. A sample eCRF setup can be found here 8104480.xls
  2. populate it with data, including unicode symbols. A sample .csv for import can be found here
    ecrf_horizontal_20210510171826.zip
  3. create another trial Y with the same eCRF setup as trial X (launch job "export eCRF setup" in trial X and then "import eCRF setup" in trial Y)
  4. export all trial X data (launch job "export eCRF data" in trial X)
  5. grab the .csv dataset from the "Files" tab of trial X, then import it in trial Y (launch job "import eCRF data")

Now database views can be prepared to obtain the datasets directly in SQL:

drop view if exists ecrf_data;
drop view if exists ecrf_data_version;
create view ecrf_data_version as 
select 
  le.trial_fk as trial_id, 
  le.proband_fk as proband_id, 
  e.name as ecrf_name, 
  vis.token as visit_token, 
  ef.id as ecrf_field_id, 
  ef.section as ecrf_field_section, 
  ef.position as ecrf_field_position, 
  a.index, 
  if.name_l10n_key as input_field_name, 
  v.boolean_value, 
  v.long_value, 
  v.float_value, 
  v.date_value, 
  v.timestamp_value, 
  v.string_value, 
  array_agg(
    sv.value 
    order by 
      sv.value
  ) as selection_set_values, 
  a.version 
from 
  ecrf_field_value a 
  left join visit vis on a.visit_fk = vis.id 
  join ecrf_field ef on a.ecrf_field_fk = ef.id 
  join ecrf e on ef.ecrf_fk = e.id 
  join input_field if on ef.field_fk = if.id 
  join input_field_value v on a.value_fk = v.id 
  join proband_list_entry le on a.list_entry_fk = le.id 
  left join input_field_value_selection ifs on ifs.input_field_values_fk = v.id 
  left join input_field_selection_set_value sv on ifs.selection_values_fk = sv.id 
group by 
  trial_id, 
  proband_id, 
  ecrf_name, 
  visit_token, 
  ecrf_field_id, 
  ecrf_field_section, 
  ecrf_field_position, 
  index, 
  input_field_name, 
  v.boolean_value, 
  v.long_value, 
  v.float_value, 
  v.date_value, 
  v.timestamp_value, 
  v.string_value, 
  a.version 
order by 
  trial_id, 
  proband_id, 
  ecrf_name, 
  visit_token, 
  ecrf_field_section, 
  ecrf_field_position, 
  index, 
  a.version;
create view ecrf_data as 
select 
  all_versions.trial_id, 
  all_versions.proband_id, 
  all_versions.ecrf_name, 
  ef.disabled as ecrf_field_disabled, 
  e.disabled as ecrf_disabled, 
  all_versions.visit_token, 
  all_versions.ecrf_field_section, 
  all_versions.ecrf_field_position, 
  all_versions.index, 
  all_versions.input_field_name, 
  all_versions.boolean_value, 
  all_versions.long_value, 
  all_versions.float_value, 
  all_versions.date_value, 
  all_versions.timestamp_value, 
  all_versions.string_value, 
  all_versions.selection_set_values 
from 
  (
    select 
      proband_id, 
      visit_token, 
      ecrf_field_id, 
      index, 
      max(version) as version 
    from 
      ecrf_data_version 
    group by 
      proband_id, 
      visit_token, 
      ecrf_field_id, 
      index
  ) max_version 
  join ecrf_data_version all_versions 
    on all_versions.proband_id = max_version.proband_id 
    and all_versions.visit_token is not distinct from max_version.visit_token 
    and all_versions.ecrf_field_id = max_version.ecrf_field_id 
    and all_versions.index is not distinct from max_version.index 
    and all_versions.version = max_version.version 
  join ecrf_field ef on all_versions.ecrf_field_id = ef.id 
  join ecrf e on ef.ecrf_fk = e.id 
order by 
  all_versions.trial_id, 
  all_versions.proband_id, 
  all_versions.ecrf_name, 
  all_versions.visit_token, 
  all_versions.ecrf_field_section, 
  all_versions.ecrf_field_position, 
  all_versions.index;

The datasets can then be compared deeply using SQL set operations: if they are exactly equal, both "trial X dataset EXCEPT trial Y dataset" (see below) and "trial Y dataset EXCEPT trial X dataset" must return empty results.

(
  select 
    ecrf_name, 
    visit_token, 
    ecrf_field_section, 
    ecrf_field_position, 
    index, 
    input_field_name, 
    boolean_value, 
    long_value, 
    float_value, 
    date_value, 
    timestamp_value, 
    string_value, 
    selection_set_values 
  from 
    ecrf_data 
  where 
    trial_id = <trial X id>
) 
except 
  (
    select 
      ecrf_name, 
      visit_token, 
      ecrf_field_section, 
      ecrf_field_position, 
      index, 
      input_field_name, 
      boolean_value, 
      long_value, 
      float_value, 
      date_value, 
      timestamp_value, 
      string_value, 
      selection_set_values 
    from 
      ecrf_data 
    where 
      trial_id = <trial Y id>
  );

For the given example eCRF setup, this will however report values of SKETCH input types that are not present in trial Y, which is expected since sketch data is currently not exported/imported:

+------------------------------+-------------------------+---------------------+--------------------+-----+----------------------+
| ecrf_name                    | ecrf_field_section      | ecrf_field_position | input_field_name   | ... | selection_set_values |
+------------------------------+-------------------------+---------------------+--------------------+-----+----------------------+
| 99. all field types OPTIONAL | 01 section "selections" |                   6 | import test sketch | ... | {NULL}               |
| 99. all field types          | 01 section "selections" |                   6 | import test sketch | ... | {"shoulder left"}    |
+------------------------------+-------------------------+---------------------+--------------------+-----+----------------------+
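
Apart from that known exception, the set-based equality argument can be illustrated in miniature with plain Python sets, where SQL's EXCEPT corresponds to set difference (the row tuples below are made up):

```python
# Two datasets are identical iff both set differences are empty.
trial_x = {("eCRF 1", "V1", "field A", "42"), ("eCRF 1", "V1", "field B", "ok")}
trial_y = {("eCRF 1", "V1", "field A", "42"), ("eCRF 1", "V1", "field B", "ok")}

print(trial_x - trial_y)  # set()  (trial X EXCEPT trial Y)
print(trial_y - trial_x)  # set()  (trial Y EXCEPT trial X)
print(trial_x - trial_y == set() and trial_y - trial_x == set())  # True
```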
