
import eCRF data (.xls, .xslx, SQLite, ...) #104

Closed
rkrenn opened this issue Mar 4, 2020 · 1 comment · Fixed by #171, phoenixctms/bulk-processor#9, phoenixctms/install-debian#12 or phoenixctms/config-default#6


rkrenn commented Mar 4, 2020

The bulk-processor supports exporting eCRF data in .xlsx and .csv format. It can also export to an SQLite ".db" file containing eCRF data, including metadata, in a verbose, denormalized schema ("vertical" format). eCRF data can also be exported in "horizontal" formats (single row per subject).

The request is to implement a multi-purpose importer that processes these various file formats to populate eCRFs.
This way clinical data can easily be transferred from instance to instance. It should also be possible to import files generated externally (e.g. lab data).

It is planned to trigger imports either from the command line or from the UI (the trial's "Jobs" tab).

@rkrenn rkrenn added the feature label Mar 4, 2020
@rkrenn rkrenn changed the title from SQLite importer to import eCRF data (.xls, .xslx, SQLite, ...) Apr 17, 2021

rkrenn commented May 11, 2021

Overview

The eCRF data importer can be launched from the command line ( phoenixctms/install-debian#12 )

phoenix@phoenix:~$ sudo -u ctsms /ctsms/ecrfdataimport.sh \
  --task=import_ecrf_data_horizontal \
  --file="/path/to/my/file.xlsx" \
  -id <trial-id>

or from the trial's "Jobs" tab.

[screenshot: "import eCRF data" job in the trial's "Jobs" tab]

Currently it supports loading the intuitive "horizontal" .csv/.xls/.xlsx formats ("single row per subject"). It works for partial data (e.g. populating only certain eCRF fields such as lab data) and is safe for repeated execution: it looks up the subject by alias or subject list attributes, and creates and registers a subject if it does not exist yet.
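The lookup-or-create behaviour described above can be sketched as follows. This is a minimal in-memory illustration; `Registry`, `find_by_alias` and `create_and_register` are invented names, not the real bulk-processor API:

```python
# Hypothetical sketch of the importer's idempotent subject handling:
# look up a subject by alias; create and register it only if missing.

class Registry:
    def __init__(self):
        self._subjects = {}  # alias -> subject record

    def find_by_alias(self, alias):
        return self._subjects.get(alias)

    def create_and_register(self, alias, attrs):
        subject = {"alias": alias, **attrs}
        self._subjects[alias] = subject
        return subject

    def lookup_or_create(self, alias, attrs):
        # Safe for repeated execution: a second run finds the
        # existing subject instead of creating a duplicate.
        subject = self.find_by_alias(alias)
        if subject is None:
            subject = self.create_and_register(alias, attrs)
        return subject

registry = Registry()
a = registry.lookup_or_create("SUBJ-001", {"lab_value": 4.2})
b = registry.lookup_or_create("SUBJ-001", {"lab_value": 4.2})
print(a is b)  # True: a repeated import reuses the same subject
```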

It is implemented as a BulkProcessor project ( phoenixctms/bulk-processor#9 ), i.e. a multithreaded Perl program connecting to the PhoenixCTMS web app via the REST API:

  • a single reader process fetching spreadsheet rows from the input file
  • multiple worker threads, each processing rows by saving eCRF values one-by-one or section-by-section
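The reader/worker split above can be sketched in miniature. This is a hedged illustration using Python threads and a queue in place of the actual Perl BulkProcessor implementation; all names are illustrative:

```python
import queue
import threading

# Rows as a single reader would fetch them from the input spreadsheet.
rows = [{"subject": f"SUBJ-{i:03d}", "value": i} for i in range(10)]

q = queue.Queue(maxsize=4)
saved = []
saved_lock = threading.Lock()
WORKERS = 3

def reader():
    # Single reader: feeds rows into the queue, then one sentinel per worker.
    for row in rows:
        q.put(row)
    for _ in range(WORKERS):
        q.put(None)

def worker():
    # Worker: drains rows and "saves" them (stands in for the REST API call).
    while True:
        row = q.get()
        if row is None:
            break
        with saved_lock:
            saved.append(row["subject"])

threads = [threading.Thread(target=reader)]
threads += [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(saved))  # 10 -- every row processed exactly once
```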

Disabled eCRFs/fields are skipped, as are locked/signed eCRFs. By default, errors (such as values exceeding configured range limits) are logged to the job output but do not abort the import. The importer mimics data entry by users and is protected from mutual interference in the same way (optimistic locking).
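The optimistic-locking protection mentioned above works roughly like the following sketch; the `Record` class and its `version` field are illustrative, not the actual schema:

```python
# Optimistic locking in miniature: each record carries a version; a save
# only succeeds if the version the writer originally read is still current.

class StaleVersionError(Exception):
    pass

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def save(self, new_value, read_version):
        if read_version != self.version:
            raise StaleVersionError("record changed since it was read")
        self.value = new_value
        self.version += 1

rec = Record("initial")
v = rec.version            # importer reads the record at version 0
rec.save("user edit", v)   # a concurrent user saves first -> version 1
try:
    rec.save("import", v)  # importer's save is rejected, not silently lost
except StaleVersionError:
    print("conflict detected")
```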

The importer supports configuration options similar to the eCRF exporter's, which can be specified in settings.yml files ( phoenixctms/config-default#6 ). For example, they allow tuning how column names are derived and abbreviated (unless fields have explicit "External ID"s), so eCRF data export and import by default work with the same dictionary of column names.

File Formats

Both the eCRF exporter and importer agree on the .csv file specification below:

  • field delimiter: ; (semicolon)
  • line delimiter: \r\n (CR+LF)
  • quote symbol: " (doublequotes)
  • escaped quote symbol: "" (2x double quotes)
  • unsaved values are denoted by an empty string without quotes
  • utf-8 character encoding
  • empty (but persisted) values (such as empty INTEGER, DECIMAL, DATE, ... ) are denoted by an empty string ""
  • timestamps/dates and time of day values require YYYY-MM-DD HH:MM:SS format (same as exported value)
    • the HH:MM:SS part is ignored for dates
    • time-of-day values require a 1970-01-01 date part
  • if a row's first element starts with "#", the row is treated as a file comment and skipped entirely
  • empty rows are skipped
  • the first non-empty row is considered the "header row", which must contain the column names matching eCRF fields; unknown (unmatched) columns are ignored
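
The .csv conventions above can be exercised with Python's standard csv module. A small sketch with made-up column names and values:

```python
import csv
import io

# A file following the conventions: semicolon delimiter, CRLF line
# endings, double-quote quoting, "" as escaped quote, "#" comment rows.
text = (
    '# file comment, whole row skipped\r\n'
    'subject;visit;lab_value\r\n'
    'SUBJ-001;V1;"quoted ""value"""\r\n'
    '\r\n'
    'SUBJ-002;V1;\r\n'
)

rows = []
for row in csv.reader(io.StringIO(text), delimiter=';', quotechar='"'):
    if not row:                 # skip empty rows
        continue
    if row[0].startswith('#'):  # skip comment rows
        continue
    rows.append(row)

header, data = rows[0], rows[1:]
print(header)   # ['subject', 'visit', 'lab_value']
print(data[1])  # ['SUBJ-002', 'V1', ''] -- empty persisted value
```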

Support for Excel date values and explicit cell formats is still pending in both importer and exporter. Be aware that Excel loves coercing strings to numbers, effectively cutting off leading zeroes, when using Excel file types!

Validation

Importing an exported .csv creates a perfect clone of the dataset, so .csv is recommended over Excel formats for exchanging data. This allows for a proof of correctness:

  1. create a trial X with a simple eCRF containing all field types, each with optional, disabled and series attributes. A sample eCRF setup can be found here 8104480.xls
  2. populate it with data, including unicode symbols. A sample .csv for import can be found here
    ecrf_horizontal_20210510171826.zip
  3. create another trial Y with the same eCRF setup as trial X (launch job "export eCRF setup" in trial X and then "import eCRF setup" in trial Y)
  4. export all trial X data (launch job "export eCRF data" in trial X)
  5. grab the .csv dataset from the "Files" tab of trial X, then import it in trial Y (launch job "import eCRF data")

Now database views can be prepared to obtain the datasets directly in SQL:

drop view if exists ecrf_data;
drop view if exists ecrf_data_version;
create view ecrf_data_version as 
select 
  le.trial_fk as trial_id, 
  le.proband_fk as proband_id, 
  e.name as ecrf_name, 
  vis.token as visit_token, 
  ef.id as ecrf_field_id, 
  ef.section as ecrf_field_section, 
  ef.position as ecrf_field_position, 
  a.index, 
  if.name_l10n_key as input_field_name, 
  v.boolean_value, 
  v.long_value, 
  v.float_value, 
  v.date_value, 
  v.timestamp_value, 
  v.string_value, 
  array_agg(
    sv.value 
    order by 
      sv.value
  ) as selection_set_values, 
  a.version 
from 
  ecrf_field_value a 
  left join visit vis on a.visit_fk = vis.id 
  join ecrf_field ef on a.ecrf_field_fk = ef.id 
  join ecrf e on ef.ecrf_fk = e.id 
  join input_field if on ef.field_fk = if.id 
  join input_field_value v on a.value_fk = v.id 
  join proband_list_entry le on a.list_entry_fk = le.id 
  left join input_field_value_selection ifs on ifs.input_field_values_fk = v.id 
  left join input_field_selection_set_value sv on ifs.selection_values_fk = sv.id 
group by 
  trial_id, 
  proband_id, 
  ecrf_name, 
  visit_token, 
  ecrf_field_id, 
  ecrf_field_section, 
  ecrf_field_position, 
  index, 
  input_field_name, 
  v.boolean_value, 
  v.long_value, 
  v.float_value, 
  v.date_value, 
  v.timestamp_value, 
  v.string_value, 
  a.version 
order by 
  trial_id, 
  proband_id, 
  ecrf_name, 
  visit_token, 
  ecrf_field_section, 
  ecrf_field_position, 
  index, 
  a.version;
create view ecrf_data as 
select 
  all_versions.trial_id, 
  all_versions.proband_id, 
  all_versions.ecrf_name, 
  ef.disabled as ecrf_field_disabled, 
  e.disabled as ecrf_disabled, 
  all_versions.visit_token, 
  all_versions.ecrf_field_section, 
  all_versions.ecrf_field_position, 
  all_versions.index, 
  all_versions.input_field_name, 
  all_versions.boolean_value, 
  all_versions.long_value, 
  all_versions.float_value, 
  all_versions.date_value, 
  all_versions.timestamp_value, 
  all_versions.string_value, 
  all_versions.selection_set_values 
from 
  (
    select 
      proband_id, 
      visit_token, 
      ecrf_field_id, 
      index, 
      max(version) as version 
    from 
      ecrf_data_version 
    group by 
      proband_id, 
      visit_token, 
      ecrf_field_id, 
      index
  ) max_version 
  join ecrf_data_version all_versions 
    on all_versions.proband_id = max_version.proband_id 
    and all_versions.visit_token is not distinct from max_version.visit_token 
    and all_versions.ecrf_field_id = max_version.ecrf_field_id 
    and all_versions.index is not distinct from max_version.index 
    and all_versions.version = max_version.version 
  join ecrf_field ef on all_versions.ecrf_field_id = ef.id 
  join ecrf e on ef.ecrf_fk = e.id 
order by 
  all_versions.trial_id, 
  all_versions.proband_id, 
  all_versions.ecrf_name, 
  all_versions.visit_token, 
  all_versions.ecrf_field_section, 
  all_versions.ecrf_field_position, 
  all_versions.index;

The datasets can then be compared deeply using SQL set operations: if they are exactly equal, both "trial X dataset EXCEPT trial Y dataset" (see below) and "trial Y dataset EXCEPT trial X dataset" must return empty results.

(
  select 
    ecrf_name, 
    visit_token, 
    ecrf_field_section, 
    ecrf_field_position, 
    index, 
    input_field_name, 
    boolean_value, 
    long_value, 
    float_value, 
    date_value, 
    timestamp_value, 
    string_value, 
    selection_set_values 
  from 
    ecrf_data 
  where 
    trial_id = <trial X id>
) 
except 
  (
    select 
      ecrf_name, 
      visit_token, 
      ecrf_field_section, 
      ecrf_field_position, 
      index, 
      input_field_name, 
      boolean_value, 
      long_value, 
      float_value, 
      date_value, 
      timestamp_value, 
      string_value, 
      selection_set_values 
    from 
      ecrf_data 
    where 
      trial_id = <trial Y id>
  );

For the given example eCRF setup, this will however report values of SKETCH input types that are not present in trial Y, which is expected since sketch data is currently not exported/imported:

+------------------------------+-------------------------+---------------------+--------------------+-----+----------------------+
| ecrf_name                    | ecrf_field_section      | ecrf_field_position | input_field_name   | ... | selection_set_values |
+------------------------------+-------------------------+---------------------+--------------------+-----+----------------------+
| 99. all field types OPTIONAL | 01 section "selections" |                   6 | import test sketch | ... | {NULL}               |
| 99. all field types          | 01 section "selections" |                   6 | import test sketch | ... | {"shoulder left"}    |
+------------------------------+-------------------------+---------------------+--------------------+-----+----------------------+
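
Apart from that known exception, the set-based equality argument can be illustrated in miniature with plain Python sets, where SQL's EXCEPT corresponds to set difference (the row tuples below are made up):

```python
# Two datasets are identical iff both set differences are empty.
trial_x = {("eCRF 1", "V1", "field A", "42"), ("eCRF 1", "V1", "field B", "ok")}
trial_y = {("eCRF 1", "V1", "field A", "42"), ("eCRF 1", "V1", "field B", "ok")}

print(trial_x - trial_y)  # set()  (trial X EXCEPT trial Y)
print(trial_y - trial_x)  # set()  (trial Y EXCEPT trial X)
print(trial_x - trial_y == set() and trial_y - trial_x == set())  # True
```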
