fully integrate the keep_demographic_info and date_format flags #249

Merged
merged 5 commits into dev
Jan 21, 2025

Conversation

jessicarowell
Collaborator

Description

  • Changed keep_personal_info / val_keep_pi flag to keep_demographic_info and fully implemented it
  • Changed val_date_format_flag to date_format_flag and fully implemented it
  • Metadata validation will now scrub potentially identifiable information (host sex, age, race, and ethnicity) from the metadata if keep_demographic_info is false.
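The scrubbing behavior described in the last bullet could look roughly like the sketch below. This is only an illustration, not the code in this PR: the function name `scrub_demographics`, the row-as-dict representation, and the `"Not Provided"` placeholder are all assumptions. Only the four column names and the `keep_demographic_info` flag come from this conversation.

```python
# Columns named in this PR as potentially identifiable.
DEMOGRAPHIC_COLUMNS = ("host_sex", "host_age", "race", "ethnicity")

# Assumption: the value written in place of scrubbed data is hypothetical.
PLACEHOLDER = "Not Provided"


def scrub_demographics(rows, keep_demographic_info=False):
    """Return copies of the metadata rows, blanking demographic columns
    unless keep_demographic_info is true. The input rows are not modified."""
    if keep_demographic_info:
        return [dict(row) for row in rows]

    scrubbed = []
    for row in rows:
        clean = dict(row)
        for column in DEMOGRAPHIC_COLUMNS:
            # Only touch columns that are actually present in the sheet.
            if column in clean:
                clean[column] = PLACEHOLDER
        scrubbed.append(clean)
    return scrubbed
```

Returning copies rather than editing in place keeps the original values available for logging which fields were scrubbed.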

Checklist

Go Through Checklist Below and Place A ✔️ (X Inside the Box) if Completed

General Checks

  • Have you run appropriate tests (unit/integration/end-to-end) to check logic across run environments (Conda/Docker/Singularity on Scicomp/AWS/NF Tower/Local)?
    SciComp only

    For each relevant configuration:

    • Can the program run completely through without erroring out?
    • Does it produce the expected outputs, given the inputs provided?
  • Have you conducted proper linting procedures?

    • Numpy formatted docstrings for functions
    • Comments explaining lines of code
    • Consistent and intuitive naming conventions for variables, functions, classes, methods, attributes, and scripts
    • Single empty line between class functions, two lines between non-class functions, and two lines between imports and code body
    • Camel case formatting for class names
  • Have you updated existing documentation (README.md, etc.) or created new ones within docs?

CDC Checks

  • Did you check for sensitive data, and remove any?
  • If you added or modified HTML, did you check that it was 508 compliant?

Are additional approvals needed for this change? If so, please mention them below:

Are there potential vulnerabilities or licensing issues with any new dependencies introduced? If so, please mention them below:

@jessicarowell jessicarowell added this to the v4.1.2 milestone Jan 11, 2025
@jessicarowell
Collaborator Author

For testing purposes, make sure to test these two things:

  1. Try adding/not adding info to host_sex, host_age, race, and ethnicity and toggle the keep_demographic_info param in nextflow.config. Check the resulting metadata tsv files and the logs and make sure they're correct.
  2. Try different collection_date formats and different options for the date_format param. Check the tsv and logs and make sure it's changing (or not changing) the date format according to the date_format param selection.
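For the second test, a small reference function can help predict what the tsv should contain. The sketch below is a guess at the behavior, not the pipeline's implementation. It assumes `s` means a short `YYYY-MM` date, `v` means a verbose `YYYY-MM-DD` date, and `o` leaves the original value untouched. Only the three allowed values `s`, `o`, and `v` are confirmed by this thread. The function name and the list of accepted input layouts are hypothetical.

```python
from datetime import datetime

# Assumption: input layouts this sketch understands, not the pipeline's list.
KNOWN_LAYOUTS = ("%Y-%m-%d", "%m/%d/%Y")

# Assumption: meaning of each flag value; only the values themselves are confirmed.
OUTPUT_LAYOUTS = {"s": "%Y-%m", "v": "%Y-%m-%d"}


def format_collection_date(value, date_format_flag="s"):
    """Reformat a collection_date string according to date_format_flag."""
    if date_format_flag not in ("s", "o", "v"):
        raise ValueError(f"date_format_flag must be 's', 'o', or 'v': {date_format_flag!r}")
    if date_format_flag == "o":
        # Original format requested: return the value unchanged.
        return value
    for layout in KNOWN_LAYOUTS:
        try:
            parsed = datetime.strptime(value.strip(), layout)
        except ValueError:
            continue  # Try the next known layout.
        return parsed.strftime(OUTPUT_LAYOUTS[date_format_flag])
    raise ValueError(f"Unrecognized collection_date: {value!r}")
```

Under these assumptions, `format_collection_date("03/15/2024", "s")` gives `2024-03`, and the same input with `"v"` gives `2024-03-15`.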

@RamiyapriyaS
Collaborator

Testing with a modified metadata file rsv_test_metadata.xlsx

Command:

nextflow run main.nf -profile test,singularity --species rsv --annotation --submission --output_dir sample_name_check --submission_config ./tostadas/conf/submission_config.yaml --meta_path ./assets/sample_metadata/rsv_test_metadata.xlsx

Error message:

executor >  local (2)
[5b/3939e5] process > TOSTADAS_WORKFLOW:TOSTADAS:VALIDATE_PARAMS                      [100%] 1 of 1, failed: 1 ✘
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:METADATA_VALIDATION                  -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:RUN_VADR:VADR_TRIM                   -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:RUN_VADR:VADR_ANNOTATION             -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:RUN_VADR:VADR_POST_CLEANUP           -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:GET_WAIT_TIME                        -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:INITIAL_SUBMISSION:SUBMISSION        -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:INITIAL_SUBMISSION:WAIT              -
[-        ] process > TOSTADAS_WORKFLOW:TOSTADAS:INITIAL_SUBMISSION:UPDATE_SUBMISSION -
WARN: Access to undefined parameter `val_keep_pi` -- Initialise it to a default value eg. `params.val_keep_pi = some_value`
WARN: Access to undefined parameter `val_date_format_flag` -- Initialise it to a default value eg. `params.val_date_format_flag = some_value`
ERROR ~ Error executing process > 'TOSTADAS_WORKFLOW:TOSTADAS:VALIDATE_PARAMS'

Caused by:
  assert params.val_date_format_flag == 's' || params.val_date_format_flag == 'o' || params.val_date_format_flag == 'v'
       |      |                           |  |      |                           |  |      |
       |      null                        |  |      null                        |  |      null
       |                                  |  |                                  |  ['schema':'nextflow_schema.json', 'validate_params':true, 'ref_fasta_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/Human_orthopneumovirus_NC_001781.fasta', 'meta_path':'/scicomp/home-pure/rjd0/tostadas/assets/sample_metadata/rsv_test_metadata.xlsx', 'ref_gff_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/ref.MPXV.NC063383.v7.gff', 'date_format_flag':'s', 'keep_demographic_info':false, 'validate_custom_fields':false, 'custom_fields_file':'/scicomp/home-pure/rjd0/tostadas/assets/custom_meta_fields/example_custom_fields.json', 'annotation':true, 'repeatmasker_liftoff':true, 'vadr':true, 'bakta':false, 'species':'rsv', 'submission':true, 'output_dir':'sample_name_check', 'submission_config':'/scicomp/home-pure/rjd0/tostadas/conf/submission_config.yaml', 'repeat_library':'/scicomp/home-pure/rjd0/tostadas/assets/lib/MPOX_repeats_lib.fasta', 'genbank':true, 'sra':true, 'gisaid':false, 'biosample':true, 'submission_mode':'ftp', 'submission_output_dir':'submission_outputs', 'submission_wait_time':380, 'submission_prod_or_test':'test', 'send_submission_email':false, 'update_submission':false, 'help':false, 'publish_dir_mode':'copy', 'bakta_output_dir':'bakta_outputs', 'vadr_output_dir':'vadr_outputs', 'final_liftoff_output_dir':'liftoff_outputs', 'val_output_dir':'validation_outputs', 'vadr_models_dir':'/scicomp/home-pure/rjd0/tostadas/vadr_files/rsv-models', 'env_yml':'/scicomp/home-pure/rjd0/tostadas/environment.yml', 'enable_conda':false, 'repeatmasker_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/repeatmasker_env.yml', 'vadr_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/vadr_env.yml', 'cleanup':false, 'clear_nextflow_log':false, 'clear_work_dir':false, 'clear_conda_env':false, 'clear_nf_results':false, 'overwrite_output':true, 'bakta_db_type':'light', 'download_bakta_db':false, 'bakta_db_path':'', 'bakta_min_contig_length':200, 'bakta_threads':2, 
'bakta_gram':'?', 'bakta_genus':'Genus', 'bakta_species':'species', 'bakta_strain':'strain', 'bakta_plasmid':'unnamed', 'bakta_locus':'contig', 'bakta_locus_tag':'LOCUSTAG123', 'bakta_translation_table':11, 'bakta_complete':'', 'bakta_keep_contig_headers':'', 'bakta_replicons':'', 'bakta_proteins':'', 'bakta_skip_trna':'', 'bakta_skip_tmrna':'', 'bakta_skip_rrna':'', 'bakta_skip_ncrna':'', 'bakta_skip_ncrna_region':'', 'bakta_skip_crispr':'', 'bakta_skip_cds':'', 'bakta_skip_pseudo':'', 'bakta_skip_sorf':'', 'bakta_skip_gap':'', 'bakta_skip_ori':'', 'bakta_compliant':true, 'bakta_skip_plot':true, 'lift_print_version_exit':false, 'lift_print_help_exit':false, 'lift_parallel_processes':8, 'lift_coverage_threshold':0.5, 'lift_child_feature_align_threshold':0.5, 'lift_unmapped_features_file_name':'output.unmapped_features.txt', 'lift_copy_threshold':1.0, 'lift_distance_scaling_factor':2.0, 'lift_flank':0.0, 'lift_overlap':0.1, 'lift_mismatch':2, 'lift_gap_open':2, 'lift_gap_extend':1, 'lift_minimap_path':'N/A', 'lift_feature_database_name':'N/A', 'lift_feature_types':'/scicomp/home-pure/rjd0/tostadas/assets/feature_types.txt', 'processed_samples':'/scicomp/home-pure/rjd0/tostadas/test_output/submission_outputs']
       |                                  |  |                                  false
       |                                  |  ['schema':'nextflow_schema.json', 'validate_params':true, 'ref_fasta_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/Human_orthopneumovirus_NC_001781.fasta', 'meta_path':'/scicomp/home-pure/rjd0/tostadas/assets/sample_metadata/rsv_test_metadata.xlsx', 'ref_gff_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/ref.MPXV.NC063383.v7.gff', 'date_format_flag':'s', 'keep_demographic_info':false, 'validate_custom_fields':false, 'custom_fields_file':'/scicomp/home-pure/rjd0/tostadas/assets/custom_meta_fields/example_custom_fields.json', 'annotation':true, 'repeatmasker_liftoff':true, 'vadr':true, 'bakta':false, 'species':'rsv', 'submission':true, 'output_dir':'sample_name_check', 'submission_config':'/scicomp/home-pure/rjd0/tostadas/conf/submission_config.yaml', 'repeat_library':'/scicomp/home-pure/rjd0/tostadas/assets/lib/MPOX_repeats_lib.fasta', 'genbank':true, 'sra':true, 'gisaid':false, 'biosample':true, 'submission_mode':'ftp', 'submission_output_dir':'submission_outputs', 'submission_wait_time':380, 'submission_prod_or_test':'test', 'send_submission_email':false, 'update_submission':false, 'help':false, 'publish_dir_mode':'copy', 'bakta_output_dir':'bakta_outputs', 'vadr_output_dir':'vadr_outputs', 'final_liftoff_output_dir':'liftoff_outputs', 'val_output_dir':'validation_outputs', 'vadr_models_dir':'/scicomp/home-pure/rjd0/tostadas/vadr_files/rsv-models', 'env_yml':'/scicomp/home-pure/rjd0/tostadas/environment.yml', 'enable_conda':false, 'repeatmasker_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/repeatmasker_env.yml', 'vadr_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/vadr_env.yml', 'cleanup':false, 'clear_nextflow_log':false, 'clear_work_dir':false, 'clear_conda_env':false, 'clear_nf_results':false, 'overwrite_output':true, 'bakta_db_type':'light', 'download_bakta_db':false, 'bakta_db_path':'', 'bakta_min_contig_length':200, 'bakta_threads':2, 'bakta_gram':'?', 'bakta_genus':'Genus', 
'bakta_species':'species', 'bakta_strain':'strain', 'bakta_plasmid':'unnamed', 'bakta_locus':'contig', 'bakta_locus_tag':'LOCUSTAG123', 'bakta_translation_table':11, 'bakta_complete':'', 'bakta_keep_contig_headers':'', 'bakta_replicons':'', 'bakta_proteins':'', 'bakta_skip_trna':'', 'bakta_skip_tmrna':'', 'bakta_skip_rrna':'', 'bakta_skip_ncrna':'', 'bakta_skip_ncrna_region':'', 'bakta_skip_crispr':'', 'bakta_skip_cds':'', 'bakta_skip_pseudo':'', 'bakta_skip_sorf':'', 'bakta_skip_gap':'', 'bakta_skip_ori':'', 'bakta_compliant':true, 'bakta_skip_plot':true, 'lift_print_version_exit':false, 'lift_print_help_exit':false, 'lift_parallel_processes':8, 'lift_coverage_threshold':0.5, 'lift_child_feature_align_threshold':0.5, 'lift_unmapped_features_file_name':'output.unmapped_features.txt', 'lift_copy_threshold':1.0, 'lift_distance_scaling_factor':2.0, 'lift_flank':0.0, 'lift_overlap':0.1, 'lift_mismatch':2, 'lift_gap_open':2, 'lift_gap_extend':1, 'lift_minimap_path':'N/A', 'lift_feature_database_name':'N/A', 'lift_feature_types':'/scicomp/home-pure/rjd0/tostadas/assets/feature_types.txt', 'processed_samples':'/scicomp/home-pure/rjd0/tostadas/test_output/submission_outputs']
       |                                  false
       ['schema':'nextflow_schema.json', 'validate_params':true, 'ref_fasta_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/Human_orthopneumovirus_NC_001781.fasta', 'meta_path':'/scicomp/home-pure/rjd0/tostadas/assets/sample_metadata/rsv_test_metadata.xlsx', 'ref_gff_path':'/scicomp/home-pure/rjd0/tostadas/assets/ref/ref.MPXV.NC063383.v7.gff', 'date_format_flag':'s', 'keep_demographic_info':false, 'validate_custom_fields':false, 'custom_fields_file':'/scicomp/home-pure/rjd0/tostadas/assets/custom_meta_fields/example_custom_fields.json', 'annotation':true, 'repeatmasker_liftoff':true, 'vadr':true, 'bakta':false, 'species':'rsv', 'submission':true, 'output_dir':'sample_name_check', 'submission_config':'/scicomp/home-pure/rjd0/tostadas/conf/submission_config.yaml', 'repeat_library':'/scicomp/home-pure/rjd0/tostadas/assets/lib/MPOX_repeats_lib.fasta', 'genbank':true, 'sra':true, 'gisaid':false, 'biosample':true, 'submission_mode':'ftp', 'submission_output_dir':'submission_outputs', 'submission_wait_time':380, 'submission_prod_or_test':'test', 'send_submission_email':false, 'update_submission':false, 'help':false, 'publish_dir_mode':'copy', 'bakta_output_dir':'bakta_outputs', 'vadr_output_dir':'vadr_outputs', 'final_liftoff_output_dir':'liftoff_outputs', 'val_output_dir':'validation_outputs', 'vadr_models_dir':'/scicomp/home-pure/rjd0/tostadas/vadr_files/rsv-models', 'env_yml':'/scicomp/home-pure/rjd0/tostadas/environment.yml', 'enable_conda':false, 'repeatmasker_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/repeatmasker_env.yml', 'vadr_env_yml':'/scicomp/home-pure/rjd0/tostadas/environments/vadr_env.yml', 'cleanup':false, 'clear_nextflow_log':false, 'clear_work_dir':false, 'clear_conda_env':false, 'clear_nf_results':false, 'overwrite_output':true, 'bakta_db_type':'light', 'download_bakta_db':false, 'bakta_db_path':'', 'bakta_min_contig_length':200, 'bakta_threads':2, 'bakta_gram':'?', 'bakta_genus':'Genus', 'bakta_species':'species', 
'bakta_strain':'strain', 'bakta_plasmid':'unnamed', 'bakta_locus':'contig', 'bakta_locus_tag':'LOCUSTAG123', 'bakta_translation_table':11, 'bakta_complete':'', 'bakta_keep_contig_headers':'', 'bakta_replicons':'', 'bakta_proteins':'', 'bakta_skip_trna':'', 'bakta_skip_tmrna':'', 'bakta_skip_rrna':'', 'bakta_skip_ncrna':'', 'bakta_skip_ncrna_region':'', 'bakta_skip_crispr':'', 'bakta_skip_cds':'', 'bakta_skip_pseudo':'', 'bakta_skip_sorf':'', 'bakta_skip_gap':'', 'bakta_skip_ori':'', 'bakta_compliant':true, 'bakta_skip_plot':true, 'lift_print_version_exit':false, 'lift_print_help_exit':false, 'lift_parallel_processes':8, 'lift_coverage_threshold':0.5, 'lift_child_feature_align_threshold':0.5, 'lift_unmapped_features_file_name':'output.unmapped_features.txt', 'lift_copy_threshold':1.0, 'lift_distance_scaling_factor':2.0, 'lift_flank':0.0, 'lift_overlap':0.1, 'lift_mismatch':2, 'lift_gap_open':2, 'lift_gap_extend':1, 'lift_minimap_path':'N/A', 'lift_feature_database_name':'N/A', 'lift_feature_types':'/scicomp/home-pure/rjd0/tostadas/assets/feature_types.txt', 'processed_samples':'/scicomp/home-pure/rjd0/tostadas/test_output/submission_outputs'] -- Check script './workflows/../modules/local/general_util/validate_params/main.nf' at line: 126

Source block:
  assert params.meta_path
  if ( params.annotation ) {
              if ( params.repeatmasker_liftoff ) {
                  assert params.ref_fasta_path
                  assert params.ref_fasta_path
                  assert params.ref_gff_path
                  assert params.repeat_library
              }
              if ( params.vadr ) {
                  assert params.vadr_models_dir
              }
              if ( params.bakta ) {
                  if ( !params.download_bakta_db ) {
                      assert params.bakta_db_path
                  }
              }
          }
  if ( params.repeatmasker_liftoff == true ) {
              // Check whether populated or not 
              assert params.lift_parallel_processes == 0 || params.lift_parallel_processes
              assert params.lift_mismatch
              assert params.lift_gap_open
              assert params.lift_gap_extend 
              assert params.lift_print_version_exit == true || params.lift_print_version_exit == false
              assert params.lift_print_help_exit == true || params.lift_print_help_exit == false
  
              // Check data types 
              expected_liftoff_strings = [
                  "lift_minimap_path": params.lift_minimap_path,
                  "lift_feature_database_name": params.lift_feature_database_name  
              ]
  
              expected_liftoff_integers = [
                  "lift_parallel_processes" : params.lift_parallel_processes,
                  "lift_mismatch": params.lift_mismatch,
                  "lift_gap_open": params.lift_gap_open,
                  "lift_gap_extend": params.lift_gap_extend
              ]
  
              expected_liftoff_floats = [
                  "lift_coverage_threshold": params.lift_coverage_threshold,
                  "lift_child_feature_align_threshold": params.lift_child_feature_align_threshold,
                  "lift_copy_threshold": params.lift_copy_threshold,
                  "lift_distance_scaling_factor": params.lift_distance_scaling_factor,
                  "lift_flank": params.lift_flank,
                  "lift_overlap": params.lift_overlap
              ]
  
              expected_liftoff_strings.each { key, value ->
                  if ( expected_liftoff_strings[key] instanceof String == false ) {
                      throw new Exception("Value must be of string type: $value used for $key parameter")
                  }
              }
  
              expected_liftoff_integers.each { key, value ->
                  if ( expected_liftoff_integers[key] instanceof Integer == false ) {
                      throw new Exception("Value must be of integer type: $value used for $key parameter")
                  }
              }
  
              expected_liftoff_floats.each { key, value ->
                  if ( expected_liftoff_floats[key] instanceof Integer == true || expected_liftoff_floats[key] instanceof String == true ) {
                      throw new Exception("Value must be of float type and not integer or string: $value used for $key parameter")
                  }
              } 
          }
  if ( params.bakta == true ) {
              assert params.meta_path
              assert params.bakta_min_contig_length
              assert params.bakta_translation_table
              assert params.bakta_genus
              assert params.bakta_species
              assert params.bakta_strain
              assert params.bakta_plasmid
              assert params.bakta_locus
              assert params.bakta_locus_tag
          }
  assert params.clear_nextflow_log == true || params.clear_nextflow_log == false
  assert params.clear_work_dir == true || params.clear_work_dir == false
  assert params.submission == true || params.submission == false
  assert params.cleanup == true || params.cleanup == false
  assert params.overwrite_output == true || params.overwrite_output == false
  assert params.val_date_format_flag == 's' || params.val_date_format_flag == 'o' || params.val_date_format_flag == 'v'
  assert params.val_keep_pi == true || params.val_keep_pi == false
  expected_strings = [
              "ref_fasta_path": params.ref_fasta_path,
              "ref_gff_path": params.ref_gff_path,
              "meta_path": params.meta_path,
              "output_dir": params.output_dir,   
          ]
  expected_strings.each { key, value ->
              if (!(value instanceof String || value instanceof org.codehaus.groovy.runtime.GStringImpl)) {
                  throw new Exception("Value must be of string type: $value used for $key parameter")
              }
          }

Work dir:
  /scicomp/scratch/rjd0/nextflow/work/5b/3939e5fd09ec728c3927ec2199f782

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
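
Reading the trace, the assertion appears to fail because the check still reads the old names `val_date_format_flag` and `val_keep_pi` (both null), while the params map holds `date_format_flag: 's'` and `keep_demographic_info: false`. A minimal Python sketch of a check keyed on the new names follows. It is an illustration only and not the Groovy in `validate_params/main.nf`. The function name and the error wording are assumptions.

```python
# Old-to-new parameter names, as listed in the PR description.
RENAMED_PARAMS = {
    "val_date_format_flag": "date_format_flag",
    "val_keep_pi": "keep_demographic_info",
}


def validate_flags(params):
    """Return a list of problems found in the two renamed flags."""
    errors = []

    # Flag any leftover use of the legacy names.
    for old_name, new_name in RENAMED_PARAMS.items():
        if old_name in params:
            errors.append(f"'{old_name}' was renamed to '{new_name}'")

    # Same allowed values as the failing assert, but on the new name.
    if params.get("date_format_flag") not in ("s", "o", "v"):
        errors.append("date_format_flag must be 's', 'o', or 'v'")

    # The demographic flag must be a real boolean, not a missing value.
    if not isinstance(params.get("keep_demographic_info"), bool):
        errors.append("keep_demographic_info must be true or false")

    return errors
```

Called with the two values shown in the trace (`date_format_flag: 's'`, `keep_demographic_info: false`), this sketch returns an empty list.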

…atest nf-schema and param validation changes
@jessicarowell
Collaborator Author

Ok I think I've fixed it by grabbing the latest changes from dev

@jessicarowell
Collaborator Author

Just changed keep_demographic_info to remove_demographic_info based on our conversations - I agree the former was confusing. This version also follows best practices.

@jessicarowell
Collaborator Author

Date function is fixed!

@jessicarowell jessicarowell modified the milestones: v4.1.2, v4.1.3, v4.1.4 Jan 18, 2025
@RamiyapriyaS RamiyapriyaS merged commit a614b35 into dev Jan 21, 2025
@jessicarowell jessicarowell deleted the ick4-patch-enable-metadata-val-flags branch February 20, 2025 16:25
Development

Successfully merging this pull request may close these issues.

[Internal] [Bug] Enable date correction and demographic info scrubbing params
2 participants