Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Rmorin dev #21

Merged
merged 6 commits into from
Jul 13, 2021
Merged

Rmorin dev #21

merged 6 commits into from
Jul 13, 2021

Conversation

rdmorin
Copy link
Collaborator

@rdmorin rdmorin commented Jul 5, 2021

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for all PRs

Required

  • I tested the new code for my use case (please provide a reproducible example of how you tested the new functionality)

  • I ensured all dplyr functions that commonly conflict with other packages are fully qualified.

This can be checked and addressed by running check_functions.pl and responding to the prompts. Test your code after you do this.

  • I generated the documentation and checked for errors relating to the new function (e.g. devtools::document()) and added NAMESPACE and all other modified files in the root directory and under man.

Optional but preferred with PRs

  • I updated and/or successfully knitted a vignette that relies on the modified code (which ones?)

Checklist for New Functions

Required

  • I documented my function using ROxygen style.)

  • All parameters for the function are described in the documentation and the function has a decriptive title.

Example:

#' Use GISTIC2.0 scores output to reproduce maftools::chromoplot with more flexibility
#'
#' @param scores output file scores.gistic from the run of GISTIC2.0
#' @param genes_to_label optional. Provide a data frame of genes to label (if mutated). The first 3 columns must contain chromosome, start, and end coordinates. Another required column must contain gene names and be named `gene`. (truncated for example)
#' @param cutoff optional. Used to determine which regions to color as aberrant. Must be float in the range [0-1]. (truncated for example)
  • My function uses a library that isn't already a dependency of GAMBLR and I made the package aware of this dependency using the function documentation import statment.

Example:

#' @return nothing
#' @export
#' @import tidyverse ggrepel

Checklist for changes to existing code

  • I added/removed arguments to a function and updated documentation for all changed/new arguments

  • I tested the new code for compatability with existing functionality in the Master branch (please provide a reprex of how you tested the original functionality)

This was tested by Brett for the new functionality. the from_flatfile options work for metadata and coding ssms.

@rdmorin rdmorin requested review from bcollinge and Kdreval and removed request for bcollinge and Kdreval July 5, 2021 21:41
@@ -176,12 +195,12 @@ get_gambl_metadata = function(seq_type_filter = "genome",
TRUE ~ 50
))
if(with_outcomes){
outcome_table = get_gambl_outcomes() %>% dplyr::select(-sex)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like the outcome_table is dropped here, but below (line 199) used in left_join. Is there other place the outcomes are obtained from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this line to the top of the function. It's always called but not always joined. I changed the default to always join it because I see no reason why not to. Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found it now, see it - line 28. Missed it at first

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to always join with outcomes by default, so all are returned at once 👍

R/database.R Outdated
db=config::get("database_name")
con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname = db)
all_outcome = dplyr::tbl(con,"outcome_metadata") %>% as.data.frame()
get_gambl_outcomes = function(patient_ids,time_unit="year",censor_cbioportal=FALSE,complete_missing=FALSE,from_flatfile=FALSE){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default for from_flatfile here is set to FALSE, but in the get_metadata default is TRUE. Do you want to default it to TRUE here as well, so the default values are consistent between functions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't really matter because the only call to get_gambl_outcomes in the code is part of get_gambl_metadata and it passes the desired value of this variable along with it. In the long run I'd actually like to move away from using the database for these tables so I'll make the default false.

con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname = db)
get_coding_ssm = function(limit_cohort,exclude_cohort,
limit_pathology,limit_samples,basic_columns=TRUE,
from_flatfile=FALSE,groups=c("gambl","icgc_dart")){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I think might be helpful to harmonize the defaults with other functions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly disagree. Easier to explain in a conversation. Keeping the metadata in the database has proven to be very problematic due to the ongoing change in the order and number of columns. I'd like to always use the database for all functions except for metadata so the defaults shouldn't be harmonized.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yes, the metadata is a subject to constant updates. I got it now that defaults for flat files should be different here

biopsy_meta = dplyr::tbl(con,"biopsy_metadata") %>% dplyr::select(-patient_id) %>% dplyr::select(-pathology) %>% dplyr::select(-time_point) %>% dplyr::select(-EBV_status_inf) #drop duplicated columns

biopsy_meta = biopsy_meta %>% dplyr::select(-patient_id) %>% dplyr::select(-pathology) %>% dplyr::select(-time_point) %>% dplyr::select(-EBV_status_inf) #drop duplicated columns

all_meta = dplyr::left_join(sample_meta,biopsy_meta,by="biopsy_id") %>% as.data.frame()
all_meta = all_meta %>% mutate(bcl2_ba=ifelse(bcl2_ba=="POS_BCC","POS",bcl2_ba))
Copy link

@bcollinge bcollinge Jul 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth cleaning up all the FISH columns here?

all_meta = all_meta %>% mutate_at(vars(starts_with(c("myc", "bcl2", "bcl6")) & ends_with(c("_ba", "_cn"))), 
          ~str_remove(., "_.*") %>% str_replace(., "^NORM$|^NOMR$", "NORMAL")) 

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point but I think it should actually be done in GAMBL rather than GAMBLR so the metadata is clean before this package sees it. I suggest discussing how to tackle this with @lkhilton

@rdmorin rdmorin requested a review from lkhilton July 13, 2021 02:46
@rdmorin rdmorin merged commit 438850f into master Jul 13, 2021
@rdmorin rdmorin deleted the rmorin-dev branch July 13, 2021 02:47
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants