Issue/237 #241

Open: wants to merge 53 commits into base: dev
Conversation

NickEdwards7502 (Collaborator)

Major issues and features addressed in this update

  • VariantSpark's python wrapper has been refactored to create Random Forest models from a standalone class

    • Previously, in the non-hail VariantSpark release, the model was initialised and trained from the context of importance analyses. This did not seem appropriate for supporting future releases
    • A new scala function was created to return a trained RandomForest model without using hail
    • Files updated/created
      • python/varspark/rfmodel.py
      • python/varspark/core.py
      • python/varspark/__init__.py
      • src/main/scala/au/csiro/variantspark/api/GetRFModel.scala
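The shape of this refactor can be illustrated with a minimal, self-contained sketch (class and method names here are illustrative, not the actual VariantSpark API): the random forest is a standalone class that is fitted explicitly, and the importance analysis is obtained *from* the fitted model, rather than the model being created and trained inside the analysis.

```python
class RandomForestModel:
    """Illustrative sketch of a standalone RF model class."""

    def __init__(self, n_trees=100, seed=None):
        self.n_trees = n_trees
        self.seed = seed
        self._fitted = False

    def fit(self, features, labels):
        # In the real wrapper, training delegates to the Scala
        # RandomForest via the JVM gateway; here we only record
        # that fitting happened.
        self._fitted = True
        return self

    def importance_analysis(self):
        # Analyses are now derived from a fitted model, not the
        # other way round.
        if not self._fitted:
            raise RuntimeError("fit the model before requesting importances")
        return ImportanceAnalysis(self)


class ImportanceAnalysis:
    def __init__(self, model):
        self.model = model
```

This inverts the old dependency: previously the analysis object owned model construction; now the model is the entry point and analyses hang off it.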
  • A non-hail export model function was created

    • This function now processes trees in batches to remediate OOM errors with very large models
    • Files created
      • src/main/scala/au/csiro/variantspark/api/ExportModel.scala
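The batching idea behind the OOM fix can be sketched in a few lines of pure Python (illustrative names; the real implementation lives in ExportModel.scala): trees are streamed to the output file in fixed-size batches, so the full forest is never collected into a single in-memory structure.

```python
import json

def export_forest_batched(trees, path, batch_size=100):
    """Stream a forest to a JSON array in fixed-size batches so the
    whole model never has to be materialised in memory at once."""
    with open(path, "w") as out:
        out.write("[")
        first = True
        for start in range(0, len(trees), batch_size):
            # Only one batch of serialised trees is in memory at a time.
            for tree in trees[start:start + batch_size]:
                if not first:
                    out.write(",")
                out.write(json.dumps(tree))
                first = False
        out.write("]")
```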
  • The FeatureSource class, which provides wrapper functionalities for initialising genotype data for model training, has been moved to a standalone class

    • For better separation of concerns, this class is now imported to the core python wrapper
    • head(nrows, ncols) allows the first n rows and columns to be viewed as a pandas DataFrame
    • Files updated/created
      • python/varspark/featuresource.py
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/input/FeatureSource.scala
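A minimal sketch of the `head(nrows, ncols)` behaviour, assuming features (variants) as rows and samples as columns (the class shape and field names here are illustrative, not the actual wrapper):

```python
import pandas as pd

class FeatureSource:
    """Illustrative standalone feature-source wrapper."""

    def __init__(self, features, sample_names):
        # features: mapping of feature label -> per-sample values
        self.features = features
        self.sample_names = sample_names

    def head(self, nrows=10, ncols=10):
        """Return the first nrows features x ncols samples as a
        pandas DataFrame for quick inspection."""
        labels = list(self.features)[:nrows]
        data = {
            s: [self.features[l][i] for l in labels]
            for i, s in enumerate(self.sample_names[:ncols])
        }
        return pd.DataFrame(data, index=labels)
```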
  • Covariate support was extended

    • Covariate sources can be created from transposed or non-transposed .csv or .txt files, with optional per-feature type specification
    • From there, covariates can be unioned with a genotype FeatureSource and passed to the model as training data
    • Since Covariates are initialised from the same FeatureSource wrapper class and are also of type RDD[Feature], they also support head()
    • Note that feature and covariate sources can be unioned multiple times
    • Local FDR was updated to remove non-genotype information from manhattan plotting
    • Files updated/created
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/api/VSContext.scala
      • src/main/scala/au/csiro/variantspark/input/CsvStdFeatureSource.scala
      • src/main/scala/au/csiro/variantspark/input/UnionedFeatureSource.scala
      • python/varspark/lfdrvsnohail.py
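Conceptually, the union is just the combined feature set of the two sources, and because the result is itself a feature source, unions can be applied repeatedly. A minimal sketch under the assumption that both sources map feature labels to aligned per-sample values (illustrative representation, not the actual RDD[Feature] types):

```python
def union_sources(base, extra):
    """Union two feature sources represented as mappings from feature
    label to per-sample values; labels must be disjoint and samples
    are assumed to be aligned."""
    if set(base) & set(extra):
        raise ValueError("feature labels must be disjoint")
    merged = dict(base)
    merged.update(extra)
    return merged
```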
  • Importance analyses were moved to a standalone python wrapper class

    • Importance analyses are now created from the context of a random forest model
    • Functionality remains largely the same, with a few changes
      • Both important_variables() and variable_importance() are now returned as pandas DataFrames
      • Split counts are now included in the DataFrame returned by variable_importance() (required for Local FDR calculations)
      • Optional parameter precision supports rounding for variable_importance()
      • Optional parameter normalized indicates whether to normalise importances for both functions
    • Files updated/created
      • python/varspark/importanceanalysis.py
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/api/ImportanceAnalysis.scala
      • src/main/scala/au/csiro/variantspark/api/AnalyticsFunctions.scala
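The DataFrame shape and the two optional parameters can be sketched as follows (function and column names are illustrative; the real computation happens in the Scala ImportanceAnalysis):

```python
import pandas as pd

def variable_importance(raw_importances, split_counts,
                        normalized=True, precision=None):
    """Assemble per-variable importances, plus the split counts needed
    for local FDR, into a pandas DataFrame; optionally normalise and
    round the importances."""
    variables = list(raw_importances)
    df = pd.DataFrame({
        "variable": variables,
        "importance": [raw_importances[v] for v in variables],
        "splitCount": [split_counts[v] for v in variables],
    })
    if normalized:
        df["importance"] = df["importance"] / df["importance"].sum()
    if precision is not None:
        df["importance"] = df["importance"].round(precision)
    return df
```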
  • Move lfdr file to non-hail python directory

    • Created function for manhattan plotting lfdr derived p-values
    • Files removed/created
      • python/varspark/hail/lfdrvs.py
      • python/varspark/lfdrvs.py
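The transform underlying the manhattan plot is the usual negative log10 of the lfdr-derived p-values; a minimal sketch (floor value is an illustrative guard, not a documented parameter):

```python
import math

def neg_log10_pvalues(pvalues, floor=1e-300):
    """Convert p-values to the -log10 scale used on a manhattan
    plot's y-axis; tiny values are floored to avoid log(0)."""
    return [-math.log10(max(p, floor)) for p in pvalues]
```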
  • Updated all test cases according to the above changes

    • Files updated/removed/created
      • src/test/scala/au/csiro/variantspark/api
        • /CommonPairwiseOperationTest.scala
        • /ImportanceApiTest.scala
      • src/test/scala/au/csiro/variantspark/misc
        • /ReproducibilityTest.scala
        • /CovariateReproducibilityTest.scala
      • src/test/scala/au/csiro/variantspark/test
        • /TestSparkContext.scala
      • python/varspark/test
        • /test_core.py
        • /test_hail.py
        • /test_pvalues_calculation.py
      • src/test/scala/au/csiro/variantspark/work/hail
        • /HailApiApp.scala
  • Removed all files used exclusively in hail version

    • python/varspark/hail
      • __init__.py
      • context.py
      • hail.py
      • methods.py
      • plot.py
    • src/main/scala/au/csiro/variantspark/hail/methods
      • RFModel.scala
  • Removed hail installation from pom.xml

FEAT: Implemented RF class method for fitting the model

FEAT: Implemented RF class method for obtaining importance analysis
from a fitted RF

FEAT: Implemented RF class method for returning oob error

FEAT: Implemented RF class method for obtaining FDR
from a fitted model

FEAT: Implemented RF class method for exporting forest to JSON

REFACTOR: Make RF model available at package level

CHORE: Added type checking to all methods
REFACTOR: Removed FeatureSource and
ImportanceAnalysis classes from core

REFACTOR: Added FeatureSource import so features
can be returned as a class instantiation
REFACTOR: Removed imp analysis and model training

FEAT: Added conversion from feature to RDD (python)

FEAT: Added conversion from feature to RDD (scala)

CHORE: Added type checking
separate wrapper file (#237)

REFACTOR: Updated important_variables and variable_importance
methods to convert to pandas DataFrames
REFACTOR: Removed model training from object instantiation and
updated class to accept a model as a parameter

REFACTOR: Added normalisation as an optional parameter for
variable importance methods

FEAT: Updated variableImportance method to include splitCount in its return value, as it is required for local FDR analysis,
and pass it back to the python context (#237)
from importAnalysis method of AnalyticsFunctions (#237)
FIX: Update export function to process trees in batches,
instead of collecting the whole forest as a map as this
led to OOM errors on large forests
REFACTOR: Refactor to mirror changes to python wrapper

FEAT: Include FDR calculation in unit test
FEAT: Implement function for manhattan plotting negative log p values
FEAT: Add wrapper class for importing covariates

FEAT: Add wrapper class for unioning features and covariates
REFACTOR: Include covariate filtering in manhattan plot function

STYLE: Format with black (#237)
FEAT: Add functions for importing std and transposed CSVs

FEAT: Add function for unioning features and covariates
REFACTOR: Remove python component of converting Feature RDD to pandas

FEAT: Add RDD slice to DF function
REFACTOR: Remove conversion of whole RDD to DataFrame

FEAT: Add function for slicing rows and columns and converting to DF
NickEdwards7502 added the enhancement, dependencies, java and python labels on Oct 2, 2024
NickEdwards7502 self-assigned this on Oct 2, 2024
NickEdwards7502 marked this pull request as ready for review on October 2, 2024 07:10
* .bgz loader function implemented by Christina
* Update python wrapper to include imputation strategy parameter

* Update scala API to pass imputation strategy to VCFFeatureSource

* Create functions to handle mode and zero imputation strategies

* Added imputation strategy to test cases

* Added imputation strategy to FeatureSource cli

* Remove sparkPar from test cases due to changes in class signature

* Updated DefVariantToFeatureConverterTest to use zeros imputation
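The two imputation strategies mentioned above can be sketched in pure Python (missing genotype calls encoded here as None; names are illustrative, the real implementations live in the Scala converters):

```python
from collections import Counter

def impute(calls, strategy="mode"):
    """Fill missing genotype calls: 'mode' substitutes the most common
    observed call, 'zero' substitutes 0 (homozygous reference)."""
    if strategy == "zero":
        fill = 0
    elif strategy == "mode":
        observed = [c for c in calls if c is not None]
        # Fall back to 0 if every call is missing.
        fill = Counter(observed).most_common(1)[0][0] if observed else 0
    else:
        raise ValueError(f"unknown imputation strategy: {strategy}")
    return [fill if c is None else c for c in calls]
```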