Issue/237 #241

Open: wants to merge 53 commits into base: dev
Conversation

NickEdwards7502 (Collaborator)

Major issues and features addressed in this update

  • VariantSpark's python wrapper has been refactored to create Random Forest models from a standalone class

    • Previously, in the non-hail VariantSpark release, the model was initialised and trained from the context of importance analyses. This did not seem appropriate for supporting future releases
    • A new scala function was created to return a trained RandomForest model without using hail
    • Files updated/created
      • python/varspark/rfmodel.py
      • python/varspark/core.py
      • python/varspark/__init__.py
      • src/main/scala/au/csiro/variantspark/api/GetRFModel.scala
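The shape of this refactor can be illustrated with a minimal, self-contained sketch (class and method names here are illustrative, not the actual VariantSpark API): the random forest is a standalone class that is fitted explicitly, and the importance analysis is obtained *from* the fitted model, rather than the model being created and trained inside the analysis.

```python
class RandomForestModel:
    """Illustrative sketch of a standalone RF model class."""

    def __init__(self, n_trees=100, seed=None):
        self.n_trees = n_trees
        self.seed = seed
        self._fitted = False

    def fit(self, features, labels):
        # In the real wrapper, training delegates to the Scala
        # RandomForest via the JVM gateway; here we only record
        # that fitting happened.
        self._fitted = True
        return self

    def importance_analysis(self):
        # Analyses are now derived from a fitted model, not the
        # other way round.
        if not self._fitted:
            raise RuntimeError("fit the model before requesting importances")
        return ImportanceAnalysis(self)


class ImportanceAnalysis:
    def __init__(self, model):
        self.model = model
```

This inverts the old dependency: previously the analysis object owned model construction; now the model is the entry point and analyses hang off it.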
  • A non-hail export model function was created

    • This function now processes trees in batches to remediate OOM errors with very large models
    • Files created
      • src/main/scala/au/csiro/variantspark/api/ExportModel.scala
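The batching idea behind the OOM fix can be sketched in a few lines of pure Python (illustrative names; the real implementation lives in ExportModel.scala): trees are streamed to the output file in fixed-size batches, so the full forest is never collected into a single in-memory structure.

```python
import json

def export_forest_batched(trees, path, batch_size=100):
    """Stream a forest to a JSON array in fixed-size batches so the
    whole model never has to be materialised in memory at once."""
    with open(path, "w") as out:
        out.write("[")
        first = True
        for start in range(0, len(trees), batch_size):
            # Only one batch of serialised trees is in memory at a time.
            for tree in trees[start:start + batch_size]:
                if not first:
                    out.write(",")
                out.write(json.dumps(tree))
                first = False
        out.write("]")
```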
  • The FeatureSource class, which provides wrapper functionalities for initialising genotype data for model training, has been moved to a standalone class

    • For better separation of concerns, this class is now imported to the core python wrapper
    • head(nrows, ncols) allows the first n rows and columns to be viewed as a pandas DataFrame
    • Files updated/created
      • python/varspark/featuresource.py
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/input/FeatureSource.scala
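A minimal sketch of the `head(nrows, ncols)` behaviour, assuming features (variants) as rows and samples as columns (the class shape and field names here are illustrative, not the actual wrapper):

```python
import pandas as pd

class FeatureSource:
    """Illustrative standalone feature-source wrapper."""

    def __init__(self, features, sample_names):
        # features: mapping of feature label -> per-sample values
        self.features = features
        self.sample_names = sample_names

    def head(self, nrows=10, ncols=10):
        """Return the first nrows features x ncols samples as a
        pandas DataFrame for quick inspection."""
        labels = list(self.features)[:nrows]
        data = {
            s: [self.features[l][i] for l in labels]
            for i, s in enumerate(self.sample_names[:ncols])
        }
        return pd.DataFrame(data, index=labels)
```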
  • Covariate support was extended

    • Covariate sources can be created from transposed or non-transposed .csv or .txt files, with optional per-feature type specification
    • From there, covariates can be unioned with a genotype FeatureSource and passed to the model as training data
    • Since Covariates are initialised from the same FeatureSource wrapper class and are also of type RDD[Feature], they also support head()
    • Note that feature and covariate sources can be unioned multiple times
    • Local FDR was updated to remove non-genotype information from manhattan plotting
    • Files updated/created
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/api/VSContext.scala
      • src/main/scala/au/csiro/variantspark/input/CsvStdFeatureSource.scala
      • src/main/scala/au/csiro/variantspark/input/UnionedFeatureSource.scala
      • python/varspark/lfdrvsnohail.py
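Conceptually, the union is just the combined feature set of the two sources, and because the result is itself a feature source, unions can be applied repeatedly. A minimal sketch under the assumption that both sources map feature labels to aligned per-sample values (illustrative representation, not the actual RDD[Feature] types):

```python
def union_sources(base, extra):
    """Union two feature sources represented as mappings from feature
    label to per-sample values; labels must be disjoint and samples
    are assumed to be aligned."""
    if set(base) & set(extra):
        raise ValueError("feature labels must be disjoint")
    merged = dict(base)
    merged.update(extra)
    return merged
```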
  • Importance analyses were moved to a standalone python wrapper class

    • Importance analyses are now created from the context of a random forest model
    • Functionality remains largely the same, with a few changes
      • Both important_variables() and variable_importance() are now returned as pandas DataFrames
      • Split counts are now included in the DataFrame returned by variable_importance() (required for Local FDR calculations)
      • Optional parameter precision supports rounding for variable_importance()
      • Optional parameter normalized indicates whether to normalise importances for both functions
    • Files updated/created
      • python/varspark/importanceanalysis.py
      • python/varspark/core.py
      • src/main/scala/au/csiro/variantspark/api/ImportanceAnalysis.scala
      • src/main/scala/au/csiro/variantspark/api/AnalyticsFunctions.scala
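The DataFrame shape and the two optional parameters can be sketched as follows (function and column names are illustrative; the real computation happens in the Scala ImportanceAnalysis):

```python
import pandas as pd

def variable_importance(raw_importances, split_counts,
                        normalized=True, precision=None):
    """Assemble per-variable importances, plus the split counts needed
    for local FDR, into a pandas DataFrame; optionally normalise and
    round the importances."""
    variables = list(raw_importances)
    df = pd.DataFrame({
        "variable": variables,
        "importance": [raw_importances[v] for v in variables],
        "splitCount": [split_counts[v] for v in variables],
    })
    if normalized:
        df["importance"] = df["importance"] / df["importance"].sum()
    if precision is not None:
        df["importance"] = df["importance"].round(precision)
    return df
```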
  • Move lfdr file to non-hail python directory

    • Created function for manhattan plotting lfdr derived p-values
    • Files removed/created
      • python/varspark/hail/lfdrvs.py
      • python/varspark/lfdrvs.py
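The transform underlying the manhattan plot is the usual negative log10 of the lfdr-derived p-values; a minimal sketch (floor value is an illustrative guard, not a documented parameter):

```python
import math

def neg_log10_pvalues(pvalues, floor=1e-300):
    """Convert p-values to the -log10 scale used on a manhattan
    plot's y-axis; tiny values are floored to avoid log(0)."""
    return [-math.log10(max(p, floor)) for p in pvalues]
```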
  • Updated all test cases according to the above changes

    • Files updated/removed/created
      • src/test/scala/au/csiro/variantspark/api
        • /CommonPairwiseOperationTest.scala
        • /ImportanceApiTest.scala
      • src/test/scala/au/csiro/variantspark/misc
        • /ReproducibilityTest.scala
        • /CovariateReproducibilityTest.scala
      • src/test/scala/au/csiro/variantspark/test
        • /TestSparkContext.scala
      • python/varspark/test
        • /test_core.py
        • /test_hail.py
        • /test_pvalues_calculation.py
      • src/test/scala/au/csiro/variantspark/work/hail
        • /HailApiApp.scala
  • Removed all files used exclusively in hail version

    • python/varspark/hail
      • __init__.py
      • context.py
      • hail.py
      • methods.py
      • plot.py
    • src/main/scala/au/csiro/variantspark/hail/methods
      • RFModel.scala
  • Removed hail installation from pom.xml

FEAT: Implemented RF class method for fitting the model

FEAT: Implemented RF class method for obtaining importance analysis
from a fitted RF

FEAT: Implemented RF class method for returning oob error

FEAT: Implemented RF class method for obtaining FDR
from a fitted model

FEAT: Implemented RF class method for exporting forest to JSON

REFACTOR: Make RF model available at package level

CHORE: Added type checking to all methods
REFACTOR: Removed FeatureSource and
ImportanceAnalysis classes from core

REFACTOR: Added FeatureSource import so features
can be returned as a class instantiation
REFACTOR: Removed imp analysis and model training

FEAT: Added conversion from feature to RDD (python)

FEAT: Added conversion from feature to RDD (scala)

CHORE: Added type checking
separate wrapper file (#237)

REFACTOR: Updated important_variables and variable_importance
methods to convert to pandas DataFrames
REFACTOR: Removed model training from object instantiation and
updated class to accept a model as a parameter

REFACTOR: Added normalisation as an optional parameter for
variable importance methods

FEAT: Updated variableImportance method to include splitCount in its return value, as it is required for local FDR analysis,
and pass it back to the python context (#237)
from importAnalysis method of AnalyticsFunctions (#237)
FIX: Update export function to process trees in batches,
instead of collecting the whole forest as a map as this
led to OOM errors on large forests
REFACTOR: Refactor to mirror changes to python wrapper

FEAT: Include FDR calculation in unit test
FEAT: Implement function for manhattan plotting negative log p values
FEAT: Add wrapper class for importing covariates

FEAT: Add wrapper class for unioning features and covariates
REFACTOR: Include covariate filtering in manhattan plot function

STYLE: Format with black (#237)
FEAT: Add functions for importing std and transposed CSVs

FEAT: Add function for unioning features and covariates
REFACTOR: Remove python component of converting Feature RDD to pandas

FEAT: Add RDD slice to DF function
REFACTOR: Remove conversion of whole RDD to DataFrame

FEAT: Add function for slicing rows and columns and converting to DF
NickEdwards7502 added the enhancement, dependencies, java and python labels on Oct 2, 2024
NickEdwards7502 self-assigned this on Oct 2, 2024
NickEdwards7502 marked this pull request as ready for review on October 2, 2024 07:10
* .bgz loader function implemented by Christina
* Update python wrapper to include imputation strategy parameter

* Update scala API to pass imputation strategy to VCFFeatureSource

* Create functions to handle mode and zero imputation strategies

* Added imputation strategy to test cases

* Added imputation strategy to FeatureSource cli

* Remove sparkPar from test cases due to changes in class signature

* Updated DefVariantToFeatureConverterTest to use zeros imputation
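The two imputation strategies mentioned above can be sketched in pure Python (missing genotype calls encoded here as None; names are illustrative, the real implementations live in the Scala converters):

```python
from collections import Counter

def impute(calls, strategy="mode"):
    """Fill missing genotype calls: 'mode' substitutes the most common
    observed call, 'zero' substitutes 0 (homozygous reference)."""
    if strategy == "zero":
        fill = 0
    elif strategy == "mode":
        observed = [c for c in calls if c is not None]
        # Fall back to 0 if every call is missing.
        fill = Counter(observed).most_common(1)[0][0] if observed else 0
    else:
        raise ValueError(f"unknown imputation strategy: {strategy}")
    return [fill if c is None else c for c in calls]
```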