Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Provide compatibility with HDF5Matrix #99

Merged
merged 14 commits into from
Oct 6, 2022

Conversation

cvanderaa
Copy link
Contributor

Hello,

This PR provides support for matrices stored in HDF5. This is in response to an ongoing issue in QFeatures: rformassspectrometry/QFeatures#171

There are several functionality in MsCoreUtils that do not allow to provide matrices in an other class than matrix. The functionality I currently identified to be adapted (for QFeatures) are aggregation, imputation and normalization.

Discussion

please do not merge! In the current state of the PR, I adapted the code only for aggregation as an example. I would like to discuss several points:

  1. Do you would agree on having HDF5 support for MsCoreUtils?
  2. If yes, do you agree on the implementation so far?
  3. I face one important consideration. Given an HDF5Matrix object as input, should we expect an HDF5Matrix object as output? For the moment, I convert to matrix for convenience so that aggregate_by_{matrix|vector}() systematically return a matrix.
  4. Later on, I can work on imputation and normalization. Can you think of other steps that should be adapted?

bonus

I found a little bug in aggregate_by_matrix() when missing data is present. I solved the issue and also include an na.rm argument as suggested in the example. I can make a separate PR if needed.

@jorainer
Copy link
Member

Thanks for the PR Christophe!

Some of my thoughts (haven't had a close look at the code yet):

  1. I would like to keep the dependencies of the MsCoreUtils as low as possible. We have many packages importing from MsCoreUtils, so I would prefer to have it as lightweight as possible.

  2. Looking at the code I'm a bit puzzled - you're actually not importing anything from HDF5Matrix - all the calls I see are Matrix::. I suppose that HDF5Matrix extends Matrix? Do you actually need to put HDF5Matrix into Suggests? Should also work without that - if you only use HDF5Matrix in the documentation or unit tests you can also put it into Enhances. That could actually resolve the point 1) above. If the user provides an HDF5Matrix as input he/she will have loaded the package anyway, so the namespace (if it's using S4 classes) should be available for MsCoreUtils at that stage anyway.

  3. I would suggest the functions that work on matrices and return matrices to return the same type of matrix: matrix as input, matrix as output, Matrix as input Matrix as output.

@cvanderaa
Copy link
Contributor Author

Thanks for these first thoughts:

  • 3. I'll adapt the code so that input and output class are consistent.
  • 1. and 2. I understand you want to depend on as few packages as possible. Thanks for pointing to the Enhances field, I was not aware of it and indeed I have added it in Suggest because I need it for the unit tests. I agree with you that an HDF5Matrix as input means that the HDF5Array is already loaded and MsCoreUtils is saved from another dependency.

I'll work on these points and will get back once I have something to show.

@cvanderaa
Copy link
Contributor Author

I adapted the points discussed above. I only face an issue with the Enhances field. It seems HDF5Array is no longer accessible for the unit tests.

I would love to get a code review.

@@ -1,3 +1,5 @@
library(HDF5Array)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding this here I would suggest to require the namespace for the package only in the unit tests were you use this package. See for example: here

Also, I would then maybe manually install the HDF5Array in the github action (using BiocManager::install("HDF5Array").

@jorainer
Copy link
Member

I'll try to have a look at the code over the weekend. In the meantime you could maybe try to fix the GH error on Windows with my suggestion in the comment above.

Copy link
Member

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Christophe for the PR!

I only have some small comments/requests mostly on the documentation and unit tests.

##' try(aggregate_by_matrix(x, adj, colSumsMat, na.rm = TRUE))
##' colSumsMat
##' c("A", "B"))))
##' aggregate_by_matrix(x, adj, colSumsMat, na.rm = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the documentation a bit confusing as it is stated that na.rm is not used. Could you maybe expand/fix the documentation and clearly state when na.rm can be used and where not?

tests/testthat/test_aggregate.R Show resolved Hide resolved
Copy link
Member

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks!

@cvanderaa cvanderaa marked this pull request as draft October 4, 2022 07:49
@cvanderaa cvanderaa marked this pull request as ready for review October 6, 2022 09:39
@cvanderaa
Copy link
Contributor Author

The PR now enables HDF5 compatibility for aggregation, imputation and normalization. @lgatto would you mind a final review before merging?

One important point that I currently leave aside is that the functions are compatible for aggregation, imputation and normalization, but are far from optimized. In fact, I convert the HDF5Matrix object to matrix at the start of the function and save a new HDF5 dataset with the processed matrix at the end of the function. A cleverer implementation would define new functions specifically for HDF5Matrix objects and handle block chunking using DelayedArray::blockApply(). This will ask quite some time to implement because the chunking will depend on the normalization or imputation method used. If needed I can do this in a new PR, but that will for sure not be ready before Bioc3.17 release.

Copy link
Member

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me. Maybe wait for feedback from @lgatto

@lgatto
Copy link
Member

lgatto commented Oct 6, 2022

Fine by me, thank you @cvanderaa !

@lgatto lgatto merged commit b63d434 into rformassspectrometry:master Oct 6, 2022
@lgatto
Copy link
Member

lgatto commented Oct 6, 2022

Pushed to Bioc.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants