randomized svd draft #3008
Conversation
@petrelharp Here's the code. |
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
+ Coverage   89.92%   89.95%   +0.03%
==========================================
  Files          29       29
  Lines       32362    32477     +115
  Branches     5802     5822      +20
==========================================
+ Hits        29101    29216     +115
  Misses       1859     1859
  Partials     1402     1402
|
This looks great! Very elegant. I think probably we ought to include a [...]. So, how about the signature is like [...]
and: [...]
Note that we could be getting PCs for non-sample nodes (since an individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this? |
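To make the preceding concrete, a signature in that spirit might look like the following (a sketch only; the parameter names here are illustrative, not the PR's final API):

```python
def pca(
    self,
    num_components,
    windows=None,       # genome positions, as in the other stats methods
    individuals=None,   # restrict output to these individuals' nodes
    mode="branch",
    centre=True,
    iterated_power=5,
    random_seed=None,
):
    ...
```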
Ah, sorry - one more thing - does this work with [...]? I think the way to do the windows would be something like [...]
Basically - get it to work in the case where [...] |
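A minimal sketch of that plumbing, assuming the convention the other statistics use, where `windows=None` means one whole-sequence window with the windows axis dropped (`one_window` here is a stand-in for the per-window computation, not this PR's code):

```python
import numpy as np

def apply_by_windows(ts, one_window, windows=None):
    # one_window(left, right) computes the result for a single window.
    drop_dimension = windows is None
    if drop_dimension:
        windows = [0.0, ts.sequence_length]
    results = [
        one_window(left, right)
        for left, right in zip(windows[:-1], windows[1:])
    ]
    # windows=None: return the bare result; otherwise stack the
    # per-window results along a leading windows axis.
    return results[0] if drop_dimension else np.stack(results)
```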
A simple test case for the [...]:
Because of the randomness of the algorithm, the correlation is not exactly 1, although it's nearly 1, like 0.99995623-ish. |
I just noticed that [...] |
Check results for two windows.
|
Okay; here I've made a good start on the tests. I think everything is working fine; the tests are not passing because (I think) of numerical tolerance. I could just set the numerical tolerance to like 1e-4 and they'd pass, but I think this is flagging a bigger issue - how do we tell we're getting good answers? I ask because currently the tests pass (at default tolerances) for small numbers of samples but not for 30 samples; if I increase [...]
We'd like this to be not a can of worms; I think our goal is to have something that is good enough, and forwards-compatible with an improved method in the future. Notes: [...]
TODO: [...]
|
Just a quick note that I'd be very much in favour of returning a dataclass here rather than a tuple so that the option of returning more information about convergence etc is open. |
There's an adaptive rangefinder algorithm described in Halko et al. (https://arxiv.org/pdf/0909.4061, Algo 4.2). I don't see it implemented in scikit-learn (https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD). I like Jerome's idea to return a class instead of the result. There's an intermediate matrix [...] |
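Relatedly, the a posteriori error estimate from the same paper is cheap to evaluate with a handful of random probes; a rough sketch (the `matvec` callable standing in for products with the implicit GRM is an assumption, not this PR's API):

```python
import numpy as np

def estimate_range_error(matvec, Q, num_probes=10, seed=None):
    # A posteriori estimate from Halko et al. (2011), sec. 4.3: with
    # Gaussian probes w_i, ||(I - QQ^T)A|| is bounded, with high
    # probability, by 10*sqrt(2/pi) * max_i ||(I - QQ^T) A w_i||.
    rng = np.random.default_rng(seed)
    norms = []
    for _ in range(num_probes):
        w = rng.normal(size=Q.shape[0])
        y = matvec(w)  # one product with the implicit matrix
        norms.append(np.linalg.norm(y - Q @ (Q.T @ y)))
    return 10 * np.sqrt(2 / np.pi) * max(norms)
```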
Not a class object yet, but I added random_sketch to the input/output. |
Re the result object, I'd imagined something like:

```python
import dataclasses
import numpy as np

@dataclasses.dataclass
class PcaResult:
    descriptive_name1: np.ndarray  # Or whatever type hints we can get to work
    descriptive_name2: np.ndarray
    # ... and so on
```
|
Now, [...]
A user can continuously improve their estimate through `Q`. If the first round did [...] |
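For illustration, resuming from a previous run's sketch might look like this (hypothetical names; it assumes `pca` both accepts and returns the range sketch, per the `random_sketch` change mentioned above):

```python
# First pass: a few power iterations, keeping the range sketch.
result = ts.pca(num_components=5, iterated_power=3)

# Later: resume from the previous sketch to sharpen the estimate,
# instead of restarting from a fresh random matrix.
refined = ts.pca(num_components=5, iterated_power=3,
                 random_sketch=result.range_sketch)
```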
This looks great, but I would suggest we break the nested functions out to the module level rather than embedding them in the TreeSequence class. The function is currently too long, and it's not clear what needs to be embedded within the function because it uses the enclosing namespace, vs what's in there just because. It would be nice to be able to test the bits of this individually, and putting them at the module level will make that possible.
Certainly the return class should be defined at the module level and added to the Sphinx documentation so that it can be linked to.
Minor nitpick about code organisation!
It now has a time-resolved feature: you can select branches within lower and upper time limits. It is based on `decapitate`. |
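A rough sketch of the idea, assuming branch-mode relatedness is additive over branches (so the contribution of branches between two times is a difference of two decapitated computations; centring is skipped here to keep the additivity argument clean):

```python
# Branch material below time_upper:
v_upper = ts.decapitate(time_upper).genetic_relatedness_vector(
    W, mode="branch", centre=False
)
# Branch material below time_lower:
v_lower = ts.decapitate(time_lower).genetic_relatedness_vector(
    W, mode="branch", centre=False
)
# Contribution of branches with times in [time_lower, time_upper):
v = v_upper - v_lower
```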
NICE! |
Also - I wonder if a better name for [...]? Oh, I see that scikit-learn says [...] |
I'd vote for being compatible with scikit-learn second, and compatible with our other APIs first. |
What is the exact failure about? Is it the eigenvalues not matching or the factor scores not matching? In terms of the factor scores, I think it's better to compare [...] |
The failure is that the two answers are Definitely Not the Same. Clicking on the "Tests / ..." link above, or running locally, we get [...]
To run this locally, do [...] |
I hear what you're saying about comparing [...]. But - I don't think that's the problem - inserting a [...]
|
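For what it's worth, a sign-invariant comparison (principal components are defined only up to the sign of each column) might look like this hypothetical test helper:

```python
import numpy as np

def assert_pcs_close(U, V, **kwargs):
    # Align the sign of each column before comparing, since an
    # eigenvector and its negation span the same component.
    for j in range(U.shape[1]):
        s = np.sign(U[:, j] @ V[:, j]) or 1.0
        np.testing.assert_allclose(U[:, j], s * V[:, j], **kwargs)
```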
Good catch!
And - hah - I guess I missed the [...] |
Now all the tests are passing on my local machine. The error was due to a minor omission of the individuals option in the [...] |
Hey, that's great! I still need to scrutinize codecov, but there's one not totally trivial linting issue: [...]
(I get this by running the pre-commit hooks; see the developer docs for how to set this up, in particular by [...])
Would you like to have a look at how to deal with that? (I haven't figured out what it means yet.) |
The linting error appears when a function that is defined within a loop refers to the loop variable: the closure captures the variable by reference, not its value at definition time. I can't immediately think of a case where this causes a problem here, but the point is that the use of the loop variable inside the function should be explicit. An easy fix is to add the loop variable to the function's arguments, with the loop variable as its default value. This is adapted from: https://stackoverflow.com/questions/54947434/function-definition-in-a-for-loop-using-local-variable |
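A minimal illustration of the pattern the linter flags, and the default-argument fix:

```python
weights = [1, 2, 3]

# Flagged (flake8-bugbear B023): the closure captures `w` by reference,
# so every function ends up seeing the final loop value.
scaled = [lambda x: x * w for w in weights]
print([f(10) for f in scaled])  # [30, 30, 30] - all use w == 3

# The fix: bind the current value as a default argument.
scaled = [lambda x, w=w: x * w for w in weights]
print([f(10) for f in scaled])  # [10, 20, 30]
```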
I don't know why the docs build is failing to detect the new PCAResult class. |
@benjeffery any thoughts on why the docs build is failing? If we squash the commits down to a reasonable number (5, say, or just 1 may be easier) then I think we can log an issue to track beefing up the test coverage a bit, and merge. |
Docs build is failing as there is a link to the [...] |
I can't tell you how great it is to write code under the guidance of expert maintainers. I added the [...] |
Something funny is going on with codecov: I think the missing line numbers are out of sync with the code. I'd like to at least look at the coverage before merging - let me just squash down the commits, that might help. |
I know, right? I have learned so much from Jerome & Ben. |
Likewise Peter! ❤️ |
- add windows feature
- return range sketch matrix Q
- support for subset of samples and individuals in a tree
- lint requires a function inside a loop to be explicit about the loop variable; fixed it by setting a default argument value
- fix docs build error; add PCAResult to python-api.md
Great. Tests cover all parts of the code and are passing. |
Happy to merge when you are @petrelharp |
Let's merge, but there's been no testing so far of the "error bound"; I'd like to get at least a little scrutiny on that before releasing. |
I'll look at it overnight. |
Description
A draft of randomized principal component analysis (PCA) using `TreeSequence.genetic_relatedness_vector`. The implementation contains `scipy.sparse`, which should eventually be removed. This part of the code is only used when collapsing a `#sample * #sample` GRM into a `#individual * #individual` matrix, so it will not be difficult to replace with pure numpy.

The API was partially taken from scikit-learn. To add some details: `iterated_power` is the number of power iterations in the range finder of the randomized algorithm. The error of the SVD decreases exponentially as a function of this number, and the effect of power iteration is profound when the eigenspectrum of the matrix decays slowly, which seems to be the case for tree sequence GRMs in my experience. `indices` specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.
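For orientation, the overall scheme is roughly the following (a minimal sketch under stated assumptions, not the PR's exact code: it assumes `genetic_relatedness_vector` accepts a matrix of weight columns, and the function and parameter names here are illustrative):

```python
import numpy as np

def randomized_grm_eigh(ts, num_components, iterated_power=5,
                        oversample=10, seed=None):
    # Randomized eigendecomposition of the GRM, which is never formed
    # explicitly: it is touched only through matrix products supplied
    # by genetic_relatedness_vector.
    rng = np.random.default_rng(seed)

    def matvec(W):
        # Product of the implicit num_samples x num_samples GRM with
        # the columns of W.
        return ts.genetic_relatedness_vector(W, mode="branch", centre=True)

    # Range finder with power iteration: each round sharpens the
    # estimated range of the GRM, which matters when the eigenspectrum
    # decays slowly.
    Q = rng.normal(size=(ts.num_samples, num_components + oversample))
    for _ in range(iterated_power):
        Q, _ = np.linalg.qr(matvec(Q))

    # Project onto the estimated range and solve the small dense problem.
    B = Q.T @ matvec(Q)
    eigenvalues, V = np.linalg.eigh((B + B.T) / 2)  # symmetrize for safety
    order = np.argsort(eigenvalues)[::-1][:num_components]
    return eigenvalues[order], (Q @ V)[:, order]
```

The oversampling and the final truncation to `num_components` follow the usual randomized-SVD recipe (Halko et al. 2011).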