PCA behavior #367

Open
chasemc opened this issue Nov 12, 2024 · 3 comments

@chasemc (Member) commented Nov 12, 2024

Would it be okay to switch:

    if n_components > pca_dimensions and pca_dimensions != 0:
        logger.debug(
            f"Performing decomposition with PCA (seed {seed}): {n_components} to {pca_dimensions} dims"
        )
        X = PCA(n_components=pca_dimensions, random_state=random_state).fit_transform(X)
        # X = PCA(n_components='mle').fit_transform(X)
        n_samples, n_components = X.shape

to something that adapts to a lower PCA dimension when there aren't enough contigs/k-mers:

    if n_components > pca_dimensions and pca_dimensions != 0:
        if n_samples < pca_dimensions:
            # Cap the target dimensionality so PCA doesn't fail when there
            # are fewer samples (contigs) than requested components.
            logger.warning(
                f"n_samples ({n_samples}) is less than pca_dimensions ({pca_dimensions}), lowering pca_dimensions to {min(n_samples, n_components)}."
            )
            pca_dimensions = min(n_samples, n_components)
        logger.debug(
            f"Performing decomposition with PCA (seed {seed}): {n_components} to {pca_dimensions} dims"
        )
        X = PCA(n_components=pca_dimensions, random_state=random_state).fit_transform(X)
        n_samples, n_components = X.shape
@chasemc (Member, Author) commented Nov 12, 2024

To be clear: as written, this would only kick in when there are fewer "samples" (contigs) than PCA dimensions.
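
For reference, a minimal sketch of the failure this guards against (assuming scikit-learn and NumPy; the 10 × 5000 shape is made up to mimic a tiny test dataset of 10 contigs with 5000 k-mer features). scikit-learn's PCA raises a ValueError whenever n_components exceeds min(n_samples, n_features), and the proposed cap avoids that:

    # Hypothetical tiny dataset: 10 "contigs" x 5000 k-mer features.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(42).random((10, 5000))
    n_samples, n_components = X.shape
    pca_dimensions = 50

    try:
        # Fails: n_components must be <= min(n_samples, n_features) == 10.
        PCA(n_components=pca_dimensions).fit_transform(X)
    except ValueError as err:
        print(err)

    # With the proposed cap, the same call succeeds.
    pca_dimensions = min(n_samples, n_components)
    X_reduced = PCA(n_components=pca_dimensions).fit_transform(X)
    print(X_reduced.shape)  # (10, 10)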

@jason-c-kwan (Collaborator) commented

What would the point be of doing PCA on a dataset with fewer than 50 contigs before some other dimension-reduction technique? I think, before making this change, some data should be gathered on whether it is useful or makes a difference.

@chasemc (Member, Author) commented Nov 12, 2024

The main reason is so that a minimal dataset, one that doesn't take forever to run, doesn't fail when testing the workflows.
