Support pandas v1: Switch from SparseDataFrame to "regular ... DataFrame with sparse values" #258

Closed
fedarko opened this issue Dec 18, 2019 · 4 comments · Fixed by #322
Labels
external issues/bugs with other libraries, frameworks, etc.; might include reproducing an issue minimally good first issue Good for newcomers important Things that are critical for getting Qurro in a working/useful state optimization Making code faster or cleaner

Comments

@fedarko
Collaborator

fedarko commented Dec 18, 2019

See https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating. For the time being, running Qurro prints a bunch of warning messages repeating this, so avoiding them would make the UX a lot nicer.

Addressing this might not actually be that much of a pain: it seems like the only place sparse structures are explicitly used is in biom_table_to_sparse_df(), and the remaining references to them are in comments/etc.:

/Users/mfedarko/Dropbox/Work/KnightLab/qurro> grep -ri "Sparse" qurro/*
Binary file qurro/__pycache__/_df_utils.cpython-36.pyc matches
Binary file qurro/__pycache__/generate.cpython-36.pyc matches
Binary file qurro/__pycache__/_rank_utils.cpython-36.pyc matches
qurro/_df_utils.py:def biom_table_to_sparse_df(table, min_row_ct=2, min_col_ct=1):
qurro/_df_utils.py:    """Loads a BIOM table as a pd.SparseDataFrame. Also calls validate_df().
qurro/_df_utils.py:       To get around this, we extract the scipy.sparse.csr_matrix data from the
qurro/_df_utils.py:       BIOM table and directly convert that to a pandas SparseDataFrame.
qurro/_df_utils.py:    logging.debug("Creating a SparseDataFrame from BIOM table.")
qurro/_df_utils.py:    table_sdf = pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0)
qurro/_df_utils.py:    # in to the SparseDataFrame.
qurro/_df_utils.py:    logging.debug("Converted BIOM table to SparseDataFrame.")
qurro/_df_utils.py:       df_old: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       df_new: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       table: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       (m_table, m_sample_metadata): both pd.[Sparse]DataFrame
qurro/_df_utils.py:    """Returns a "sparse" representation of a dict of counts data.
qurro/_df_utils.py:    sparse_count_dict = {}
qurro/_df_utils.py:        sparse_count_dict[feature_id] = fdict
qurro/_df_utils.py:    return sparse_count_dict
qurro/_df_utils.py:       table_sdf: pd.SparseDataFrame
qurro/_df_utils.py:       table_sdf: pd.DataFrame (or pd.SparseDataFrame)
qurro/_rank_utils.py:    table: pd.SparseDataFrame, ranks: pd.DataFrame, extreme_feature_count: int
qurro/_rank_utils.py:       table: pd.SparseDataFrame
qurro/_rank_utils.py:            A SparseDataFrame representation of a BIOM table. This can be
qurro/_rank_utils.py:            qurro._df_utils.biom_table_to_sparse_df().
qurro/_rank_utils.py:       (table, ranks): (pandas.SparseDataFrame, pandas.DataFrame)
qurro/_rank_utils.py:    # >>> df = pd.SparseDataFrame(np.zeros(34000000).reshape(17000, 2000))
qurro/generate.py:    biom_table_to_sparse_df,
qurro/generate.py:       3. Converts the BIOM table to a SparseDataFrame by calling
qurro/generate.py:          biom_table_to_sparse_df().
qurro/generate.py:       output_table: pd.SparseDataFrame
qurro/generate.py:    table = biom_table_to_sparse_df(biom_table)
qurro/generate.py:    table_sdf: pd.SparseDataFrame
qurro/support_files/js/display.js:        /* Gets count data from the featureCts object. This uses a sparse
qurro/tests/test_filter_unextreme_features.py:from qurro.generate import biom_table_to_sparse_df, process_input
qurro/tests/test_filter_unextreme_features.py:    # ...And yeah we're actually making it into a Sparse DF because that's what
qurro/tests/test_filter_unextreme_features.py:    output_table = biom_table_to_sparse_df(biom_table)
qurro/tests/test_df_utils.py:    # Test that it works even when the data is totally sparse
Binary file qurro/tests/__pycache__/testing_utilities.cpython-36.pyc matches
Binary file qurro/tests/__pycache__/test_filter_unextreme_features.cpython-36-pytest-5.1.2.pyc matches
qurro/tests/testing_utilities.py:    biom_table_to_sparse_df,
qurro/tests/testing_utilities.py:    # Load the table as a Sparse DF, and then match it up with the sample
qurro/tests/testing_utilities.py:    table = biom_table_to_sparse_df(load_table(biom_table_loc))
qurro/tests/web_tests/tests/test_rrvdisplay_compute_balance.js:                // Check that sparse data is handled properly (i.e. 0s are
qurro/tests/web_tests/instrumented_js/display.js:         */validateSampleID(sampleID){cov_1wpg1oiw7k.f[58]++;cov_1wpg1oiw7k.s[313]++;if(!this.sampleIDs.includes(sampleID)){cov_1wpg1oiw7k.b[54][0]++;cov_1wpg1oiw7k.s[314]++;throw new Error("Invalid sample ID: "+sampleID);}else{cov_1wpg1oiw7k.b[54][1]++;}}/* Gets count data from the featureCts object. This uses a sparse
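
For reference, the migration guide linked above replaces construction from a scipy sparse matrix with `pd.DataFrame.sparse.from_spmatrix()`. Here is a minimal sketch of what that could look like for the `pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0)` call, assuming the matrix is a `scipy.sparse.csr_matrix`. The helper name `spmatrix_to_sparse_df` and the toy IDs and counts are hypothetical, not Qurro's actual code:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def spmatrix_to_sparse_df(matrix, index=None, columns=None):
    """Build a regular DataFrame whose columns hold sparse values.

    Hypothetical helper: stands in for the old
    pd.SparseDataFrame(matrix, default_fill_value=0.0) call.
    """
    # from_spmatrix is the pandas (>= 0.25) constructor for scipy sparse input.
    return pd.DataFrame.sparse.from_spmatrix(matrix, index=index, columns=columns)


# Toy stand-in for table.matrix_data (features x samples); values are made up.
counts = csr_matrix(np.array([[0.0, 3.0, 0.0], [1.0, 0.0, 0.0]]))
table_df = spmatrix_to_sparse_df(
    counts, index=["F1", "F2"], columns=["S1", "S2", "S3"]
)

# Every column should be sparse-backed, and densifying restores the counts.
all_sparse = all(isinstance(dtype, pd.SparseDtype) for dtype in table_df.dtypes)
dense = table_df.sparse.to_dense()
```

The result is an ordinary `pd.DataFrame`, so sparse-specific operations go through the `.sparse` accessor (e.g. `table_df.sparse.to_dense()`) rather than a separate class.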
@fedarko fedarko added administrative Logistical matters that don't have much or anything to do with code backburner Low-priority things that are still good to keep track of labels Dec 18, 2019
@fedarko fedarko self-assigned this Dec 18, 2019
@fedarko fedarko added external issues/bugs with other libraries, frameworks, etc.; might include reproducing an issue minimally optimization Making code faster or cleaner and removed administrative Logistical matters that don't have much or anything to do with code backburner Low-priority things that are still good to keep track of labels Dec 18, 2019
@fedarko
Collaborator Author

fedarko commented Dec 18, 2019

issue labels somehow got messed up, huh

@fedarko fedarko added the good first issue Good for newcomers label Dec 18, 2019
@fedarko fedarko added the important Things that are critical for getting Qurro in a working/useful state label Feb 21, 2020
@fedarko
Collaborator Author

fedarko commented Feb 21, 2020

Upgrading to important, since we need to get this done for the next pandas release: biocore/songbird#117

@mortonjt

mortonjt commented Feb 21, 2020 via email

@ElDeveloper
Member

ElDeveloper commented Feb 21, 2020 via email

fedarko added a commit to fedarko/qurro that referenced this issue Feb 29, 2020
@fedarko fedarko changed the title Switch from SparseDataFrame to "regular ... DataFrame with sparse values" Support pandas v1: Switch from SparseDataFrame to "regular ... DataFrame with sparse values" Sep 8, 2020
fedarko added a commit to fedarko/qurro that referenced this issue Jul 5, 2022
See biocore#258 and biocore#315. not confident this is done yet (and if nothing
else the rest of the code gleefully refers to "SparseDataFrame"
because 2019 marcus was a schmuck), but this at least fixes a fair
amount of failing tests
fedarko added a commit to fedarko/qurro that referenced this issue Jul 5, 2022
The problem was using .loc[] on these sparse dataframes. whoops
fedarko added a commit that referenced this issue Oct 20, 2022
…QIIME 2 (#322)

* DEP: Update setup.py re: python and pandas #315

* DEV: port CI from Travis to GH Actions: close #316

* TST: For now, omit "make notebooks" from CI

Maybe we can make another GitHub Actions for these later; but
Songbird is causing tensorflow nonsense to pop up, and this is not
the sort of thing I think we should spend time fixing (esp with
the advent of birdman)

* DEP: pin min biom vsn and add some comments

* DEP: Fix biom_table_to_sparse_df for pandas >= 1

See #258 and #315. not confident this is done yet (and if nothing
else the rest of the code gleefully refers to "SparseDataFrame"
because 2019 marcus was a schmuck), but this at least fixes a fair
amount of failing tests

* DEP: remove some warnings, docs, fix a test re: pd

* TST: Fix the python tests!!! #258

The problem was using .loc[] on these sparse dataframes. whoops

* STY: tiny style fixes

* DEP: knock out some pandas warnings

* DEP: np.matrix() -> np.array() in qarcoal tests

since apparently it's deprecated, or about to be deprecated, idk

* DEP/STY: Fix more warnings; remove unused import

most of these warnings were just pd.DataFrame.append() being
deprecated and replaced with pd.concat()

* DOC: one of the demos' JS data slightly changed

looks like it's a tiny floating-point thing -- probably an artifact
of working here on a new operating system, on a new python version,
a new pandas version, a new biom version, etc. shouldn't make a
noticeable difference

* DOC: update readme re: min Q2 vsn

* TST: matrix of qiime 2 versions

nice!

* TST: more detailed comment about Q2 vsn matrix

* DOC: remove the "Sparse" from "SparseDataFrame"

* REL: version kick

* TST: Add standalone CI

IIRC something about how our specific altair version works makes it
incompatible with python 3.10. let's test that here -- if needed,
we can update the README to disallow python versions >= 3.10. (And
then we can look into removing the altair pin when absolutely needed.)

* TST: attempt to get standalone tests working

* TST: attempt to fix pytest q2 exclusion

* DEP: ok py 3.10 is a no go

* STY: fix formatting

* DOC: Rerun 4 / 6 example notebooks

Songbird and ALDEx2 ones will cause problems

* DOC: tidy/update readme refs

* DOC: update jake fish dataset ref on website

* DOC: Fix songbird notebook!, standardize output rm

* BLD: rm (now-)unused comments from q2 ci

* DOC: fix transcriptomics ntbk :)

* REL: update changelog

* REL: update changelog

* TST: see if we can finagle q2 2020.6 / 2020.8?

since i thiiiink these versions mighta worked with the pandas >= 1
syntax

that being said, i don't think it makes sense to devote time/energy
to officially supporting them; just wanna check

* TST: remove Q2 2020.6 / 2020.8 in CI

Looks like the tests themselves pass for these versions, but the
style-checking with black fails due to incompatibility with click.

yeah this is enough for me to not bother supporting these versions
imo

* DOC: songbird compatibility deets

* DEV/DOC: update dev docs re: 2022

the apocalypse came and all i got was this pull request

* REL: update changelog about updating contributing

about about about about aboot

* REL: minor chglog tidying
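
The `pd.DataFrame.append()` to `pd.concat()` change mentioned in the commit notes above can be sketched roughly as follows. The column names and values are made up for illustration and are not taken from Qurro's code:

```python
import pandas as pd

# Toy frames; "feature" and "rank" are hypothetical column names.
rows = pd.DataFrame({"feature": ["F1", "F2"], "rank": [0.5, -1.2]})
new_row = pd.DataFrame({"feature": ["F3"], "rank": [2.0]})

# Old style (deprecated in pandas 1.4, removed in pandas 2.0):
#     combined = rows.append(new_row, ignore_index=True)
# Replacement: pass all the frames to pd.concat() in one list.
combined = pd.concat([rows, new_row], ignore_index=True)
```

When many rows are added in a loop, collecting the pieces in a list and calling `pd.concat()` once at the end avoids repeatedly copying the growing frame.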

3 participants