[python] `to_dataframe` does not produce sparse data frames #808

cdiener · 2019-03-06T23:27:52Z

Hi,

I noticed that the pandas.SparseDataFrame returned by Table.to_dataframe is not actually sparse. For instance, with the American Gut data:

In [15]: bm = load_table("deblur_125nt_no_blooms.biom")

In [16]: bm
Out[16]: 32954 x 9511 <class 'biom.table.Table'> with 1829490 nonzero entries (0% dense)

In [17]: tab = bm.to_dataframe()

In [19]: type(tab)
Out[19]: pandas.core.sparse.frame.SparseDataFrame

In [20]: tab.density
Out[20]: 1.0

In [21]: tab.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 32954 entries, AACGTAGGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGAAGGCTAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGGTCATCTAGAGTG to TACGGGGGATGCGAGCGTTATCCGGATTCATTGGGTTTAAAGGGTGCGCAGGCCGAGGTTCAAGTCAGCGGTGAAACCCCCGCGCTCAACGCGGGGCATGCCGTTGATACTGTATCTCTGGAGTA
Columns: 9511 entries, 10317.000012326 to 10317.000038478
dtypes: Sparse[float64, nan](9511)
memory usage: 2.3+ GB

This is basically the memory use of the full table including zeros. Also, the densities of the original table and the SparseDataFrame are very different (~0% vs 100%).
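As a rough check on that claim, the numbers reported above can be plugged into a quick calculation. This is only a back-of-the-envelope sketch: it assumes 8 bytes per float64 value and, for the sparse estimate, one extra 4-byte integer index per stored value, which is an assumption about the storage layout rather than something shown in the output above.

```python
# Shape and nonzero count as printed by the biom Table repr above.
rows, cols = 32954, 9511
nonzero = 1829490

cells = rows * cols
dense_bytes = cells * 8           # every cell stored as float64
dense_gib = dense_bytes / 2**30   # bytes -> GiB

# Assumed sparse layout: 8-byte value + 4-byte index per stored entry.
sparse_bytes = nonzero * (8 + 4)
sparse_mib = sparse_bytes / 2**20

density = nonzero / cells

print(f"dense:   {dense_gib:.2f} GiB")
print(f"sparse:  {sparse_mib:.1f} MiB (estimate)")
print(f"density: {density:.2%}")
```

The dense figure comes out at roughly 2.3 GiB, in line with the "2.3+ GB" that `tab.info()` reports, while the true density is well under 1%.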


wasade · 2019-03-06T23:55:39Z

Interesting. So, unlike scipy.sparse, pandas treats NaN, not zero, as the "empty" value that is left out of sparse storage. That's super annoying.

Do you have a fix by chance?

In [1]: import biom

In [2]: print(biom.example_table)
# Constructed from biom file
#OTU ID	S1	S2	S3
O1	0.0	1.0	2.0
O2	3.0	4.0	5.0

In [3]: biom.example_table.to_dataframe()
Out[3]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [4]: biom.example_table.to_dataframe().info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 2 entries, O1 to O2
Data columns (total 3 columns):
S1    2 non-null float64
S2    2 non-null float64
S3    2 non-null float64
dtypes: float64(3)
memory usage: 64.0+ bytes

In [5]: biom.example_table.to_dataframe(dense=True)
Out[5]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [6]: biom.example_table.to_dataframe(dense=True).to_sparse()
Out[6]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [7]: biom.example_table.to_dataframe(dense=True).to_sparse().info
Out[7]:
<bound method DataFrame.info of      S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0>

In [8]: biom.example_table.to_dataframe(dense=True).to_sparse().info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 2 entries, O1 to O2
Data columns (total 3 columns):
S1    2 non-null float64
S2    2 non-null float64
S3    2 non-null float64
dtypes: float64(3)
memory usage: 64.0+ bytes

In [9]: biom.example_table.to_dataframe(dense=True).to_sparse().density
Out[9]: 1.0

cdiener · 2019-03-07T01:32:33Z

I think you would just have to set fill_value = 0.0 in the SparseSeries constructor. I can try with a PR if you'd like.
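To illustrate the effect of the fill value, here is a minimal sketch using the toy data from `biom.example_table` shown earlier. It is not the code from the eventual PR: it assumes a recent pandas (1.0 or later), where `SparseSeries` and `SparseDataFrame` no longer exist and `pd.arrays.SparseArray` plus the `.sparse` accessor are used instead. The helper `to_sparse_frame` is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy counts matching biom.example_table (rows O1, O2; columns S1..S3).
counts = np.array([[0.0, 1.0, 2.0],
                   [3.0, 4.0, 5.0]])
obs_ids = ["O1", "O2"]
sample_ids = ["S1", "S2", "S3"]


def to_sparse_frame(data, index, columns, fill_value):
    """Hypothetical helper: one SparseArray per column with the given fill value."""
    return pd.DataFrame(
        {col: pd.arrays.SparseArray(data[:, j], fill_value=fill_value)
         for j, col in enumerate(columns)},
        index=index,
    )


# With NaN as the fill value, every zero is stored explicitly.
nan_filled = to_sparse_frame(counts, obs_ids, sample_ids, np.nan)

# With 0.0 as the fill value, zeros are left out of the stored values.
zero_filled = to_sparse_frame(counts, obs_ids, sample_ids, 0.0)

print(nan_filled.sparse.density)
print(zero_filled.sparse.density)
```

With NaN as the fill value the density stays at 1.0, as in the output above; with 0.0 the single zero in the toy table is dropped, so the density falls below 1.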

wasade · 2019-03-07T01:56:27Z

That would be wonderful, thank you!

cdiener mentioned this issue Mar 7, 2019

make Table.to_dataframe create real sparse frames #809

Merged

wasade closed this as completed Mar 8, 2019

fedarko mentioned this issue Apr 30, 2019

Store sparse count data JSON in visualization biocore/qurro#58

Closed

fedarko mentioned this issue May 15, 2019

train/test split is inefficient biocore/mmvec#44

Open

fedarko mentioned this issue Feb 16, 2020

Update BIOM version required biocore/qurro#272

Closed

fedarko mentioned this issue Feb 23, 2020

Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838

Closed

fedarko mentioned this issue Jul 5, 2022

Update Qurro to support pandas v1 and up, and thus newer versions of QIIME 2 biocore/qurro#322

Merged
