[python] to_dataframe does not produce sparse data frames #808

Closed
cdiener opened this issue Mar 6, 2019 · 3 comments

Comments

@cdiener
Contributor

cdiener commented Mar 6, 2019

Hi,

I noticed that the pandas.SparseDataFrame returned by Table.to_dataframe is not actually sparse. For instance, with the American Gut data:

In [15]: bm = load_table("deblur_125nt_no_blooms.biom")

In [16]: bm
Out[16]: 32954 x 9511 <class 'biom.table.Table'> with 1829490 nonzero entries (0% dense)

In [17]: tab = bm.to_dataframe()

In [19]: type(tab)
Out[19]: pandas.core.sparse.frame.SparseDataFrame

In [20]: tab.density
Out[20]: 1.0

In [21]: tab.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 32954 entries, AACGTAGGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGAAGGCTAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGGTCATCTAGAGTG to TACGGGGGATGCGAGCGTTATCCGGATTCATTGGGTTTAAAGGGTGCGCAGGCCGAGGTTCAAGTCAGCGGTGAAACCCCCGCGCTCAACGCGGGGCATGCCGTTGATACTGTATCTCTGGAGTA
Columns: 9511 entries, 10317.000012326 to 10317.000038478
dtypes: Sparse[float64, nan](9511)
memory usage: 2.3+ GB

This is essentially the memory use of the full dense table, zeros included (32954 x 9511 float64 values come to about 2.3 GiB). The densities of the original table and the SparseDataFrame also disagree (~0% vs 100%).

@wasade
Member

wasade commented Mar 6, 2019

Interesting. So, unlike scipy.sparse, pandas treats NaN, not zero, as the empty (fill) value by default. That's super annoying.

Do you have a fix by chance?

In [1]: import biom

In [2]: print(biom.example_table)
# Constructed from biom file
#OTU ID	S1	S2	S3
O1	0.0	1.0	2.0
O2	3.0	4.0	5.0

In [3]: biom.example_table.to_dataframe()
Out[3]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [4]: biom.example_table.to_dataframe().info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 2 entries, O1 to O2
Data columns (total 3 columns):
S1    2 non-null float64
S2    2 non-null float64
S3    2 non-null float64
dtypes: float64(3)
memory usage: 64.0+ bytes

In [5]: biom.example_table.to_dataframe(dense=True)
Out[5]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [6]: biom.example_table.to_dataframe(dense=True).to_sparse()
Out[6]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [7]: biom.example_table.to_dataframe(dense=True).to_sparse().info
Out[7]:
<bound method DataFrame.info of      S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0>

In [8]: biom.example_table.to_dataframe(dense=True).to_sparse().info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 2 entries, O1 to O2
Data columns (total 3 columns):
S1    2 non-null float64
S2    2 non-null float64
S3    2 non-null float64
dtypes: float64(3)
memory usage: 64.0+ bytes

In [9]: biom.example_table.to_dataframe(dense=True).to_sparse().density
Out[9]: 1.0
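
For readers on newer pandas releases, where SparseDataFrame and to_sparse are no longer available, the behavior in the session above can be approximated with SparseDtype. This is only a sketch under that assumption and not the code biom uses; the values are copied by hand from the example table so that biom is not needed, and the variable names are made up for illustration.

```python
import pandas as pd

# Same values as biom.example_table, built by hand (assumption: biom not installed).
dense = pd.DataFrame(
    [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]],
    index=["O1", "O2"],
    columns=["S1", "S2", "S3"],
)

# NaN is the default fill value for floats, so the zero in S1 is still stored.
nan_sparse = dense.astype(pd.SparseDtype("float64"))

# With zero as the fill value, the zero in S1 is no longer stored explicitly.
zero_sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

print(nan_sparse.sparse.density)   # 1.0, matching Out[9] above
print(zero_sparse.sparse.density)  # below 1.0, since one of six entries is zero
```

On a table as sparse as the American Gut one, the gap between the two densities would be far larger than in this toy case.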

@cdiener
Contributor Author

cdiener commented Mar 7, 2019

I think you would just have to set fill_value=0.0 in the SparseSeries constructor. I can put together a PR if you'd like.
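
A minimal sketch of what the fill value changes, assuming a recent pandas where SparseSeries has been replaced by pd.arrays.SparseArray (the array below is toy data, not taken from the table in this issue):

```python
import numpy as np
import pandas as pd

# Toy column that is mostly zeros, like a typical feature-table column.
values = np.array([0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0])

# Default fill value for float data is NaN, so every zero is stored explicitly.
nan_filled = pd.arrays.SparseArray(values)

# With fill_value=0.0 only the two nonzero entries are stored.
zero_filled = pd.arrays.SparseArray(values, fill_value=0.0)

print(nan_filled.density)   # 1.0
print(zero_filled.density)  # 0.25
```

The data are identical in both cases; only the choice of which value is left implicit differs.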

@wasade
Member

wasade commented Mar 7, 2019

That would be wonderful, thank you!
