Perf tweaks #2

Closed
40 commits
2a5d4fe
Adding license
mortonjt Jul 16, 2016
2721bf4
Merge pull request #9 from mortonjt/license.txt
mortonjt Jul 16, 2016
b979d41
Last touches before release
mortonjt Jul 16, 2016
1be0fe7
Merge pull request #10 from mortonjt/license.txt
mortonjt Jul 16, 2016
362fe9e
ENH: Adding util functions
mortonjt Jul 19, 2016
a1b9cef
STY: pep8
mortonjt Jul 19, 2016
2908c93
STY: Flake8 to the rescue
mortonjt Jul 19, 2016
e2e2e63
STY: Addressing @antgonza and @josenavas comments
mortonjt Jul 19, 2016
d6a037b
STY: pep8 flake8
mortonjt Jul 19, 2016
38b64c8
GPL license
mortonjt Jul 19, 2016
1cbc6ad
Adding changelog
mortonjt Jul 19, 2016
f79d9a6
Merge pull request #13 from mortonjt/license.txt
antgonza Jul 19, 2016
d197605
DOC: Changing method name
mortonjt Jul 19, 2016
2ded931
STY: clean up pep8/flake8
mortonjt Jul 19, 2016
a54d051
Merge branch 'master' of https://github.com/biocore/gneiss into util
mortonjt Jul 20, 2016
477d196
DOC: Updating changelog
mortonjt Jul 20, 2016
e0c7505
ENH: Adding in niche sorting algorithm
mortonjt Jul 25, 2016
3d9ded6
TST: Adding in length and nan tests
mortonjt Jul 25, 2016
4c8d41d
DOC: typo
mortonjt Jul 25, 2016
371cc16
ENH: Adding inplace option
mortonjt Jul 25, 2016
6d70cf9
STY: Code dereplication
mortonjt Jul 26, 2016
9003db8
DOC: cleaning up docstrings
mortonjt Jul 26, 2016
b049ced
pep8
mortonjt Jul 26, 2016
ff3df22
Merge pull request #12 from mortonjt/util
antgonza Jul 26, 2016
0ed7267
Merge branch 'master' of https://github.com/biocore/gneiss into niche…
mortonjt Jul 26, 2016
ac1832d
FIX: Fixing ordering
mortonjt Jul 26, 2016
42221f5
pep8
mortonjt Jul 26, 2016
23acedf
Adding @josenavas and @antgonza comments
mortonjt Jul 27, 2016
940d7e8
TST: Adding callable error
mortonjt Jul 27, 2016
7ee169c
DOC: Adding raises section
mortonjt Jul 27, 2016
0168b42
DOC: Clarifying formulat in `mean_niche_estimator`
mortonjt Jul 27, 2016
c9e2373
DOC: Correcting formula
mortonjt Jul 27, 2016
8f2d1a7
Merge pull request #16 from mortonjt/nichesort
antgonza Jul 27, 2016
c51b215
Updating changelog
mortonjt Jul 28, 2016
22cbcc3
Merge pull request #19 from mortonjt/nichesort
josenavas Jul 28, 2016
6a247a7
Typo in ipython notebook
mortonjt Jul 29, 2016
f13954d
Fixing typo in ipython notebook
mortonjt Jul 29, 2016
452d62c
Adding another test
mortonjt Aug 4, 2016
ecc88d2
Merge branch 'master' of https://github.com/biocore/gneiss into perf_…
mortonjt Aug 4, 2016
475b96f
Adding additional test cases
mortonjt Aug 4, 2016
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,10 @@
# gneiss changelog

## Version 0.0.2 (changes since 0.0.1 go here)

### Features
* Adding in a niche sorting algorithm `gneiss.sort.niche_sort` that can generate a band table given a gradient [#16](https://github.com/biocore/gneiss/pull/16)
* Adding in utility functions for handling feature tables, metadata, and trees. [#12](https://github.com/biocore/gneiss/pull/12)
* Adding GPL license.

### Bug fixes
674 changes: 674 additions & 0 deletions COPYING.txt

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions gneiss/__init__.py
@@ -1,12 +1,12 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2016--, gneiss development team.
#
-# Distributed under the terms of the Modified BSD License.
+# Distributed under the terms of the GPLv3 License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------

from __future__ import absolute_import, division, print_function


-__version__ = "0.0.1"
+__version__ = "0.0.2"
8 changes: 8 additions & 0 deletions gneiss/layouts.py
@@ -1,3 +1,11 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2016--, gneiss development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------

from ete3 import faces, AttrFace, CircleFace, BarChartFace


116 changes: 116 additions & 0 deletions gneiss/sort.py
@@ -0,0 +1,116 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2016--, gneiss development team.
#
# Distributed under the terms of the GPLv3 License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------
import numpy as np
import pandas as pd
from functools import partial
from gneiss.util import match


def mean_niche_estimator(abundances, gradient):
    r""" Estimates the mean niche of an organism.

    Calculates the mean niche of an organism along a gradient.
    This is done by calculating the expected value of an organism
    across the gradient.

    Specifically, this function calculates the following

    .. math::
        E[g | x] =
        \sum\limits_{i=1}^N g_i \frac{x_i}{\sum\limits_{j=1}^N x_j}

    Where :math:`N` is the number of samples, :math:`x_i` is the proportion of
    species :math:`x` in sample :math:`i`, :math:`g_i` is the gradient value
    at sample :math:`i`.

    Parameters
    ----------
    abundances : pd.Series, np.float
        Vector of fractional abundances of an organism over
        a list of samples.
    gradient : pd.Series, np.float
        Vector of numerical gradient values.

    Returns
    -------
    np.float :
        The mean gradient that the organism lives in.

    Raises
    ------
    ValueError:
        If the length of `abundances` is not the same length as `gradient`.
    ValueError:
        If `gradient` contains nans.
    """
    len_abundances = len(abundances)
    len_gradient = len(gradient)
    if len_abundances != len_gradient:
        raise ValueError("Length of `abundances` (%d) doesn't match the length"
                         " of the `gradient` (%d)" % (len_abundances,
                                                      len_gradient))
    if np.any(pd.isnull(gradient)):
        raise ValueError("`gradient` cannot have any nans.")

    # normalizes the proportions of the organism across all of the
    # samples to add to 1.
    v = abundances / abundances.sum()
    m = np.dot(gradient, v)
    return m


def niche_sort(table, gradient, niche_estimator=mean_niche_estimator):
    """ Sort the table according to estimated niches.

    Sorts the table by samples along the gradient
    and otus by their estimated niche along the gradient.

    Parameters
    ----------
    table : pd.DataFrame
        Contingency table where samples are rows and
        features (i.e. OTUs) are columns.
    gradient : pd.Series
        Vector of numerical gradient values.
    niche_estimator : function, optional
        A function that takes in two pandas series and returns an ordered
        object. The ability for the object to be ordered is critical, since
        this will allow the table to be sorted according to this ordering.
        By default, `mean_niche_estimator` will be used.

    Returns
    -------
    pd.DataFrame :
        Sorted table according to the gradient of the samples, and the niches
        of the organisms along that gradient.

    Raises
    ------
    ValueError :
        Raised if `niche_estimator` is not a function.
    """
    if not callable(niche_estimator):
        raise ValueError("`niche_estimator` is not a function.")

    table, gradient = match(table, gradient)

    niche_estimator = partial(niche_estimator,
                              gradient=gradient)

    # normalizes feature abundances to sum to 1, for each sample.
    # (i.e. scales values in each row to sum to 1).
    normtable = table.apply(lambda x: x/x.sum(), axis=1)

    # calculates estimated niche for each feature
    est_niche = normtable.apply(niche_estimator, axis=0)
    gradient = gradient.sort_values()
    est_niche = est_niche.sort_values()

    table = table.reindex(index=gradient.index,
                          columns=est_niche.index)
    return table
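
Reviewer note: a minimal, dependency-free sketch of what `gneiss/sort.py` computes, written with plain lists and dicts instead of pandas so it can be run on its own. The names `mean_niche` and `niche_sort_sketch` are hypothetical and are not part of gneiss; the sketch also assumes the table and gradient already share the same sample IDs, which the diff handles through `gneiss.util.match`.

```python
def mean_niche(abundances, gradient):
    """Abundance-weighted mean of the gradient: sum_i g_i * x_i / sum_j x_j."""
    if len(abundances) != len(gradient):
        raise ValueError("abundances and gradient must have the same length")
    total = sum(abundances)
    return sum(g * x / total for g, x in zip(gradient, abundances))


def niche_sort_sketch(table, gradient):
    """Order samples by gradient and features by mean niche.

    table: {sample: {feature: count}}, gradient: {sample: value}.
    Returns (ordered sample IDs, ordered feature IDs).
    """
    samples = sorted(gradient, key=gradient.get)
    features = sorted({f for row in table.values() for f in row})
    # Scale each sample (row) to sum to 1, mirroring the row normalization
    # in the diff. An empty sample would divide by zero here.
    norm = {}
    for s in samples:
        row_total = sum(table[s].values())
        norm[s] = {f: table[s].get(f, 0) / row_total for f in features}
    g = [gradient[s] for s in samples]
    niche = {f: mean_niche([norm[s][f] for s in samples], g) for f in features}
    return samples, sorted(features, key=niche.get)


# Same numbers as the unit tests in this pull request.
print(mean_niche([1, 1, 0, 0, 0], [1, 2, 3, 4, 5]))  # 1.5
print(mean_niche([1, 3, 0, 0, 0], [1, 2, 3, 4, 5]))  # 1.75

# Scrambled input: samples and features are out of order.
table = {
    's2': {'o2': 1, 'o1': 1},
    's1': {'o1': 1},
    's3': {'o2': 1, 'o3': 1},
    's4': {'o3': 1, 'o4': 1},
    's5': {'o4': 1},
}
gradient = {'s2': 2, 's1': 1, 's3': 3, 's4': 4, 's5': 5}
print(niche_sort_sketch(table, gradient))
```

Reordering the rows and columns by the two returned lists recovers the diagonal band structure that `test_basic_niche_sort_scrambled` expects.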
1 change: 1 addition & 0 deletions gneiss/tests/data/large_tree2.nwk

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions gneiss/tests/data/small_tree.nwk
@@ -0,0 +1 @@
(O0:0.316212845735,(O1:0.0544249673249,((((O8:6.17283950617e-05,O9:6.17283950617e-05):0.000396077412446,(O6:0.00015943877551,O7:0.00015943877551):0.000298367031998):0.00183128228143,(O4:0.000555555555556,O5:0.000555555555556):0.00173353253338):0.0105194882737,(O2:0.00347222222222,O3:0.00347222222222):0.00933635414042):0.0416163909622):0.26178787841);
20 changes: 18 additions & 2 deletions gneiss/tests/test_balances.py
@@ -9,7 +9,8 @@
from gneiss.layouts import default_layout
from skbio import TreeNode
from skbio.util import get_data_path

from skbio.stats.composition import _check_orthogonality
from gneiss.util import rename_internal_nodes

class TestPlot(unittest.TestCase):

@@ -151,15 +152,30 @@ def test_balance_basis_large1(self):
get_data_path('large_tree_basis.txt',
subfolder='data'))
res_basis, res_keys = balance_basis(t)
_check_orthogonality(res_basis)

exp_basis = exp_basis[:, ::-1]
print(exp_basis.shape)
for i in range(len(res_basis)):
print(i)
print(exp_basis[i]- res_basis[i])
npt.assert_allclose(exp_basis[i], res_basis[i])

-npt.assert_allclose(exp_basis[:, ::-1], res_basis)
+npt.assert_allclose(exp_basis[:, ::-1], res_basis, rtol=1e-5, atol=1e-5)

def test_balance_basis_large2(self):
fname = get_data_path('large_tree2.nwk',
subfolder='data')
t = TreeNode.read(fname)
res_basis, res_keys = balance_basis(t)
_check_orthogonality(res_basis)

def test_balance_basis_small1(self):
fname = get_data_path('small_tree.nwk',
subfolder='data')
t = TreeNode.read(fname)
res_basis, res_keys = balance_basis(t)
_check_orthogonality(res_basis)

if __name__ == "__main__":
unittest.main()
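
Reviewer note: the new tests call the private scikit-bio helper `_check_orthogonality` on the output of `balance_basis`. Its implementation is not shown in this diff, so the following is only a sketch of the property such a check presumably asserts: that the basis vectors are orthonormal under the Aitchison inner product, i.e. their centered log-ratio (clr) transforms are orthonormal in the ordinary sense. The helpers `clr`, `closure` and `is_orthonormal` are hypothetical names written in plain Python.

```python
import math


def clr(composition):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(v) for v in composition]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


def closure(values):
    """Rescale positive values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def is_orthonormal(basis, tol=1e-6):
    """True if the rows of `basis` are orthonormal in the Aitchison sense."""
    rows = [clr(row) for row in basis]
    for i, u in enumerate(rows):
        for j, v in enumerate(rows):
            inner = sum(a * b for a, b in zip(u, v))
            expected = 1.0 if i == j else 0.0
            if abs(inner - expected) > tol:
                return False
    return True


# A 3-part example: two orthonormal clr vectors mapped back to the simplex.
e1 = [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0]
e2 = [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]
basis = [closure([math.exp(v) for v in e]) for e in (e1, e2)]
print(is_orthonormal(basis))  # True
```

The tolerance plays the same role as the `rtol`/`atol` arguments added to `npt.assert_allclose` in this hunk: floating-point round-off on small branch lengths makes exact equality too strict.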
178 changes: 178 additions & 0 deletions gneiss/tests/test_sort.py
@@ -0,0 +1,178 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2016--, gneiss development team.
#
# Distributed under the terms of the GPLv3 License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------
import numpy as np
import pandas as pd
import unittest
from gneiss.sort import niche_sort, mean_niche_estimator
import pandas.util.testing as pdt


class TestSort(unittest.TestCase):
    def setUp(self):
        pass

    def test_mean_niche_estimator1(self):
        gradient = pd.Series(
            [1, 2, 3, 4, 5],
            index=['s1', 's2', 's3', 's4', 's5'])
        values = pd.Series(
            [1, 1, 0, 0, 0],
            index=['s1', 's2', 's3', 's4', 's5'])
        m = mean_niche_estimator(values, gradient)
        self.assertEqual(m, 1.5)

    def test_mean_niche_estimator2(self):
        gradient = pd.Series(
            [1, 2, 3, 4, 5],
            index=['s1', 's2', 's3', 's4', 's5'])
        values = pd.Series(
            [1, 3, 0, 0, 0],
            index=['s1', 's2', 's3', 's4', 's5'])
        m = mean_niche_estimator(values, gradient)
        self.assertEqual(m, 1.75)

    def test_mean_niche_estimator_bad_length(self):
        gradient = pd.Series(
            [1, 2, 3, 4, 5],
            index=['s1', 's2', 's3', 's4', 's5'])
        values = pd.Series(
            [1, 3, 0, 0, 0, 0],
            index=['s1', 's2', 's3', 's4', 's5', 's6'])

        with self.assertRaises(ValueError):
            mean_niche_estimator(values, gradient)

    def test_mean_niche_estimator_missing(self):
        gradient = pd.Series(
            [1, 2, 3, 4, np.nan],
            index=['s1', 's2', 's3', 's4', 's5'])
        values = pd.Series(
            [1, 3, 0, 0, 0],
            index=['s1', 's2', 's3', 's4', 's5'])

        with self.assertRaises(ValueError):
            mean_niche_estimator(values, gradient)

    def test_basic_niche_sort(self):
        table = pd.DataFrame(
            [[1, 1, 0, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 0, 0, 1, 1]],
            columns=['s1', 's2', 's3', 's4', 's5'],
            index=['o1', 'o2', 'o3', 'o4']).T
        gradient = pd.Series(
            [1, 2, 3, 4, 5],
            index=['s1', 's2', 's3', 's4', 's5'])
        res_table = niche_sort(table, gradient)
        pdt.assert_frame_equal(table, res_table)

    def test_basic_niche_sort_error(self):
        table = pd.DataFrame(
            [[1, 1, 0, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 0, 0, 1, 1]],
            columns=['s1', 's2', 's3', 's4', 's5'],
            index=['o1', 'o2', 'o3', 'o4']).T
        gradient = pd.Series(
            [1, 2, 3, 4, 5],
            index=['s1', 's2', 's3', 's4', 's5'])
        with self.assertRaises(ValueError):
            niche_sort(table, gradient, niche_estimator='rawr')

    def test_basic_niche_sort_scrambled(self):
        # Swap samples s1 and s2 and features o1 and o2 to see if this can
        # obtain the original table structure.
        table = pd.DataFrame(
            [[1, 0, 1, 0, 0],
             [1, 1, 0, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 0, 0, 1, 1]],
            columns=['s2', 's1', 's3', 's4', 's5'],
            index=['o2', 'o1', 'o3', 'o4']).T

        gradient = pd.Series(
            [2, 1, 3, 4, 5],
            index=['s2', 's1', 's3', 's4', 's5'])

        exp_table = pd.DataFrame(
            [[1, 1, 0, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 0, 0, 1, 1]],
            columns=['s1', 's2', 's3', 's4', 's5'],
            index=['o1', 'o2', 'o3', 'o4']).T

        res_table = niche_sort(table, gradient)

        pdt.assert_frame_equal(exp_table, res_table)

    def test_basic_niche_sort_lambda(self):
        table = pd.DataFrame(
            [[1, 1, 0, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 0, 1, 1]],
            columns=['s1', 's2', 's3', 's4', 's5'],
            index=['o1', 'o3', 'o2', 'o4']).T
        gradient = pd.Series(
            [1, 2, 3, 4, 5],
            index=['s1', 's2', 's3', 's4', 's5'])

        exp_table = pd.DataFrame(
            [[1, 1, 0, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 0, 0, 1, 1]],
            columns=['s1', 's2', 's3', 's4', 's5'],
            index=['o1', 'o2', 'o3', 'o4']).T

        def _dumb_estimator(v, gradient):
            v[v > 0] = 1
            values = v / v.sum()
            return np.dot(gradient, values)

        res_table = niche_sort(table, gradient,
                               niche_estimator=_dumb_estimator)
        pdt.assert_frame_equal(exp_table, res_table)

    def test_basic_niche_sort_immutable(self):
        # Swap samples s1 and s2 and features o1 and o2 to see if this can
        # obtain the original table structure.
        table = pd.DataFrame(
            [[1, 0, 1, 0, 0],
             [1, 1, 0, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 0, 0, 1, 1]],
            columns=['s2', 's1', 's3', 's4', 's5'],
            index=['o2', 'o1', 'o3', 'o4']).T

        gradient = pd.Series(
            [2, 1, 3, 4, 5],
            index=['s2', 's1', 's3', 's4', 's5'])

        exp_table = pd.DataFrame(
            [[1, 0, 1, 0, 0],
             [1, 1, 0, 0, 0],
             [0, 0, 1, 1, 0],
             [0, 0, 0, 1, 1]],
            columns=['s2', 's1', 's3', 's4', 's5'],
            index=['o2', 'o1', 'o3', 'o4']).T

        exp_gradient = pd.Series(
            [2, 1, 3, 4, 5],
            index=['s2', 's1', 's3', 's4', 's5'])

        niche_sort(table, gradient)
        pdt.assert_frame_equal(exp_table, table)
        pdt.assert_series_equal(exp_gradient, gradient)


if __name__ == '__main__':
unittest.main()
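
Reviewer note: `_dumb_estimator` in `test_basic_niche_sort_lambda` assigns into its argument (`v[v > 0] = 1`), while `test_basic_niche_sort_immutable` checks that inputs are left untouched. A custom estimator can avoid the question by building a new sequence instead of writing into the one it receives. The sketch below is a hypothetical presence/absence estimator in plain Python; the name `presence_absence_niche` is an illustration, and whether it should ship with gneiss is outside the scope of this diff.

```python
def presence_absence_niche(abundances, gradient):
    """Mean gradient value over the samples where the feature is present.

    Builds a new 0/1 list rather than modifying `abundances` in place.
    Raises ZeroDivisionError if the feature is absent from every sample.
    """
    present = [1 if x > 0 else 0 for x in abundances]
    return sum(g * p for g, p in zip(gradient, present)) / sum(present)


abundances = [0.2, 0.0, 0.8, 0.0, 0.0]
print(presence_absence_niche(abundances, [1, 2, 3, 4, 5]))  # 2.0
print(abundances)  # unchanged: [0.2, 0.0, 0.8, 0.0, 0.0]
```

The function takes `gradient` as its second parameter so that it stays compatible with the `partial(niche_estimator, gradient=gradient)` call in `niche_sort`.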