BUG: Joining a DataFrame with a PeriodIndex fails #16541

Closed
max-sixty opened this issue May 30, 2017 · 8 comments · Fixed by #16586
Labels
Bug Period Period data type Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Comments

@max-sixty
Contributor

Code Sample

In [19]: dates = pd.period_range('20100101','20100105', freq='D')

In [20]: weights = pd.DataFrame(np.random.randn(5, 5), index=dates, columns = ['g1_%d' % x for x in range(5)])

In [21]: weights.join(pd.DataFrame(np.random.randn(5,5), index=dates, columns = ['g2_%d' % x for x in range(5)]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-2fdb8b02f5a4> in <module>()
      1 weights.join(
----> 2             pd.DataFrame(np.random.randn(5,5), index=dates, columns = ['g2_%d' % x for x in range(5)]))

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
   4765         # For SparseDataFrame's benefit
   4766         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4767                                  rsuffix=rsuffix, sort=sort)
   4768
   4769     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   4780             return merge(self, other, left_on=on, how=how,
   4781                          left_index=on is None, right_index=True,
-> 4782                          suffixes=(lsuffix, rsuffix), sort=sort)
   4783         else:
   4784             if on is not None:

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     52                          right_index=right_index, sort=sort, suffixes=suffixes,
     53                          copy=copy, indicator=indicator)
---> 54     return op.get_result()
     55
     56

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in get_result(self)
    567                 self.left, self.right)
    568
--> 569         join_index, left_indexer, right_indexer = self._get_join_info()
    570
    571         ldata, rdata = self.left._data, self.right._data

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in _get_join_info(self)
    720             join_index, left_indexer, right_indexer = \
    721                 left_ax.join(right_ax, how=self.how, return_indexers=True,
--> 722                              sort=self.sort)
    723         elif self.right_index and self.how == 'left':
    724             join_index, left_indexer, right_indexer = \

TypeError: join() got an unexpected keyword argument 'sort'

It seems PeriodIndex.join() does not accept the sort kwarg, but the merge internals pass it in regardless.
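To illustrate the failure mode without depending on a particular pandas version, here is a minimal pure-Python sketch. The class and function names (`BaseIndex`, `NarrowIndex`, `join_like_merge`, `try_join`) are hypothetical stand-ins, not pandas APIs; they only mirror the shape of the call chain in the traceback above.

```python
class BaseIndex:
    """Hypothetical stand-in for a base index whose join() accepts sort."""

    def join(self, other, how="left", return_indexers=False, sort=False):
        return "joined"


class NarrowIndex(BaseIndex):
    """Hypothetical stand-in for a subclass whose override omits sort."""

    def join(self, other, how="left", return_indexers=False):
        return super().join(other, how=how, return_indexers=return_indexers)


def join_like_merge(left, right, sort=False):
    # The caller forwards sort unconditionally, the way the traceback
    # shows _get_join_info calling left_ax.join(..., sort=self.sort).
    return left.join(right, how="left", return_indexers=True, sort=sort)


def try_join(left, right):
    """Return the join result, or the TypeError message if the call fails."""
    try:
        return join_like_merge(left, right)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


base_result = try_join(BaseIndex(), BaseIndex())
narrow_result = try_join(NarrowIndex(), NarrowIndex())
```

The base class handles the call, while the narrower override rejects the `sort` keyword, which is the same kind of signature mismatch reported here.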

Output of pd.show_versions()

In [22]: pd.show_versions()
/usr/local/lib/python2.7/dist-packages/xarray/core/formatting.py:16: FutureWarning: The pandas.tslib module is deprecated and will be removed in a future version.
  from pandas.tslib import OutOfBoundsDatetime

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.13-moby
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.2
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: 0.1.6
pandas_datareader: None

@jreback
Contributor

jreback commented May 31, 2017

This fixes it, if you can do a PR:

diff --git a/pandas/core/indexes/period.py b/pandas/core/indexes/period.py
index 15fd9b7..50d5958 100644
--- a/pandas/core/indexes/period.py
+++ b/pandas/core/indexes/period.py
@@ -919,14 +919,16 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
                               self[loc:].asi8))
         return self._shallow_copy(idx)
 
-    def join(self, other, how='left', level=None, return_indexers=False):
+    def join(self, other, how='left', level=None, return_indexers=False,
+             sort=False):
         """
         See Index.join
         """
         self._assert_can_do_setop(other)
 
         result = Int64Index.join(self, other, how=how, level=level,
-                                 return_indexers=return_indexers)
+                                 return_indexers=return_indexers,
+                                 sort=sort)
 
         if return_indexers:
             result, lidx, ridx = result

Obviously we need some more tests on the index join methods as well :>

Here is the test for datetimes in pandas/tests/indexes/datetimes/test_datetimes.py;
we need something like this in periods/test_period.py:

    def test_join_self(self):
        index = date_range('1/1/2000', periods=10)
        kinds = 'outer', 'inner', 'left', 'right'
        for kind in kinds:
            joined = index.join(index, how=kind)
            assert index is joined
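A rough sketch of what the period analogue might look like, together with a regression check for the DataFrame join from the original report. The function names are made up for illustration, and the checks use `.equals` rather than identity, since whether `join` returns the very same object is an implementation detail I am not assuming here.

```python
import pandas as pd


def test_period_join_self():
    # Join a PeriodIndex with itself for every join kind.
    index = pd.period_range("2000-01-01", periods=10, freq="D")
    for kind in ("outer", "inner", "left", "right"):
        joined = index.join(index, how=kind)
        assert joined.equals(index)


def test_period_frame_join():
    # DataFrame.join on a shared PeriodIndex, as in the original report.
    dates = pd.period_range("2010-01-01", "2010-01-05", freq="D")
    left = pd.DataFrame({"g1": range(5)}, index=dates)
    right = pd.DataFrame({"g2": range(5)}, index=dates)
    joined = left.join(right)
    assert list(joined.columns) == ["g1", "g2"]
    assert joined.index.equals(dates)
```

On a pandas build that includes the fix, both functions should pass; on 0.20.1 the second one would be expected to raise the TypeError shown in the report.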

@jreback added the Bug, Difficulty Novice, Period, and Reshaping labels on May 31, 2017
@jreback added this to the 0.20.2 milestone on May 31, 2017
@jreback
Contributor

jreback commented May 31, 2017

If you can do this in the next day or two, it can get into 0.20.2 (end of week).

@rosygupta

@jreback Is this issue still open?

@max-sixty
Contributor Author

PR waiting here: #16586

@Dr-Irv
Contributor

Dr-Irv commented Jun 8, 2017

So I applied the fix suggested above by @jreback to my 0.20.1 install, and that fixed the problem for me, but then a different one cropped up. Not sure if I should put this in a different issue. In my use case, I converted dates to monthly periods, and there are duplicates. Here is a way to reproduce it:

perindex = pd.period_range('2016-01-01', periods=16, freq='M')
perdf = pd.DataFrame([i for i in range(len(perindex))],
                     index=perindex, columns=['pnum'])
df2 = pd.concat([perdf, perdf])
perdf.merge(df2, left_index=True, right_index=True, how='outer')

This produces the following traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-45-a9a1ea5d6a78> in <module>()
      1 df2 = pd.concat([perdf, perdf])
----> 2 perdf.merge(df2, left_index=True, right_index=True, how='outer')

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
   4818                      right_on=right_on, left_index=left_index,
   4819                      right_index=right_index, sort=sort, suffixes=suffixes,
-> 4820                      copy=copy, indicator=indicator)
   4821 
   4822     def round(self, decimals=0, *args, **kwargs):

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     52                          right_index=right_index, sort=sort, suffixes=suffixes,
     53                          copy=copy, indicator=indicator)
---> 54     return op.get_result()
     55 
     56 

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
    567                 self.left, self.right)
    568 
--> 569         join_index, left_indexer, right_indexer = self._get_join_info()
    570 
    571         ldata, rdata = self.left._data, self.right._data

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _get_join_info(self)
    720             join_index, left_indexer, right_indexer = \
    721                 left_ax.join(right_ax, how=self.how, return_indexers=True,
--> 722                              sort=self.sort)
    723         elif self.right_index and self.how == 'left':
    724             join_index, left_indexer, right_indexer = \

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\indexes\period.py in join(self, other, how, level, return_indexers, sort)
    927 
    928         result = Int64Index.join(self, other, how=how, level=level,
--> 929                                  return_indexers=return_indexers, sort=sort)
    930 
    931         if return_indexers:

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\indexes\base.py in join(self, other, how, level, return_indexers, sort)
   2995             else:
   2996                 return self._join_non_unique(other, how=how,
-> 2997                                              return_indexers=return_indexers)
   2998         elif self.is_monotonic and other.is_monotonic:
   2999             try:

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\indexes\base.py in _join_non_unique(self, other, how, return_indexers)
   3076         left_idx, right_idx = _get_join_indexers([self.values],
   3077                                                  [other._values], how=how,
-> 3078                                                  sort=True)
   3079 
   3080         left_idx = _ensure_platform_int(left_idx)

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
    980 
    981     # get left & right join labels and num. of levels at each location
--> 982     llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
    983 
    984     # get flat i8 keys from label lists

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _factorize_keys(lk, rk, sort)
   1409     if sort:
   1410         uniques = rizer.uniques.to_array()
-> 1411         llab, rlab = _sort_labels(uniques, llab, rlab)
   1412 
   1413     # NA group

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _sort_labels(uniques, left, right)
   1435     labels = np.concatenate([left, right])
   1436 
-> 1437     _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
   1438     new_labels = _ensure_int64(new_labels)
   1439     new_left, new_right = new_labels[:l], new_labels[l:]

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
    476     if compat.PY3 and lib.infer_dtype(values) == 'mixed-integer':
    477         # unorderable in py3 if mixed str/int
--> 478         ordered = sort_mixed(values)
    479     else:
    480         try:

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\algorithms.py in sort_mixed(values)
    469         str_pos = np.array([isinstance(x, string_types) for x in values],
    470                            dtype=bool)
--> 471         nums = np.sort(values[~str_pos])
    472         strs = np.sort(values[str_pos])
    473         return _ensure_object(np.concatenate([nums, strs]))

C:\Anaconda3\envs\py36\lib\site-packages\numpy\core\fromnumeric.py in sort(a, axis, kind, order)
    820     else:
    821         a = asanyarray(a).copy(order="K")
--> 822     a.sort(axis=axis, kind=kind, order=order)
    823     return a
    824 

pandas\_libs\period.pyx in pandas._libs.period._Period.__richcmp__ (pandas\_libs\period.c:12067)()

TypeError: Cannot compare type 'Period' with type 'int'

Let me know if I should open up a new issue, given that this bug happens when applying the above fix.
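In the meantime, one possible workaround (my own assumption, not something proposed or verified elsewhere in this thread) is to convert both frames to timestamps, merge on the resulting DatetimeIndex, and convert back to periods. The helper name `merge_via_timestamps` is hypothetical; `DataFrame.to_timestamp` and `DataFrame.to_period` are the only pandas calls it relies on beyond `merge`.

```python
import pandas as pd

perindex = pd.period_range("2016-01-01", periods=16, freq="M")
perdf = pd.DataFrame({"pnum": range(len(perindex))}, index=perindex)
df2 = pd.concat([perdf, perdf])  # every period now appears twice


def merge_via_timestamps(left, right, freq, **merge_kwargs):
    """Hypothetical helper: index-merge two PeriodIndex frames via timestamps."""
    merged = left.to_timestamp().merge(
        right.to_timestamp(),
        left_index=True,
        right_index=True,
        **merge_kwargs
    )
    # freq is passed explicitly because the merged index contains
    # duplicates, so it cannot be inferred.
    return merged.to_period(freq)


result = merge_via_timestamps(perdf, df2, freq="M", how="outer")
```

This sidesteps the Period comparison during sorting, at the cost of two index conversions.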

@max-sixty
Contributor Author

Do you get the error when running on that PR?

If so, I would open a new issue.

@Dr-Irv
Contributor

Dr-Irv commented Jun 8, 2017

@MaximilianR I hand-edited pandas 0.20.1 to implement what is in the PR, and got the error. To test it against all PRs, I think I'd need that PR to be merged into master, and then I can pull master and test.

@jorisvandenbossche added the Regression label on Jun 8, 2017
@jorisvandenbossche modified the milestones: 0.20.3, 0.21.0 on Jun 8, 2017
@max-sixty
Contributor Author

Great!

FYI, you can pull someone's PR for convenience, rather than hand-editing.

5 participants