Skip to content

BUG: _get_dtype accepts CategoricalDtype but throws TypeError with a Categorical #16817

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
topper-123 opened this issue Jul 3, 2017 · 7 comments · Fixed by #16887
Closed
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@topper-123
Copy link
Contributor

topper-123 commented Jul 3, 2017

Problem description

I find it inconistent that _get_dtype accept a CategoricalDtype, but will thow TypeError on Categoricals.

Code Sample, a copy-pastable example if possible

>>> cats = pd.Categorical([1,2,3])
>>> pd.core.dtypes.common._get_dtype(cats.dtype)
category
>>> pd.core.dtypes.common._get_dtype(cats)
TypeError: data type not understood

Expected Output

>>> cats = pd.Categorical([1,2,3])
>>> pd.core.dtypes.common._get_dtype(cats.dtype)
category
>>> pd.core.dtypes.common._get_dtype(cats)
category

Proposed solution

A solution could be to check for extension type (maybe just categorical dtype, not sure) + that arr_or_dtype has a dtype attribute:

def _get_dtypes(arr_or_dtype):
   ... 
    elif isinstance(arr_or_dtype, IntervalDtype):
        return arr_or_dtype   # old code
   elif is_extension_type(arr_or_dtype) and hasattr(arr_or_dtype, 'dtype'):   # new code
        return arr_or_dtype.dtype   # new code
    elif isinstance(arr_or_dtype, string_types):   # old code
      ...
@TomAugspurger
Copy link
Contributor

This is an internal method, do you actually need it for something?

@jreback
Copy link
Contributor

jreback commented Jul 4, 2017

  • We have no tests for these routines (either _get_dtype or _get_dtype_type). rather we have bunches of integration tests via all of the is_* routines. Would be nice to nail down what this actually takes / returns.
  • We should actually make _get_dtype_type just call _get_dtype (and have logic live there)
  • We should be much more consistent about calling _get_dtype_type rather than the more specific _get_dtype, IOW, all of the is_* routines should call _get_dtype_type (most of them do, but some just call _get_dtype).

So if we can nail down these issues, then can address what else / what should _get_dtype actually take. I think It should actually be quite a bit more strict. We are a bit loose in the calling code for this. Maybe just need to add some assertions. These are core routines which are called implicity a lot.

@jreback jreback added API Design Difficulty Intermediate Dtype Conversions Unexpected or buggy dtype conversions labels Jul 4, 2017
@jreback jreback added this to the Next Major Release milestone Jul 4, 2017
@jreback
Copy link
Contributor

jreback commented Jul 4, 2017

@topper-123 love to have some help here with this :>

@topper-123
Copy link
Contributor Author

Hi,

I think that could be relatively easy, and could give it a shot. Some issues, though:

  • I can't get pandas.test() to run without breaking. Likewise, I've tried py.test pandas/tests/dtypes/test_common.py after downloading the source code from github, but it fails. That makes me feel a bit apprehensive.
  • I don't know C or Cython and can't prioritoze learning ATM, are those required?
  • Some issue in _get_dtype are strange and I'm not sure the function always returns a correct result. E.g.
     >>> pd.core.dtypes.common._get_dtype(pd.Categorical([1,2,3]))
     TypeError: data type not understood
     >>> pd.core.dtypes.common._get_dtype(pd.Categorical([1,2,3]).dtype)
     category
    That is, the first one fail, while the second one succeeds, which IMO doesn't make sense (but maybe there is an aspect that I don't understand?). And because I can't get the tests to run, I can't poke around the issue so easily.

In short, if I can get some help on how to get the tests to run cleanly, and C/Cython knowledge isn't required, I'll give this a go. I'm on Windows 10 and use the Anaconda distribution.

@jreback
Copy link
Contributor

jreback commented Jul 5, 2017

have a look here: http://pandas.pydata.org/pandas-docs/stable/contributing.html

pretty complete instructions on how to run a dev environment on windows. its pretty straightforward, but if you need help pls ping.

@topper-123
Copy link
Contributor Author

And the fact that _get_dtype fails on categoricals, isn't that a bug (or a insufficient implementation)? It seems unfortunate that calling _get_dtype with a Pandas categorical should fail.

That is, should it be marked in the tests as ok or a failure that _get_dtype raises when given a categorical?

@jreback
Copy link
Contributor

jreback commented Jul 6, 2017

its reasonable to accept a Categorical instance I think.
We want the consitency for ndarrays, e.g. we can accept an ndarray, or its dtype. Scalars raise.

In [8]: from pandas.core.dtypes.common import _get_dtype

In [9]: _get_dtype(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-d4be99dce80c> in <module>()
----> 1 _get_dtype(1)

/Users/jreback/pandas/pandas/core/dtypes/common.py in _get_dtype(arr_or_dtype)
   1725     if hasattr(arr_or_dtype, 'dtype'):
   1726         arr_or_dtype = arr_or_dtype.dtype
-> 1727     return np.dtype(arr_or_dtype)
   1728 
   1729 

TypeError: data type not understood

In [10]: _get_dtype(np.array([1]))
Out[10]: dtype('int64')

In [11]: _get_dtype(np.array([1]).dtype)
Out[11]: dtype('int64')

In [12]: _get_dtype(Series(np.array([1])))
Out[12]: dtype('int64')

@jreback jreback modified the milestones: 0.21.0, Next Major Release Jul 13, 2017
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants