Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
When executed on a DataFrame, rolling seems to select only certain columns for processing. For demonstration, I created a DataFrame with three columns (A, B, and C), of which the first contains Timedeltas and the other two contain numeric values. When using rolling, e.g. with sum, only the numeric columns are passed on.
Even stranger, when used in combination with apply, only the first numeric column is passed to the function, whereas I would have expected the corresponding slice of the DataFrame.
Code Sample, a copy-pastable example
import pandas as pd
columns = ["A", "B", "C"]
index = list(range(10))
data = [[10**10,2,3]]*len(index)
df = pd.DataFrame(columns = columns, index = index, data=data)
df["A"] = df["A"].apply(pd.to_timedelta)
The resulting df will look like this:
          A  B  C
0  00:00:10  2  3
1  00:00:10  2  3
2  00:00:10  2  3
3  00:00:10  2  3
4  00:00:10  2  3
5  00:00:10  2  3
6  00:00:10  2  3
7  00:00:10  2  3
8  00:00:10  2  3
9  00:00:10  2  3
Applying rolling with sum like this:
df.rolling(window=2).sum()
will result in the following output, in which the first column is missing:
     B    C
0  NaN  NaN
1  4.0  6.0
2  4.0  6.0
3  4.0  6.0
4  4.0  6.0
5  4.0  6.0
6  4.0  6.0
7  4.0  6.0
8  4.0  6.0
9  4.0  6.0
To demonstrate the problem with apply, I created a custom function that simply returns the number of columns (since I expected a DataFrame to be passed to the function):
def get_num_columns(sub_df):
    print(sub_df)
    return len(sub_df.columns)
df.rolling(window=2).apply(get_num_columns, raw=False)
This produces the exception "AttributeError: 'Series' object has no attribute 'columns'" and the following printout:
0 2.0
1 2.0
dtype: float64
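The printout is consistent with rolling.apply calling the function once per window and per column, handing it a Series rather than a DataFrame. A minimal sketch to check this, assuming a frame with only the numeric columns B and C (the helper name record_window and the list received are made up for illustration):

```python
import pandas as pd

# Only the numeric columns, so the call does not depend on how
# a given pandas version treats the Timedelta column.
df = pd.DataFrame({"B": [2] * 10, "C": [3] * 10})

received = []  # type names of the objects handed to the function


def record_window(window):
    # Hypothetical helper: remember what kind of object arrives.
    received.append(type(window).__name__)
    # rolling.apply expects a single numeric value back.
    return float(len(window))


result = df.rolling(window=2).apply(record_window, raw=False)
print(set(received))
print(result.head(3))
```

With raw=False every recorded object should be a Series of window length, which is why accessing .columns inside the function fails.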
Problem description
I would expect in both cases that the windowed DataFrame with all columns is used within the function (either sum or get_num_columns).
Expected Output
In the case of sum, I would expect either an exception telling the user that only numeric columns are supported or, preferably, the following output:
          A    B    C
0       NaT  NaN  NaN
1  00:00:20  4.0  6.0
2  00:00:20  4.0  6.0
3  00:00:20  4.0  6.0
4  00:00:20  4.0  6.0
5  00:00:20  4.0  6.0
6  00:00:20  4.0  6.0
7  00:00:20  4.0  6.0
8  00:00:20  4.0  6.0
9  00:00:20  4.0  6.0
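As a possible workaround (not a fix for the behaviour itself), the table above can be reproduced by converting the Timedelta column to seconds before rolling and converting the result back afterwards. This is only a sketch; the intermediate names as_seconds and rolled are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.to_timedelta([10] * 10, unit="s"),
        "B": [2] * 10,
        "C": [3] * 10,
    }
)

# Represent the Timedelta column as float seconds so that every
# column is numeric when rolling is applied.
as_seconds = df.assign(A=df["A"].dt.total_seconds())
rolled = as_seconds.rolling(window=2).sum()

# Convert the summed seconds back to Timedelta; NaN becomes NaT.
rolled["A"] = pd.to_timedelta(rolled["A"], unit="s")
print(rolled)
```

The round trip through float seconds can lose precision for very large or sub-microsecond durations, so it is a stopgap rather than a general solution.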
In the case of apply, I would have expected a DataFrame as input to the function. Therefore, the output of the function (without the prints) should be:
0 3
1 3
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 3
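For comparison, the output above can be produced today by slicing the windows manually and passing each sub-DataFrame to the function. The generator rolling_frames below is a hypothetical helper, not part of pandas, and unlike rolling it also yields the shorter window at the start:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.to_timedelta([10] * 10, unit="s"),
        "B": [2] * 10,
        "C": [3] * 10,
    }
)


def rolling_frames(frame, window):
    # Hypothetical helper: yield one positional slice per row,
    # each ending at that row and spanning up to `window` rows.
    for end in range(1, len(frame) + 1):
        start = max(0, end - window)
        yield frame.iloc[start:end]


def get_num_columns(sub_df):
    return len(sub_df.columns)


result = pd.Series(
    [get_num_columns(sub_df) for sub_df in rolling_frames(df, 2)],
    index=df.index,
)
print(result)
```

Each slice keeps all three columns, including the Timedelta column, so the function sees the full windowed DataFrame.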
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.76-linuxkit
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : 0.29.20
pytest : 5.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : None
xarray : 0.15.1
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0