Implement Series.cov #1620
base: master
Changes from 1 commit
@@ -4858,6 +4858,54 @@ def mad(self):

        return mad

    def cov(self, other: "Series", min_periods: Optional[int] = None) -> float:
        """
        Return the covariance between two series.

        Parameters
        ----------
        other : Series
        min_periods : int
Review comment: Maybe

        Examples
        --------
itholic marked this conversation as resolved.
        >>> s1 = ks.Series([1, 2, 3, 4])
        >>> s2 = ks.Series([5, 6, 7, 8])
        >>> s1
        0    1
        1    2
        2    3
        3    4
        Name: 0, dtype: int64

        >>> s2
        0    5
        1    6
        2    7
        3    8
        Name: 0, dtype: int64

        >>> s1.cov(s2)
        1.666666...
        """

        if not isinstance(other, Series):
            raise ValueError("'other' must be a Series")

        if len(self.index) != len(other.index):
            raise ValueError("series are not aligned")
Comment on lines +4892 to +4893

Review comment: Where is this from? pandas seems to work even when the Series have different lengths.
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5

Reply: Oops, I missed it. Thanks, @ueshin.

Reply: Mmm, this is interesting. It seems pandas aligns the two series before computing the covariance. So this:
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5
and this:
>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5
are equivalent... I believe this

Reply: @lopez- Could you file the issue for
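To make the alignment behavior described in this thread concrete, here is a minimal pure-Python sketch. It is not the pandas or Koalas implementation: the helper name `aligned_cov` and the dict-based stand-in for a Series (index label mapped to value) are assumptions made for illustration only. It pairs values by shared index label and then computes the sample covariance over those pairs.

```python
import math


def aligned_cov(left: dict, right: dict) -> float:
    """Sample covariance over the index labels shared by both mappings.

    Hypothetical helper: `left` and `right` map index label -> value and
    stand in for two Series.
    """
    # Keep only the labels present on both sides.
    shared = sorted(set(left) & set(right))
    n = len(shared)
    if n < 2:
        # Fewer than two paired observations: covariance is undefined.
        return math.nan
    xs = [left[k] for k in shared]
    ys = [right[k] for k in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


four_values = dict(enumerate([1, 2, 3, 4]))  # labels 0..3
two_values = dict(enumerate([5, 6]))         # labels 0..1

# Only labels 0 and 1 are shared, so both calls see the pairs (1, 5), (2, 6).
result_long = aligned_cov(four_values, two_values)
result_short = aligned_cov(dict(enumerate([1, 2])), two_values)
```

Under these assumptions both calls reduce to the same two pairs, which matches the observation above that the two pandas calls return the same value.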

        min_periods = 0 if min_periods is None else min_periods
        if len(self.index) < min_periods or len(self.index) <= 1:
Review comment: We should also compare
>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan
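One way to read the `min_periods` case above is that the guard should look at the number of paired observations rather than at one side's length. The sketch below illustrates that reading in plain Python. The helper `cov_with_min_periods`, the dict-based Series stand-in, and the default threshold of 1 are all assumptions for illustration, not the Koalas or pandas code.

```python
import math


def cov_with_min_periods(left, right, min_periods=None):
    """Hypothetical helper: sample covariance with a min_periods guard.

    `left` and `right` map index label -> value and stand in for two Series.
    """
    # Labels present on both sides form the paired observations.
    shared = sorted(set(left) & set(right))
    n = len(shared)
    required = 1 if min_periods is None else min_periods
    # Too few pairs for the caller's threshold, or for a sample covariance.
    if n < required or n < 2:
        return math.nan
    mean_l = sum(left[k] for k in shared) / n
    mean_r = sum(right[k] for k in shared) / n
    return sum((left[k] - mean_l) * (right[k] - mean_r) for k in shared) / (n - 1)


shorter = {0: 1, 1: 2}
longer = {0: 5, 1: 6, 2: 7, 3: 8}

guarded = cov_with_min_periods(shorter, longer, min_periods=3)  # 2 pairs < 3
unguarded = cov_with_min_periods(shorter, longer)               # 2 pairs are enough
```

With `min_periods=3` only two pairs are available, so the sketch returns NaN, in line with the pandas output quoted in the comment.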
            return np.nan

        if same_anchor(self, other):
            # if they have the same anchor, use the more performant Spark-native `cov`
            return self._internal.spark_frame.cov(self.name, other.name)
Review comment:
self._kdf._internal.resolved_copy.spark_frame.cov(
    self._internal.data_spark_column_names[0],
    other._internal.data_spark_column_names[0])
? FYI:
        else:
            # if not on the same anchor, calculate the covariance manually
            return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1)
Review comment: What do you think about assigning a proper variable and reusing it?
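For reference, the manual branch of `cov` in this diff corresponds to the usual sample covariance with an `n - 1` denominator. In the notation below, which is introduced here for illustration, $x_i$ and $y_i$ are the values of `self` and `other`, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is `len(self.index)`:

```latex
\[
  \operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})\,(y_i - \bar{y})
\]
```

The sum is the dot product `(self - self.mean()).dot(other - other.mean())`, and the formula is only meaningful when both series contribute the same $n$ paired values.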
Comment on lines +4903 to +4904

Review comment: Maybe we should create a new DataFrame and use it, something like:
kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_periods=min_periods)
I haven't checked the code, so please modify it as needed to make it work. By the way, we should do this at the beginning of this method to avoid the extra checks for length and so on.

    def unstack(self, level=-1):
        """
        Unstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame.

@@ -948,6 +948,32 @@ def test_series_repeat(self):
        else:
            self.assert_eq(kser1.repeat(kser2).sort_index(), pser1.repeat(pser2).sort_index())

    def test_cov(self):
        kser = ks.Series([90, 91, 85])
        pser = kser.to_pandas()
        kser_other = ks.Series([90, 91, 85])
        pser_other = kser_other.to_pandas()
Review comment: Please define the pandas object first.
pser = pd.Series([90, 91, 85])
kser = ks.from_pandas(pser)

        self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)

        kser = ks.Series([90])
        pser = kser.to_pandas()
        kser_other = ks.Series([85])
        pser_other = kser_other.to_pandas()

        k_isnan = np.isnan(kser.cov(kser_other))
        p_isnan = np.isnan(pser.cov(pser_other))
        self.assert_eq(k_isnan, p_isnan)

        kser = ks.Series([90, 91, 85])
        pser = kser.to_pandas()
        kser_other = ks.Series([90, 91, 85])
        pser_other = kser_other.to_pandas()

        k_isnan = np.isnan(kser.cov(kser_other, 4))
        p_isnan = np.isnan(pser.cov(pser_other, 4))
        self.assert_eq(k_isnan, p_isnan)

Review comment: Can we have a test where each Series has a different index, and an exception case? For example,
kser = ks.Series([90, 91, 85], index=[1, 2, 3])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
pser_other = kser_other.to_pandas()
self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)
and
self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85]))  # 'other' must be a Series
self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90])))  # series are not aligned


class OpsOnDiffFramesDisabledTest(ReusedSQLTestCase, SQLTestUtils):
    @classmethod

Review comment: Shall we just copy the docstring from pandas, with a few modifications to the examples?