Fix stata saving #624
mlem/contrib/pandas.py
Outdated
- if has_index(df):
+ if (
+     has_index(df)
+     and PANDAS_FORMATS["stata"].write_func != self.write_func
This problem occurs only with the stata format, and only in the case described in the original issue, i.e. when you import a stata file and want to rewrite its initial content. Otherwise the dataframe is saved as csv and no problem occurs at write time. In this case there can be no real index: the stata format doesn't support saving an index in any special way, so it is saved as a regular column. That means a dataframe imported by MLEM from stata cannot have an index, which allows us to skip resetting it.
The only alternative I can see is a workaround like: if you have an empty index, rename the column to something like "__index__", and save an instruction to rename it back to "" on load. I'd like to avoid that for now, since we don't write to the stata format anyway, only to csv, which means this problem won't occur except in the special case mentioned in the first paragraph.
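To illustrate the point about stata having no separate index, here is a minimal sketch using plain pandas rather than MLEM code. The data and the column names `customer_id` and `value` are made up for the example. It writes a dataframe with a named index to a `.dta` file and reads it back:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: the index carries customer ids.
df = pd.DataFrame(
    {"value": [1.5, 2.5, 3.5]},
    index=pd.Index([101, 102, 103], name="customer_id"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.dta")
    # write_index=True (the pandas default) stores the index as a regular column.
    df.to_stata(path, write_index=True)
    restored = pd.read_stata(path)

# After the round trip the former index is just another column.
restored_columns = list(restored.columns)
restored_ids = [int(v) for v in restored["customer_id"]]
print(restored_columns)
print(restored.index.name)
```

After reading, `customer_id` comes back as an ordinary column and the dataframe has a default, unnamed index, so there is nothing left to reset before writing it again.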
Like the original reporter in the issue, I don't completely understand why we're resetting the index here. Any idea, @aguschin?
You may use the index in the model itself. For example, the index may hold information the model uses directly, like a customer id or a timestamp. So, to make things more reproducible and precise, we decided to keep the index. We argued about this with @mike0sv a while ago, so this was a decision made early on. We can debate again whether this is bad practice, and maybe change the default behavior so that the index is not stored by default.
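A rough sketch of the idea in plain pandas, not the actual MLEM implementation: resetting the index before writing turns it into a column, so it survives a format that has no notion of an index, and it can be restored on load. The names `customer_id` and `value` are assumptions made up for the example.

```python
import io

import pandas as pd

# Hypothetical data: the index holds customer ids the model may rely on.
df = pd.DataFrame(
    {"value": [1.5, 2.5, 3.5]},
    index=pd.Index([101, 102, 103], name="customer_id"),
)

buffer = io.StringIO()
# Move the index into a column so it is written along with the data.
df.reset_index().to_csv(buffer, index=False)

buffer.seek(0)
# "customer_id" is known here only because we chose it above; a real
# loader would need to record which columns were index levels.
restored = pd.read_csv(buffer).set_index("customer_id")

print(restored)
```

Without the `reset_index()` and `set_index()` pair, the customer ids would either be dropped or come back as an unlabeled extra column, depending on how the file is written.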
Makes sense. This is the sort of thing code comments are usually for: explaining why we're doing non-obvious operations (they can link to GitHub issues, discussions, etc.).
I suggest adding comments: one explaining why we reset the index, and now also one for the new condition, linking to the issue.
Codecov Report
Patch coverage:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #624      +/-   ##
==========================================
- Coverage   86.17%   86.16%   -0.01%
==========================================
  Files         107      107
  Lines        9705     9710       +5
==========================================
+ Hits         8363     8367       +4
- Misses       1342     1343       +1
close #618