Skip to content
This repository was archived by the owner on Sep 13, 2023. It is now read-only.

Fix stata saving #624

Merged
merged 3 commits into from
Mar 6, 2023
Merged

Fix stata saving #624

merged 3 commits into from
Mar 6, 2023

Conversation

aguschin
Copy link
Contributor

close #618

@aguschin aguschin self-assigned this Feb 28, 2023
@aguschin aguschin requested a review from a team February 28, 2023 13:55
@aguschin aguschin temporarily deployed to internal February 28, 2023 13:55 — with GitHub Actions Inactive
@aguschin aguschin requested a review from omesser February 28, 2023 13:55
@aguschin aguschin temporarily deployed to internal February 28, 2023 14:09 — with GitHub Actions Inactive
if has_index(df):
if (
has_index(df)
and PANDAS_FORMATS["stata"].write_func != self.write_func
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem occurs only with stata format, and only in the case described in original issue (i.e. when you import stata format file and want to re-write initial content - cause otherwise the dataframe will be saved in csv and no problem occurs at write). In this case there could be no real index (since stata format doesn't support saving index in some special way - it's going to be saved as a column). Then at import MLEM cannot have an index in this dataframe, which allows us to skip resetting it.

I can't see any solution better than creating a workaround like: if you have empty index, rename column to something like "__index__", and save the instruction to rename it on load back to "". Which is something I'd like to avoid now since we don't write to stata format anyway for now - only to csv (which means this problem just won't occur except for special case I mentioned in the first paragraph).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the original reporter in the issue, I don't completely know why we're resetting index here. Any idea @aguschin ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may use index in the model itself. E.g. the index may have some information used by the model directly, like customer id, timestamp, etc. So, to make things more reproducible and precise, we decided to keep the index. We argued about this with @mike0sv a while ago, so that's a decision made early on. We can debate again whether this is a bad practice and/or maybe change the default behavior: not store index by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. This is the sort of things usually code comments are used for - explain why we're doing non-obvious operations (can link to gh isues/discussions, etc)

Copy link
Contributor

@omesser omesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to add comments - why reset index, and now adding a comment for the condition linking to the issue

@aguschin aguschin temporarily deployed to internal March 6, 2023 06:03 — with GitHub Actions Inactive
@aguschin aguschin enabled auto-merge (squash) March 6, 2023 06:04
@codecov
Copy link

codecov bot commented Mar 6, 2023

Codecov Report

Patch coverage: 87.50% and project coverage change: -0.01 ⚠️

Comparison is base (1a1d203) 86.17% compared to head (5e799d2) 86.16%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #624      +/-   ##
==========================================
- Coverage   86.17%   86.16%   -0.01%     
==========================================
  Files         107      107              
  Lines        9705     9710       +5     
==========================================
+ Hits         8363     8367       +4     
- Misses       1342     1343       +1     
Impacted Files Coverage Δ
mlem/contrib/streamlit/_template.py 0.00% <0.00%> (ø)
mlem/contrib/pandas.py 93.78% <100.00%> (+0.08%) ⬆️
mlem/contrib/streamlit/server.py 66.98% <100.00%> (ø)
mlem/core/requirements.py 95.43% <0.00%> (-0.33%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@aguschin aguschin disabled auto-merge March 6, 2023 08:36
@aguschin aguschin merged commit 92782f5 into main Mar 6, 2023
@aguschin aguschin deleted the fix/stata branch March 6, 2023 08:36
# for free to subscribe to this conversation on GitHub. Already have an account? #.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reset_index inside write method of PandasFormat class crashes the reader for stata format
2 participants