fix: read_csv with both index_col and use_cols inconsistent with pandas #1785

Merged
merged 4 commits into from
Jun 6, 2025

Conversation

chelsea-lin
Contributor

Fixes internal issue 408499371 🦕
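For reference, the pandas behavior the title refers to can be sketched as follows. This is a minimal illustration using hypothetical sample data (`csv_text` is not from the PR); note that the pandas keyword is spelled `usecols`. When a column label appears in both `index_col` and `usecols`, pandas uses it as the index and leaves only the remaining selected columns as data columns.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the PR.
csv_text = "id,name,score\n1,a,0.5\n2,b,0.7\n"

# Select two columns and promote one of them to the index.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="id",
    usecols=["id", "name"],
)

print(df.index.name)     # id
print(list(df.columns))  # ['name']
```

The unselected `score` column is dropped, and `id` no longer appears among the data columns because it became the index.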

@product-auto-label product-auto-label bot added the labels size: m (Pull request size is medium.) and api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API.) May 30, 2025
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_readcsv branch from 032f193 to 8d6d9ee Compare May 30, 2025 23:46
@chelsea-lin chelsea-lin marked this pull request as ready for review May 30, 2025 23:46
@chelsea-lin chelsea-lin requested review from a team as code owners May 30, 2025 23:46
@chelsea-lin chelsea-lin requested a review from tswast May 30, 2025 23:46
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_readcsv branch 2 times, most recently from 0785cf8 to 344d6c9 Compare June 3, 2025 22:52
index_col=index_col,
columns=columns,
names=names,
is_index_in_columns=True,
Collaborator

I'm a bit confused by this parameter name. Wouldn't the read_gbq_table function be able to figure out that the index columns are present already?

Contributor Author

Renamed to index_col_in_columns and added a docstring.

@@ -96,7 +96,9 @@ def _to_index_cols(
return index_cols


-def _check_column_duplicates(index_cols: Iterable[str], columns: Iterable[str]):
+def _check_column_duplicates(
+    index_cols: Iterable[str], columns: Iterable[str], is_index_in_columns: bool
Collaborator

After looking at the logic, I still don't understand the is_index_in_columns name. If there isn't a better name, could we at least add some docstrings with more information?

Contributor Author

Renamed to index_col_in_columns and added a docstring.
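One plausible reading of the flag discussed in this thread can be sketched in plain Python. This is an assumption for illustration only, not the merged BigFrames implementation: the function name `check_column_duplicates_sketch`, its return value, and its error messages are hypothetical. The sketch assumes that when `index_col_in_columns` is true, every index column must also appear in `columns` (as with pandas `usecols`) and is then stripped from the data columns; when it is false, the two lists must not overlap.

```python
from typing import Iterable, List


def check_column_duplicates_sketch(
    index_cols: Iterable[str],
    columns: Iterable[str],
    index_col_in_columns: bool,
) -> List[str]:
    """Hypothetical sketch; not the merged BigFrames implementation.

    Returns the data columns that remain after validation.
    """
    index_list = list(index_cols)
    column_list = list(columns)

    # Neither list may repeat a name on its own.
    for label, names in (("index_cols", index_list), ("columns", column_list)):
        if len(names) != len(set(names)):
            raise ValueError(f"Duplicate names in {label}: {names}")

    if index_col_in_columns:
        # Assumed pandas-style semantics: index columns are selected via
        # `columns`, so each one must be present there.
        missing = [name for name in index_list if name not in column_list]
        if missing:
            raise ValueError(f"Index columns not found in columns: {missing}")
        return [name for name in column_list if name not in index_list]

    # Otherwise the two lists are expected to be disjoint.
    overlap = [name for name in index_list if name in column_list]
    if overlap:
        raise ValueError(f"Columns used as both index and data: {overlap}")
    return column_list
```

Under these assumptions, `check_column_duplicates_sketch(["id"], ["id", "name"], True)` keeps only `name` as a data column, while the same call with `False` raises a `ValueError` because `id` appears in both lists.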


# BigFrames requires `sort_index()` because BigQuery doesn't preserve row IDs
# (b/280889935) or guarantee row ordering.
bf_df = bf_df.sort_index()
Collaborator

Don't we sort by the index already if we determine it's unique?

Contributor Author

Good catch! Removed it from all similar tests.
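For context, a plain-pandas sketch of what a `sort_index()` call in such a test guards against: two frames holding the same rows in a different order do not compare equal until both are ordered by the index. The data here is hypothetical, and the sketch does not exercise BigFrames; the reviewer's point that BigFrames already orders by a unique index is taken from the comment above, not demonstrated here.

```python
import pandas as pd

# Hypothetical expected result with a unique index named "id".
expected = pd.DataFrame(
    {"name": ["a", "b", "c"]},
    index=pd.Index([1, 2, 3], name="id"),
)

# The same rows arriving in a different order, as an unordered backend might return them.
shuffled = expected.loc[[3, 1, 2]]

# Row order matters for equality checks.
assert not shuffled.equals(expected)

# Sorting by the unique index restores a deterministic order.
pd.testing.assert_frame_equal(shuffled.sort_index(), expected)
```

If the result is already ordered by its unique index, the extra sort is a no-op, which is consistent with removing it from the tests.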

@chelsea-lin chelsea-lin force-pushed the main_chelsealin_readcsv branch from 344d6c9 to 7e59b20 Compare June 5, 2025 17:46
@chelsea-lin chelsea-lin requested a review from tswast June 5, 2025 17:47
@chelsea-lin chelsea-lin added the kokoro:run (Add this label to force Kokoro to re-run the tests.) label Jun 5, 2025
@bigframes-bot bigframes-bot removed the kokoro:run (Add this label to force Kokoro to re-run the tests.) label Jun 6, 2025
Collaborator

@tswast tswast left a comment

Thanks!

@tswast tswast enabled auto-merge (squash) June 6, 2025 20:19
@tswast tswast merged commit ba7c313 into main Jun 6, 2025
17 of 24 checks passed
@tswast tswast deleted the main_chelsealin_readcsv branch June 6, 2025 20:24
Labels
api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API.), size: m (Pull request size is medium.)
4 participants