Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

no skip_rows option for Excel files #19161

Open
TiburonEl opened this issue Oct 9, 2024 · 5 comments
Open

no skip_rows option for Excel files #19161

TiburonEl opened this issue Oct 9, 2024 · 5 comments
Labels
A-io-spreadsheet Area: reading/writing Excel/ODS files enhancement New feature or an improvement of an existing feature

Comments

@TiburonEl
Copy link

Description

I'm having trouble importing an Excel file where the first few rows contain merged cells. I considered skipping these rows during import, but unfortunately, there is no skip_rows option available for Excel files in Polars (only for CSV files).

Is there a way to skip these rows? If not, could this functionality be added in a future release?

@TiburonEl TiburonEl added the enhancement New feature or an improvement of an existing feature label Oct 9, 2024
@cmdlineluser
Copy link
Contributor

read_options={"skip_rows": ...} will get passed to the underlying engine1.

Footnotes

  1. http://fastexcel.toucantoco.dev/fastexcel.html#ExcelReader.load_sheet_by_name

@TiburonEl
Copy link
Author

TiburonEl commented Oct 9, 2024

From what I understand, this parameter can only be an integer, meaning I can skip just one row. What if I want to skip multiple rows? Imagine I was to skip row 1,2 and 5th

@avimallu
Copy link
Contributor

skip_rows is typically intended to ignore a certain number of rows before reading a full table in, because of the Excel file author's formatting preferences.

The operation you're describing is more of a filter, which is what I think you should be using with something like .filter(~pl.int_range(0, pl.len()).is_in(rows_to_skip).

@anapaulagomes
Copy link

I'd love to work on this issue.

@anapaulagomes
Copy link

This is what worked for me by testing the proposed solutions:

# with read_options
land = pl.read_excel("data/DTB_2022/RELATORIO_DTB_BRASIL_MUNICIPIO.xls", read_options={"header_row": 6})

# without read_options
land = pl.read_excel("data/DTB_2022/RELATORIO_DTB_BRASIL_MUNICIPIO.xls")
land.filter(~pl.int_range(0, pl.len()).lt(5))

I used header_row because it skips the rows and uses the row passed as header.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-io-spreadsheet Area: reading/writing Excel/ODS files enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

5 participants