Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Coercing dates to strings adds time of the form '00:00:00' #252

Open
severinh opened this issue Jul 17, 2024 · 2 comments
Open

Coercing dates to strings adds time of the form '00:00:00' #252

severinh opened this issue Jul 17, 2024 · 2 comments
Labels
🗒️ calamine 🗒️ Calamine-related issue

Comments

@severinh
Copy link

severinh commented Jul 17, 2024

How to reproduce

Suppose you have an Excel file with the following mixed-type data:

Header
=DATE(2024, 7, 1)
=DATE(2024, 7, 2)
"some string"

Now lets read this Excel file into a Polars dataframe, which coerces the column to strings:
excel_reader.load_sheet(0).to_polars()

This produces the following data frame:

Header
"2024-07-01 00:00:00"
"2024-07-02 00:00:00"
"some string"

Expected behavior

What I would have expected fastexcel to not include the time 00:00:00 for these dates. That is,

Header
"2024-07-01"
"2024-07-02"
"some string"

Test case

Excel sheet: sheet-date.xlsx

def test_date_casting_to_string() -> None:
    excel_reader = fastexcel.read_excel(path_for_fixture("sheet-date.xlsx"))
    sheet = excel_reader.load_sheet(0, column_names=["col1"])
    expected = {
        "col1": ["2024-07-01", "2024-07-02", "some string"],
    }
    pl_assert_frame_equal(sheet.to_polars(), pl.DataFrame(expected))
@PrettyWood
Copy link
Member

Yup the issue is that it's detected as datetime by calamine. We can try to do a as_date() afterwards but it would be better if calamine could handle directly dates.
I'll dig into into it this weekend but it won't be a quick fix like the bool to string. I'll probably need to open a PR on calamine side

@severinh
Copy link
Author

severinh commented Jul 17, 2024

Thanks @PrettyWood! I suspected that this one would not be as straight-forward.

In any case, no pressure on this. In contrast to #250, this issue here is not blocking our adoption of fastexcel. We can work around this for now by trimming 00:00:00 as a post-processing step. Something like:

sheet_df.with_columns(pl.all().str.replace_all(r"^([0-9]{4}-[0-9]{2}-[0-9]{2}) 00:00:00$", "$1"))

@lukapeschke lukapeschke added the 🗒️ calamine 🗒️ Calamine-related issue label Jul 18, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
🗒️ calamine 🗒️ Calamine-related issue
Projects
None yet
Development

No branches or pull requests

3 participants