Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Get locations of cells with invalid types #308

Open
noctuid opened this issue Oct 21, 2024 · 3 comments
Open

Get locations of cells with invalid types #308

noctuid opened this issue Oct 21, 2024 · 3 comments
Labels
feature request question Further information is requested 🗒️ calamine 🗒️ Calamine-related issue
Milestone

Comments

@noctuid
Copy link

noctuid commented Oct 21, 2024

On 0.12.0, I'm not seeing an error if there is a row with e.g."foo" for a column that is set to "date" in "dtypes". That cell just is read as null. A string in a float column is also just converted to null. Is this expected? It would be great there was some way to get the locations of all failed cells, so that we could give the user back a file with bad cells highlighted.

@lukapeschke lukapeschke added the question Further information is requested label Oct 22, 2024
@lukapeschke
Copy link
Collaborator

It is the expected behavior, yes. The rationale is that if the dtype is specified, the user knows what kind of data is contained in the column.

I reckon it'd be nice to at least have a log in case the cell is of an unexpected type, but that would require work on calamine's side. We could probably add some methods to the DataType trait which would return Result<Option<_>>, with an error in case the cell is not empty, but isn't of the expected dtype either.

However we'd need to think a bit about this, since we don't want to affect performance (maybe through a new warn_on_unexpected_dtype parameter ?).

@PrettyWood what are your thoughts on this ?

@noctuid
Copy link
Author

noctuid commented Oct 22, 2024

So does calamine just currently silently convert invalid data to nulls? Ideally, if bad data is converted to null, we'd need some way where we could get/add an extra bool column for each column to the dataframe to signify whether the data was a valid type or had to be converted to null.

@lukapeschke lukapeschke added this to the v1.0.0 milestone Oct 25, 2024
@lukapeschke
Copy link
Collaborator

Yes, calamine kinda silently converts unexpected data to nulls: it provides is_T() -> bool methods for every excel datatype, along with get_T() -> Option<T> . The get_T methods return None if they are called on an unexpected variant (get_datetime on an int for example). You're supposed to call is_T before using get_T.

But for columns where the dtype is provided, fastexcel skips the inference step, since it requires iterating over a lot of values. This means, we directly use get_T. A solution to this would be to add try_get_T methods to calamine, which would allow to get an error in case the data is of an unexpected type.

But we'd also need to design an API around this in fastexcel: do we want to load the data with an attribute indicating that the cell contains invalid data ? Do we want to provide a parameter indicating what to do with errors ? This could be 'ignore' | 'store' | 'raise' .

I'm afraid there's no quick and easy solution for this one

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
feature request question Further information is requested 🗒️ calamine 🗒️ Calamine-related issue
Projects
None yet
Development

No branches or pull requests

2 participants