Jupysql with autopolars crashes when schema cannot be inferred from the first 100 rows #312
@jorisroovers thanks for reporting! Feel free to submit a PR and happy to guide you through!
Maybe a more straightforward solution is not to use a generator? If we really need to add another config for polars, I think we should wrap them in a single config as you mentioned: `%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}`
You're right - just wrapping in a list works too: `frame = pl.DataFrame(list(tuple(row) for row in self), schema=self.keys)`. That should be a tiny PR then (will include a test too) 👍
Allows for passing of custom keyword arguments to the Polars DataFrame constructor. Fixes ploomber#312
I take this back, this doesn't actually work:
Long story short, I went with the
Thanks @jorisroovers for taking a look at this! We'll review your PR!
Do we still have the problem if we do it like this? `frame = pl.DataFrame(list(list(row) for row in self), schema=self.keys)`

I'm starting to think that adding the kwargs is overkill, since most of the options are data-specific (e.g., the schema), so it's not very useful to put a global option at the top. If anything, maybe we can automatically pass a larger threshold for inferring the schema (say, 1k observations).

What is definitely useful is allowing options to be passed to the constructor (as you already did in your PR): `results.PolarsDataFrame(a=1, b=2)`
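A minimal sketch, for illustration only, of what forwarding constructor-level keyword arguments could look like; the `ResultSet` stand-in and the `PolarsDataFrame` method shown here are assumptions, not the actual jupysql implementation:

```python
import polars as pl


class ResultSet:
    """Minimal stand-in for jupysql's ResultSet; names here are assumptions."""

    def __init__(self, rows, keys):
        self._rows = rows
        self.keys = keys

    def __iter__(self):
        return iter(self._rows)

    def PolarsDataFrame(self, **polars_dataframe_kwargs):
        # Forward any caller-supplied kwargs (e.g. infer_schema_length=None)
        # straight to the polars.DataFrame constructor.
        return pl.DataFrame(
            [tuple(row) for row in self],
            schema=self.keys,
            **polars_dataframe_kwargs,
        )


# 100 NULL rows followed by a string, the case from this issue:
rs = ResultSet([(None,)] * 100 + [("foo",)], keys=["value"])
print(rs.PolarsDataFrame(infer_schema_length=None).dtypes)
```

Passing the kwargs per call keeps data-specific options like `infer_schema_length` out of any global configuration, which matches the point made above.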
Yes, same error :/
Hmm, that I can see, yes. Although I can see that a lot of folks would also set
FWIW, I discovered this issue when reading in a CSV that had an unexpected string in a numeric column on row 3000-something. It would be really nice if we didn't have to do a cleaning step up front to deal with this issue by setting
I'm not entirely following this suggestion and how it's different from what I implemented in #325. Can you please elaborate? Thanks!
Ok, so it sounds like
By default, Polars infers the schema for a DataFrame column from the first 100 rows (see `infer_schema_length`) in case a generator is passed. This leads to a problem with jupysql when `SqlMagic.autopolars = True` and the datatype for a column in the `ResultSet` cannot be correctly inferred from the first 100 rows. Consider the notebook below that shows the issue.

As noted, the reason this fails is because `ResultSet` is a generator, in which case Polars will only look at the first 100 rows to infer the column type (jupysql/src/sql/run.py, line 190 in 65c99f4).
In the example above, the first 100 rows are `NULL`, in which case Polars infers its default type `i64`. When it then encounters `"foo"`, it errors because `"foo"` is clearly not an `i64`.
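For reference, a minimal, self-contained reproduction of this failure mode outside of jupysql (the exact error message depends on the Polars version):

```python
import polars as pl

# 100 NULL rows followed by a string value, mirroring the example above.
rows = [(None,)] * 100 + [("foo",)]

# With the default infer_schema_length (100), Polars only inspects the
# first 100 rows when inferring the dtype and then fails on "foo".
try:
    pl.DataFrame(rows, schema=["value"])
except Exception as exc:
    print(f"schema inference failed: {exc}")

# Scanning all rows before deciding on a dtype avoids the mismatch,
# at the cost of reading the full result set up front.
frame = pl.DataFrame(rows, schema=["value"], infer_schema_length=None)
print(frame.dtypes)  # a string dtype (Utf8/String, depending on the Polars version)
```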
As shown in the example as well, the fix is to set `infer_schema_length` in the DataFrame constructor. Since this has a performance implication though, I believe this should ideally be exposed as a `SqlMagic` config option.

I'm happy to implement this (next week probably) if you'd accept the change - let me know!
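For illustration, a hedged sketch of how the proposed configuration could be used in a notebook, assuming the `polars_dataframe_kwargs` option suggested earlier in this thread; `my_table` is a hypothetical table name:

```
# Assumes jupysql is installed and the polars_dataframe_kwargs option
# discussed above is available; my_table is a hypothetical table.
%load_ext sql
%config SqlMagic.autopolars = True
%config SqlMagic.polars_dataframe_kwargs = {"infer_schema_length": None}

result = %sql SELECT value FROM my_table
type(result)  # polars.DataFrame built with the kwargs above
```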