Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Document/test unstable decimal dtype #1033

Merged
merged 3 commits into from
Feb 3, 2025
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions lib/explorer/series.ex
Original file line number Diff line number Diff line change
@@ -186,6 +186,14 @@ defmodule Explorer.Series do
@type dtype_alias :: integer_dtype_alias | float_dtype_alias | decimal_dtype_alias
@type float_dtype_alias :: :float | :f32 | :f64
@type integer_dtype_alias :: :integer | :u8 | :u16 | :u32 | :u64 | :s8 | :s16 | :s32 | :s64
@typedoc """
Dtype for a fixed-point decimal represented internally as an integer.

> #### Warning: Unstable {: .warning}
>
> This functionality is considered unstable. It is a work-in-progress feature
> and may not always work as expected. It may be changed at any point.
"""
@type decimal_dtype_alias :: :decimal

@type t :: %Series{data: Explorer.Backend.Series.t(), dtype: dtype()}
65 changes: 65 additions & 0 deletions test/explorer/polars_backend/decimal_unstable_test.exs
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
defmodule Explorer.PolarsBackend.DecimalUnstableTest do
# This module tracks some oddities we've found with decimal dtypes. The Polars
# docs warn that decimals are unstable:
#
# > This functionality is considered unstable. It is a work-in-progress
# > feature and may not always work as expected. It may be changed at any
# > point without it being considered a breaking change.
#
# https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Decimal.html
#
# If the tests in the module start breaking, it probably means Polars has
# changed its decimal implementation. Be prepared to change the tests: they
# function as canaries rather than imposing expected behavior.

use ExUnit.Case, async: true

alias Explorer.PolarsBackend

require Explorer.DataFrame, as: DF

setup do
%{df: DF.new(a: [Decimal.new("1.0"), Decimal.new("2.0")])}
end

test "mean returns decimal instead of float", %{df: df} do
assert_raise(
RuntimeError,
"""
DataFrame mismatch.

expected:

names: ["a"]
dtypes: %{"a" => {:f, 64}}

got:

names: ["a"]
dtypes: %{"a" => {:decimal, 38, 1}}
""",
fn -> DF.summarise(df, a: mean(a)) end
)
end
Comment on lines +25 to +43
Copy link
Member Author

@billylanchantin billylanchantin Nov 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a result of our expectations:

defp dtype_for_agg_operation(op, _)
when op in [:count, :nil_count, :size, :n_distinct, :argmin, :argmax],
do: {:u, 32}
defp dtype_for_agg_operation(op, series)
when op in [:first, :last, :sum, :min, :max, :product],
do: series.dtype
defp dtype_for_agg_operation(op, _) when op in [:all?, :any?], do: :boolean
defp dtype_for_agg_operation(:mode, series), do: {:list, series.dtype}
defp dtype_for_agg_operation(_, _), do: {:f, 64}

I'm not entirely sure it's correct. But it seems to be?

>>> df = pl.DataFrame({"a": [1.1, 2.2]}, schema={"a": pl.datatypes.Decimal(18, 2)})
>>> df.with_columns(b=pl.col("a").mean())
shape: (2, 2)
┌───────────────┬──────┐
│ ab    │
│ ------  │
│ decimal[18,2] ┆ f64  │
╞═══════════════╪══════╡
│ 1.101.65 │
│ 2.201.65 │
└───────────────┴──────┘

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine to default our decimal operations to float and fix them as Polars start returning decimals for them too. I would prefer that than raising (because returning float still "works").

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim Sorry forgot about this. Are you saying we should remove this test and force our decimal operations to return float instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return whatever Polars return for now and we will fix it naturally as Polars fixes it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then the test can stay. When Polars fixes their end, we will learn about it because the test will break. Then we can convert it to a test of what the behavior should actually be.


test "unchecked mean returns nil", %{df: df} do
# Here we recreate the internals of `DF.summarise(df, a: mean(a))` to avoid
# the invalid dtype expectation. What we should get is a decimal that
# represents the mean, but what we do get is `nil`.
qf = Explorer.Query.new(df)
ldf = PolarsBackend.DataFrame.lazy(df)
lazy_mean_a = Explorer.Series.mean(qf["a"])

expr =
lazy_mean_a.data
|> PolarsBackend.Expression.to_expr()
|> PolarsBackend.Expression.alias_expr("a")

df =
with {:ok, pdf1} <- PolarsBackend.Native.lf_summarise_with(ldf.data, [], [expr]),
{:ok, pdf2} <- PolarsBackend.Native.lf_compute(pdf1),
do: PolarsBackend.Shared.create_dataframe!(pdf2)

assert Explorer.Series.to_list(df["a"]) == [nil]
end
end
Loading