Users should be able to access a dataframe element that is an Arrow-backed map with normal Python dict semantics.
Today, accessing an Arrow-backed map element returns a list of tuples, per as_py() on the MapScalar type, so you get list semantics rather than dictionary access semantics. Historically, this is because Arrow maps allow duplicate keys and do not enforce key ordering, and converting to a Python dictionary loses both behaviors: (1) duplicate keys are collapsed and (2) the ordering may change. In practice, neither is the common case, and so this makes the common case hard.
The common case is that users want to interact with a map with traditional key/value access semantics. It's often a burden and a source of confusion that users need to convert manually, a la:

```python
# pseudocode
df = table.to_pandas(types_mapper=pd.ArrowDtype)
my_dict = df["col_a"].iloc[0]
val = my_dict["key"]        # TypeError: the element is a list of tuples, not a dict
val = dict(my_dict)["key"]  # users must manually convert to a dict on each access
```
This behavior should also be available when using imperative, iteration-based methods like .iterrows(), which is another common pattern for accessing elements one by one.
Feature Description
We can have a configuration for this in ArrowExtensionArray.
Arrow already has a maps_as_pydicts flag, .to_pandas(maps_as_pydicts=True), which controls this behavior only when not using pyarrow-backed dataframes (i.e., when using numpy-backed dataframes). The flag generates a native Python dictionary instead of a Python list of (key, value) tuples; it has also made its way into lower-level APIs and comes up with competing dataframe libraries. This feature is already widely used in at least one large company.
There's not an obvious place to put this in the types_mapper API. But we can already see unexpected behavior when combining maps_as_pydicts=True with types_mapper=pd.ArrowDtype:

```python
# pseudocode
df = table.to_pandas(types_mapper=pd.ArrowDtype, maps_as_pydicts=True)
# my_dict is still a `MapScalar`!!
my_dict = df["col_a"].iloc[0]
```
When combined, maps_as_pydicts is effectively ignored, because the code path taken for types_mapper=pd.ArrowDtype makes no use of the flag.
So, this is all to say: when both of those flags are passed, we should propagate the configuration to Pandas, so that it is used during element access 1, 2
Such a change requires changes in both Arrow and Pandas.
Alternative Solutions
Alternatively, we can save some state in the underlying pyarrow array, so that calling as_py() on the MapScalar will automatically do the right thing.
Some breadcrumbs for context:
- a MapScalar is generated when accessing a pyarrow MapArray 1, 2
- that scalar is accessed when retrieving an element from an ArrowExtensionArray 1, 2
So, one can imagine that this information is saved in the MapArray/Table itself. However, that also introduces action at a distance between converting a table to a dataframe and later performing element access. It would be more straightforward to configure this during the conversion to Pandas and hold that configuration state in the dataframe.
Another partial alternative is adding a .map accessor. I lack context on these accessors and don't know whether they are an obvious solution or a ham-fisted one.
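For what it's worth, the accessor idea can be sketched with pandas' public registration API. Series already has a .map method, so a real accessor would need a different name; "pymap" below is a hypothetical placeholder, as is its get() helper:

```python
# Hypothetical accessor sketch: converts a map element's list of
# (key, value) tuples to a dict at access time.
import pandas as pd

@pd.api.extensions.register_series_accessor("pymap")
class PyMapAccessor:
    def __init__(self, series: pd.Series):
        self._series = series

    def get(self, i: int, key):
        # element is a list of (key, value) tuples; convert, then index
        return dict(self._series.iloc[i])[key]

s = pd.Series([[("a", 1), ("b", 2)]])
result = s.pymap.get(0, "b")
```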
Additional Context
Performance can be a consideration. When doing an element access, we'd be doing a conversion from the native Arrow array to a Python dictionary.
However, this is already the case: element access on a MapScalar already traverses the underlying MapArray and converts it to a Python list 1, 2