
ENH: access arrow-backed map as a python dictionary #61427

Open
2 of 3 tasks
mikelui opened this issue May 10, 2025 · 0 comments
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

mikelui commented May 10, 2025

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Users should be able to access a DataFrame element that is an Arrow-backed map with normal Python dict semantics.

Today, accessing an Arrow-backed map element returns a list of (key, value) tuples, per as_py() on the MapScalar type, so it has list semantics rather than dictionary access semantics. Historically, this is because Arrow allows duplicate keys in a map, and ordering is not enforced. Converting to a Python dictionary loses both behaviors: (1) duplicate keys are collapsed and (2) the ordering may change. In practice, maps with duplicate keys or meaningful ordering are not the common case, so the current behavior makes the common case hard.
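
To make the two lossy behaviors concrete, here is a minimal pure-Python sketch. The `pairs` list is only a stand-in for the list of (key, value) tuples that as_py() returns; no pyarrow call is involved.

```python
# Stand-in for the as_py() result of a map element with a duplicate key.
pairs = [("b", 1), ("a", 2), ("b", 3)]

# dict() keeps the last value for a duplicate key, but the key stays at the
# position where it was first inserted.
as_dict = dict(pairs)

print(as_dict)                   # {'b': 3, 'a': 2}
print(len(pairs), len(as_dict))  # 3 2
```

The entry ("b", 1) is gone, and the surviving "b" value no longer sits where it did in the original sequence.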

The common case is that users want to interact with a map with traditional key/value access semantics. It's often a burden and a source of confusion when users need to convert manually, as in:

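import pandas as pd
import pyarrow as pa

# A one-row table with a map<string, int64> column (illustrative data)
map_type = pa.map_(pa.string(), pa.int64())
table = pa.table({"col_a": pa.array([[("key", 1)]], type=map_type)})
df = table.to_pandas(types_mapper=pd.ArrowDtype)
my_dict = df["col_a"].iloc[0]  # a list of (key, value) tuples, not a dict

# val = my_dict["key"]  # TypeError: no key/value access semantics on a list
val = dict(my_dict)["key"]  # users need to manually convert to a dict on each access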

This behavior should also be available when using imperative, iteration-based methods like .iterrows(), which is another common pattern for accessing data element by element.

Feature Description

We can have a configuration for this in ArrowExtensionArray.

Arrow already has a maps_as_pydicts flag, .to_pandas(maps_as_pydicts=True), which controls this behavior only for NumPy-backed DataFrames, not for pyarrow-backed ones. This feature is already widely used in at least one large company.

The flag generates a native Python dictionary instead of a Python list of (key, value) tuples. It has also made its way into lower-level APIs and has come up in competing dataframe libraries.
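
For reference, my understanding of the pyarrow documentation is that maps_as_pydicts takes None, "lossy", or "strict" rather than a boolean; treat that as an assumption to verify against the installed version. The sketch below mirrors those two documented modes in pure Python. `pairs_to_dict` is a hypothetical helper written for illustration, not a pyarrow or pandas function.

```python
import warnings


def pairs_to_dict(pairs, mode="lossy"):
    """Hypothetical helper: convert (key, value) pairs to a dict.

    mode="lossy":  warn on a duplicate key and keep the last value seen.
    mode="strict": raise ValueError on a duplicate key.
    """
    if mode not in ("lossy", "strict"):
        raise ValueError(f"unsupported mode: {mode!r}")
    result = {}
    for key, value in pairs:
        if key in result:
            if mode == "strict":
                raise ValueError(f"duplicate key: {key!r}")
            warnings.warn(f"duplicate key {key!r}; keeping the last value")
        result[key] = value
    return result


print(pairs_to_dict([("a", 1), ("b", 2)], mode="strict"))  # {'a': 1, 'b': 2}
```

Either mode gives key/value access for the common case where keys are unique; they differ only in how loudly they report the uncommon case.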

There's not an obvious place to put this in the types_mapper API. But we can already see unexpected behavior when combining maps_as_pydicts=True with types_mapper=pd.ArrowDtype:

# pseudocode
df = table.to_pandas(types_mapper=pd.ArrowDtype, maps_as_pydicts=True)

# my_dict is still the as_py() result of a `MapScalar` (a list of tuples), not a dict!
my_dict = df["col_a"].iloc[0]

When combined, maps_as_pydicts is effectively ignored, because the code path taken for types_mapper=pd.ArrowDtype makes no use of the flag.

In short, when both flags are passed, the configuration should be propagated to pandas so that it is used during element access 1, 2

Such a change requires changes in both Arrow and Pandas.
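
As a rough illustration of the proposal, the sketch below shows a container that stores the setting at construction time and applies it on element access. `MapColumnSketch` and its `maps_as_pydicts` argument are hypothetical and pure Python; this is not how ArrowExtensionArray is implemented, only the shape of the idea.

```python
class MapColumnSketch:
    """Hypothetical stand-in for an Arrow-backed map column.

    Each row is a list of (key, value) tuples, like the as_py() result of a
    map element, or None for a missing value.
    """

    def __init__(self, rows, maps_as_pydicts=False):
        self._rows = rows
        # Configuration captured once, at "conversion" time.
        self._maps_as_pydicts = maps_as_pydicts

    def __getitem__(self, index):
        pairs = self._rows[index]
        if pairs is None:
            return None
        # The stored configuration decides the element-access semantics.
        return dict(pairs) if self._maps_as_pydicts else list(pairs)


rows = [[("key", 1), ("other", 2)], None]
print(MapColumnSketch(rows)[0])                        # [('key', 1), ('other', 2)]
print(MapColumnSketch(rows, maps_as_pydicts=True)[0])  # {'key': 1, 'other': 2}
```

Holding the flag on the column keeps the behavior tied to the conversion call that requested it, so element access does not depend on state stored elsewhere.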

Alternative Solutions

Alternatively, we can save some state in the underlying pyarrow array, so that calling as_py() on the MapScalar will automatically do the right thing.

Some breadcrumbs for context:

  • a MapScalar is generated when accessing a pyarrow MapArray 1, 2
  • this is accessed when retrieving an element from an ArrowExtensionArray 1, 2

So, one can imagine that this information is saved in the MapArray/Table itself. However, that also introduces action at a distance between converting a table to a DataFrame and later performing element access. It would be more straightforward to configure this during the conversion to pandas and hold that configuration state in the DataFrame.


Another partial alternative is making a .map accessor. I lack context on these accessors and don't know if they are an obvious solution, or a ham-fisted one.

Additional Context

Performance can be a consideration. When doing an element access, we'd be doing a conversion from the native Arrow array to a Python dictionary.

However, this is already the case. Element access on a MapScalar already traverses the underlying MapArray and converts it to a Python list 1, 2

@mikelui mikelui added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2025