Skip to content

BUG: Impossible creation of array with dtype=string #61263

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

Manju080
Copy link
Contributor

@Manju080 Manju080 commented Apr 9, 2025

closes #61155

Hello @rhshadrach ,

I’ve created a fix that raises a ValueError when trying to create a StringArray from a list of lists with inconsistent lengths or non-character elements. This aligns the behavior for both consistent and inconsistent input formats and also tested.

I've would like to hear opinion to raise an error when a list of lists is passed for dtype=StringDtype, to avoid ambiguous behavior. If preferred, we could instead join the inner lists into strings automatically — happy to adjust based on guidance.
Example case : pd.array([["t", "e", "s", "t"], ["w", "o", "r", "d"]], dtype="string")
output : <StringArray> ['test', 'word'] Length: 2, dtype: string

Thanks

@rhshadrach
Copy link
Member

Also, please add a test for this.

@Manju080 Manju080 requested a review from WillAyd as a code owner April 15, 2025 17:05
@simonjayhawkins simonjayhawkins changed the title Bugfix 61155 BUG: Impossible creation of array with dtype=string Apr 16, 2025
@simonjayhawkins simonjayhawkins added Bug Strings String extension data type and string data labels Apr 16, 2025
@Manju080
Copy link
Contributor Author

pre-commit.ci autofix

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use pytest for testing, you'll need to add a test using that format. See here:

https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#using-pytest

The general pytest introduction may also be useful:

https://docs.pytest.org/en/7.1.x/getting-started.html

@Manju080
Copy link
Contributor Author

We use pytest for testing, you'll need to add a test using that format. See here:

https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#using-pytest

The general pytest introduction may also be useful:

https://docs.pytest.org/en/7.1.x/getting-started.html

Thank you for the details, will work on it

@Manju080
Copy link
Contributor Author

@rhshadrach I’ve been testing the following case in test_lib.py

def test_ensure_string_array_list_of_lists():
# GH#61155: ensure list of lists doesn't get converted to string
arr = [['t', 'e', 's', 't'], ['w', 'o', 'r', 'd']]
result = lib.ensure_string_array(arr)

# Each item in result should still be a list, not a stringified version
assert isinstance(result[0], list)
assert isinstance(result[1], list)
assert result[0] == ['t', 'e', 's', 't']
assert result[1] == ['w', 'o', 'r', 'd']

However, the test fails with
FAILED pandas/tests/libs/test_lib.py::test_ensure_string_array_list_of_lists - AssertionError
DEBUG RESULT: ["['t', 'e', 's', 't']" "['w', 'o', 'r', 'd']"] <class 'numpy.ndarray'> <class 'str'>

So currently, the list of lists gets converted into a 1D NumPy array of strings.
With the current implementation, arr becomes a 1D object array of lists (as intended), but it seems that downstream processing stringifies each list.
Do you want me to guard against this case inside ensure_string_array to preserve the list structure? Or is the stringification expected behavior in this context?

Thanks!

@rhshadrach
Copy link
Member

rhshadrach commented Apr 25, 2025

I believe converting to a 1-dimesional ndarray of strings is the expected behavior of enusure_string_array. Perhaps I'm misunderstanding; what is the alternative?

@Manju080
Copy link
Contributor Author

Manju080 commented May 5, 2025

Thanks for the clarification!

You're right — the behavior of ensure_string_array producing a 1D ndarray of stringified inner lists (when given a list of lists like [list("test"), list("word")]) is consistent with the current expectations of the function.

def test_ensure_string_array_list_of_lists():
arr = [list("test"), list("word")]
result = lib.ensure_string_array(arr)
assert isinstance(result, np.ndarray)
assert result.dtype == object
assert result[0] == "['t', 'e', 's', 't']"
assert result[1] == "['w', 'o', 'r', 'd']"
print("DEBUG RESULT:", result)

My initial assumption was that it should preserve the list structure instead of converting to strings, but after re-evaluating and running the test, I see that the 1D array of strings is indeed the intended behavior. The test has now been updated and passes successfully and got the below output
[1/1] Generating write_version_file with a custom command
================================================= test session starts
==================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/c/Users/HP/Documents/Python_pandas_op/pandas
configfile: pyproject.toml
plugins: hypothesis-6.131.6
collected 84 items
pandas/tests/libs/test_lib.py ...................................................................................DEBUG RESULT: ["['t', 'e', 's', 't']" "['w', 'o', 'r', 'd']"]`

----------------- generated xml file: /mnt/c/Users/HP/Documents/Python_pandas_op/pandas/test-data.xml ------------------
================================================= slowest 30 durations
=================================================
0.09s setup pandas/tests/libs/test_lib.py::TestMisc::test_max_len_string_array

(29 durations < 0.005s hidden. Use -vv to show these durations.)

Please let me know if I need to change anything

@rhshadrach
Copy link
Member

@Manju080 - the last change I'm seeing is from 3 weeks ago. Perhaps you need to push some commits?

@Manju080
Copy link
Contributor Author

Manju080 commented May 6, 2025

That's right, I just wanna make sure before committing the changes.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
@Manju080
Copy link
Contributor Author

Manju080 commented May 8, 2025

Apologies for the causing confusion, I will work this to fix.

@Manju080
Copy link
Contributor Author

Manju080 commented May 8, 2025

@rhshadrach Thank you very much, required changes are done.
Let me know if there is anything

@Manju080
Copy link
Contributor Author

pre-commit.ci autofix

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Bug Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Impossible creation of array with dtype=string
3 participants