Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Split nested dict columns #96

Closed
FelipeSBarros opened this issue Dec 21, 2023 · 21 comments
Closed

Split nested dict columns #96

FelipeSBarros opened this issue Dec 21, 2023 · 21 comments
Labels
enhancement New feature or request

Comments

@FelipeSBarros
Copy link
Owner

I was reading about the occurrences endpoint and playing around with the package when I relized that some columns are returned as a Python dictionary with adicional data related with the violence case.

For example, alfter getting some data for Rio de Janeiro:

import pandas as pd
from crossfire import AsyncClient

client = AsyncClient()
states_dict, _ = await client.states(format="dict")
states_df, _ = await client.states(format="df")

cities_rj, _ = await client.cities(state_id=states_df.id.iloc[0], format="df")

occs = await client.occurrences(
    id_state=cities_rj.state.iloc[16].get("id"),
    id_cities=cities_rj.id.iloc[16],
    initial_date="2019-01-01",
    final_date="2019-01-01",
    format="df",
)

If I was interested on analyzing the main reasons related to the violence cases, I would need to "unpack" the contextInfo column and, then, the mainReason, transforming it to a Pandas.Series, joining them with the "original" DataFrame, dopping the flattened column and renaming it to mainReason to keep the meaning of the column:

# check columns from Occurrences request
>>> occs.columns
Index(['id', 'documentNumber', 'address', 'state', 'region', 'city',
       'neighborhood', 'subNeighborhood', 'locality', 'latitude', 'longitude',
       'date', 'policeAction', 'agentPresence', 'relatedRecord', 'contextInfo',
       'transports', 'victims', 'animalVictims'],
      dtype='object')

# flatten/split contextInfo into different columns and concatenate in the "original" DataFrame
>>> occs = pd.concat(
    [occs.drop(["contextInfo"], axis=1), occs["contextInfo"].apply(pd.Series)], axis=1
)

# checking columns names after flettening
>>> occs.columns
Index(['id', 'documentNumber', 'address', 'state', 'region', 'city',
       'neighborhood', 'subNeighborhood', 'locality', 'latitude', 'longitude',
       'date', 'policeAction', 'agentPresence', 'relatedRecord', 'transports',
       'victims', 'animalVictims', 'mainReason', 'complementaryReasons',
       'clippings', 'massacre', 'policeUnit'],
      dtype='object')

# Flatten/split the mainReason column to keep only the mainReason name and renaming it to mainReason
>>> occs = pd.concat(
    [occs.drop(["mainReason"], axis=1), occs["mainReason"].apply(pd.Series).name.rename('mainReason')], axis=1
)
>>> occs.columns
Index(['id', 'documentNumber', 'address', 'state', 'region', 'city',
       'neighborhood', 'subNeighborhood', 'locality', 'latitude', 'longitude',
       'date', 'policeAction', 'agentPresence', 'relatedRecord', 'transports',
       'victims', 'animalVictims', 'complementaryReasons', 'clippings',
       'massacre', 'policeUnit', 'mainReason'],
      dtype='object')

>>> occs.mainReason
0        Tiros a esmo
1    Não identificado
2    Não identificado
3    Não identificado
4    Não identificado
5    Não identificado
6    Não identificado
7       Ação policial
8    Não identificado
Name: mainReason, dtype: object

# counting occurrences grouping by mainReason
>>> occs.groupby("mainReason").size()
mainReason
Ação policial       1
Não identificado    7
Tiros a esmo        1
dtype: int64

So I thought that perhaps woud be interesting a method that do that "automagically".

Beyond
contextInfo, transport, victims and animalVictims could be candidates of this process.

So I thought something like occs.flatten() to have all solumns with dict transfomed in separeted columns.
Also, this potential method could be applied to a specific column, like occs.flatten("contextInfo")to flatten only the contextInfo column.
And also something like occs.flatten("contextInfo", flatten_all=True) to have all nested dicts inside contextInfo flatenned in differents columns.
What is your opinion, @cuducos ?

@FelipeSBarros FelipeSBarros added the enhancement New feature or request label Dec 21, 2023
@FelipeSBarros
Copy link
Owner Author

Hey, @cuducos , I am back.
Before start in the task I would like to hear your opinion about the name of this potential function.
Which one do you believe translate better its functionality:

  1. flatten
  2. unpack
  3. unwrap

Would you like to suggest any other?
Best regards

@cuducos
Copy link
Collaborator

cuducos commented Feb 9, 2024

I think flatten is more used, but might be just my impression : )

@FelipeSBarros
Copy link
Owner Author

FelipeSBarros commented Feb 11, 2024

I think flatten is more used, but might be just my impression : )

No problem. I believe that your impression has mucho more context info, them mine...

By the way, I was reviewing my proposal, and I am realizing that occs object in the example gave is a GeoDataFrame, but It could be a dict or DataFrame. And because of that, in the gave example, the idea of having a method won't work (I believe that the expression don't work is not what I mean, but couldn't find a better one).
Perhaps I could add a new parameter to Occurrences class to implement this flatten process during the data aquisition. Like, Occurrences(.., flatten=True) or Occurrences(.., flatten="contextInfo") ;
I don't like this approach because if the user realize the necessity of flatten the data after having it, he would need to adquire data again.

Another approach would be to have a function that flatten the columns according to its class, no needing to process the data adquirement again.

What do yout think?

@FelipeSBarros
Copy link
Owner Author

@cuducos , before I start implementing the flatten fucntion for GeoDataFrame, I wonder myself how would you follow the implementation.
For now we have a flatten function that workd for dicts. Now I should improve se the flatten functionwork also forGeoDataFrame`.
I can see a few ways of doing that:

  1. Refactoring flatten fucntion transforming it on a parser according to the data class input. So, passing the dicts to a new function flatten_dict, for instance. Then start to develop flatten_gdf, for GeoDataFrame.
  2. Build a completely new function flatten_gdf, for instanceto flattenGeoDataFrameand latter, once working, refactor both, makinfflatten` a "parser".

What is your opinion?

@cuducos
Copy link
Collaborator

cuducos commented Feb 22, 2024

My suggestion would be slightly different:

  1. Write a test that passes a pandas.DataFrame as data to the existing flatten and expect the response as desired
  2. Make the test pass
  3. Refactor if needed (I cannot ponder on whether we need a flatten_df without seeing the code we need for that, for example)

@FelipeSBarros
Copy link
Owner Author

My suggestion would be slightly different:

  1. Write a test that passes a pandas.DataFrame as data to the existing flatten and expect the response as desired
  2. Make the test pass
  3. Refactor if needed (I cannot ponder on whether we need a flatten_df without seeing the code we need for that, for example)

Great. I will work on that.

@cuducos
Copy link
Collaborator

cuducos commented Feb 22, 2024

I'll leave this just as a provocation:

def add(x: int, y: int):
    return x + y


def add(x: str, y: str):
    return int(x) + int(y)


print(add(40, 2))
print(add("40", "2"))

@FelipeSBarros
Copy link
Owner Author

@cuducos , before merging I decided to try the solution and make some experiments and come with two concerns. Perhaps one is a potential issue.

  1. During my manual tests I could find that locality column may have None values, raising AttributeError:
    AttributeError: 'NoneType' object has no attribute 'items';

Not sure if this could happen (is expected) to other nested columns. I will check it with the Tech lead of Fogo Cruzado.
But anyway, do you think we should consider a validation before running the flatten?

  1. Also, I realized that, the nested column mainReason, on the contextInfo, return a dict with ID and the shooting reason:
    {'id': 'baa3b299-67ad-41d2-aaf0-23ec8288cadb', 'name': 'Homicidio/Tentativa'}.
    So, I am wondering if it worth making the flatten function flatten not only the nested_columns but also its child columns, in case they has a dict value inside.

What do you think?

Here is the code snipped with both points:

from crossfire import occurrences
from crossfire.clients.occurrences import flatten

occs = occurrences(id_state='813ca36b-91e3-4a18-b408-60b27a1942ef',
                id_cities='5bd3bfe5-4989-4bc3-a646-fe77a876fce0',
                initial_date='2018-04-01', format='geodf')

# point 1
occs.locality.head()
# 0                                                 None
# 1                                                 None
# 2    {'id': '9d8b21a8-2d6e-414d-83f7-c53fbc6db488',...
# 3                                                 None
# 4                                                 None
# Name: locality, dtype: object

flattened_occs = flatten(occs, nested_columns=['locality'])
# Traceback (most recent call last):
#  File "/opt/pycharm-community-2022.3.2/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
#    coro = func()
#  File "<input>", line 1, in <module>
#  File "/home/felipe/repos/crossfire/crossfire/clients/occurrences.py", line 228, in flatten
#    data = _flatten_df(data, nested_columns)
#  File "/home/felipe/repos/crossfire/crossfire/clients/occurrences.py", line 199, in _flatten_df
#    data.apply(_flatten_col, args=(key,), axis=1),
#  File "/home/felipe/.cache/pypoetry/virtualenvs/crossfire-WqfJa-6U-py3.10/lib/python3.10/site-packages/geopandas/geodataframe.py", line 1581, in apply
#    result = super().apply(
#  File "/home/felipe/.cache/pypoetry/virtualenvs/crossfire-WqfJa-6U-py3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 10034, in apply
#    return op.apply().__finalize__(self, method="apply")
#  File "/home/felipe/.cache/pypoetry/virtualenvs/crossfire-WqfJa-6U-py3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 837, in apply
#    return self.apply_standard()
#  File "/home/felipe/.cache/pypoetry/virtualenvs/crossfire-WqfJa-6U-py3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 963, in apply_standard
#    results, res_index = self.apply_series_generator()
#  File "/home/felipe/.cache/pypoetry/virtualenvs/crossfire-WqfJa-6U-py3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 979, in apply_series_generator
#    results[i] = self.func(v, *self.args, **self.kwargs)
#  File "/home/felipe/repos/crossfire/crossfire/clients/occurrences.py", line 190, in _flatten_col
#    for key, value in column_data.items()
# AttributeError: 'NoneType' object has no attribute 'items'


# Point 2: showing the returned `mainReason` returned value:
flattened_occs = flatten(occs, nested_columns=['contextInfo'])
flattened_occs.contextInfo_mainReason.iloc[0]
# {'id': 'baa3b299-67ad-41d2-aaf0-23ec8288cadb', 'name': 'Homicidio/Tentativa'}

@cuducos
Copy link
Collaborator

cuducos commented Apr 9, 2024

I could find that locality column may have None values

I am dying to see a test with this in test_flatten.py ; )

So, I am wondering if it worth making the flatten function flatten not only the nested_columns but also its child columns, in case they has a dict value inside.

It makes sense, but be aware I am not a heavy user of Pandas and GeoPandas… maybe flattening stuff might be optional?

@FelipeSBarros
Copy link
Owner Author

I am dying to see a test with this in test_flatten.py ; )

Don't worry, ASAP will be done.

It makes sense, but be aware I am not a heavy user of Pandas and GeoPandas…

But this is not a Pandas/GeoPandas related issue. It affects dict also;

from crossfire import occurrences
from crossfire.clients.occurrences import flatten

occs = occurrences(id_state='813ca36b-91e3-4a18-b408-60b27a1942ef',
                id_cities='5bd3bfe5-4989-4bc3-a646-fe77a876fce0',
                initial_date='2018-04-01')#, format='geodf')

flattened_occs = flatten(occs)
# Traceback (most recent call last):
#   File "/opt/pycharm-community-2022.3.2/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
#     coro = func()
#   File "<input>", line 1, in <module>
#   File "/home/felipe/repos/crossfire/crossfire/clients/occurrences.py", line 231, in flatten
#     data = _flatten_list(data, nested_columns)
#   File "/home/felipe/repos/crossfire/crossfire/clients/occurrences.py", line 210, in _flatten_list
#     item.update({f"{key}_{k}": v for k, v in item.get(key).items()})
# AttributeError: 'NoneType' object has no attribute 'items'

maybe flattening stuff might be optional?

Yes, flatten function is an auxiliarly function. It is not excuted automatically. Parhaps you was wondering we should keep contextInfo as the main nested_columns and all others, optional?

What about this second point?

2. Also, I realized that, the nested column mainReason, on the contextInfo, return a dict with ID and the shooting reason:
{'id': 'baa3b299-67ad-41d2-aaf0-23ec8288cadb', 'name': 'Homicidio/Tentativa'}.
So, I am wondering if it worth making the flatten function flatten not only the nested_columns but also its child columns, in case they has a dict value inside.

@cuducos
Copy link
Collaborator

cuducos commented Apr 10, 2024

But this is not a Pandas/GeoPandas related issue. It affects dict also;

That is precisely why I said that. dict accommodates pretty well nested data structures. Pandas doesn't. I really don't mind nested dictionaries : ) So… again:

maybe flattening stuff might be optional?

@FelipeSBarros
Copy link
Owner Author

maybe flattening stuff might be optional?

@cuducos I think I din't get your point when you mention that. Could you explain?
Because, as I said have mentioned before, the process os flattening is done after acquiring Occurrences dataset and, only if user want to. It is not being used in the process os data acquisition. So, I understand it as "optional".
But, even being "optinal", a unadvised user might try to flatten those keys or columns that might have register whithout value None, returning the beforementioned error...

So, just to confirm my thought (and what I think I have learned):

  1. We need to decide how to deal with it or how to manage the error in the flatten function before merging the PR Update flatten to accept DataFrame and GeoDataFrame #103 ;
  2. Considering that I still work to be done on the Update flatten to accept DataFrame and GeoDataFrame #103 , do you think worth expand it and find a way to work with the following point?:

Also, I realized that, the nested column mainReason, on the contextInfo, return a dict with ID and the shooting reason:
{'id': 'baa3b299-67ad-41d2-aaf0-23ec8288cadb', 'name': 'Homicidio/Tentativa'}.
So, I am wondering if it worth making the flatten function flatten not only the nested_columns but also its child columns, in case they has a dict value inside.

@cuducos
Copy link
Collaborator

cuducos commented Apr 15, 2024

Where is it documented? I think I made wrong assumptions, tried to check the docs and found nothing

@FelipeSBarros
Copy link
Owner Author

Where is it documented? I think I made wrong assumptions, tried to check the docs and found nothing

Ouch. It is not documented yet.
I will work on this task and commit in the PR #103 , ok?

@FelipeSBarros
Copy link
Owner Author

@cuducos , documentation added. Please let me know if it is well written or if I missed something.

@cuducos
Copy link
Collaborator

cuducos commented Apr 21, 2024

Ok. I think this works but is suboptimal:

occurrences('813ca36b-91e3-4a18-b408-60b27a1942ef', format='df')  # giver nested fields
occurrences('813ca36b-91e3-4a18-b408-60b27a1942ef', format='df', flat=True)  # gives flatten data
occurrences('813ca36b-91e3-4a18-b408-60b27a1942ef', format='df', flat=False)  # giver nested fields

I don't see a reason for us to expose the flatten function itself, but its functionality ; )

@FelipeSBarros
Copy link
Owner Author

Ok. I think this works but is suboptimal:

occurrences('813ca36b-91e3-4a18-b408-60b27a1942ef', format='df')  # giver nested fields
occurrences('813ca36b-91e3-4a18-b408-60b27a1942ef', format='df', flat=True)  # gives flatten data
occurrences('813ca36b-91e3-4a18-b408-60b27a1942ef', format='df', flat=False)  # giver nested fields

I don't see a reason for us to expose the flatten function itself, but its functionality ; )

But if some one, for some reason, get the data without enabling flat, would it be posible to have it as a function as well? So in this case the user han't to retrieve the data again?

Also, using its functionality in occurrences(), how should handle the problem mentioned before;

@cuducos
Copy link
Collaborator

cuducos commented Apr 23, 2024

how should handle the problemmentioned before

The way I said before.

would it be posible to have it as a function as well?

From the Zen of Python:

There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.

@FelipeSBarros
Copy link
Owner Author

The way I said before.

OK, I have just added a test with a list of dicts with and without nested columns; Test is not passing, as I am interested to see how would you guide me on handling this situation. Let's discuss the potential solution in the PR #103 ,right?

From the Zen of Python:

There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.

Makes sense. But I am not convinced that the way you suggested is THE way of doing it.
As I am thinling of someone that is not used with the data, perharps will request some occurrences, them will notice the nested columns, and could get it flattened without requesting all the dat again. But, perhaps I am not seeing mothing you do. But I am willing to follow your suggesttion, anyway.

@cuducos
Copy link
Collaborator

cuducos commented Apr 25, 2024

My point is that the function flatten has absolutely no use outside of the context of dealing with occurrences.

So, a user unfamiliar with the data, API, and package might do what you said. However, another unfamiliar one might try to use it with states or with other data and get an unexpected error.

So, my point is:

  • your design choice (making flatten publicly available) exposes one more source of possible errors
  • your design choice is thinking about the worst-case scenario

But, still… you don't have to agree with me, and you don't have to design the APi the way I would : )

@FelipeSBarros
Copy link
Owner Author

FelipeSBarros commented Apr 26, 2024

My point is that the function flatten has absolutely no use outside of the context of dealing with occurrences.

That is a really good point. The fact is that I thought about ´flatten in the context of occurrences, as I saw some interesting addicional information was in nested columns. But, citites has state column with neted information. But, I don't see any practical use on it. So, only to mention.

from crossfire import cities
c = cities(format = 'df')
c.state.head()
c.state.head()
#0    {'id': 'b112ffbe-17b3-4ad0-8f2a-2038745d1d14',...
#1    {'id': 'b112ffbe-17b3-4ad0-8f2a-2038745d1d14',...
#2    {'id': 'b112ffbe-17b3-4ad0-8f2a-2038745d1d14',...
#3    {'id': 'b112ffbe-17b3-4ad0-8f2a-2038745d1d14',...
#4    {'id': 'b112ffbe-17b3-4ad0-8f2a-2038745d1d14',...
#Name: state, dtype: object

So, a user unfamiliar with the data, API, and package might do what you said. However, another unfamiliar one might try to use it with states or with other data and get an unexpected error.
So, my point is:

  • your design choice (making flatten publicly available) exposes one more source of possible errors
  • your design choice is thinking about the worst-case scenario

Thanks, @cuducos I really apprecitate your comments. I always feel my code somehow messy. And I think this is not only because of the code structure. Perhaps it has relation with considering [allways] those worst-case scenarios;

But, still… you don't have to agree with me, and you don't have to design the APi the way I would : )

Sure. But to take a good decision I need to confirm if my point of view is not wrong or even, if there is a better approach. Thanks for being patient. I will work on your comments on the PR #103

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants