BUG: Inconsistent results using pd.json_normalize() on a generator object versus list (off by one) #35923
Comments
I don't believe the documentation states that a generator is an accepted value for
that said the issue is here: https://github.com/pandas-dev/pandas/blob/v1.1.1/pandas/io/json/_normalize.py#L269-L279 specifically
this consumes the first yielded result from the generator
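The mechanism can be reproduced without pandas. The following is a minimal sketch, not the pandas source: `looks_nested` is a hypothetical stand-in for a pre-check that iterates over the input and stops after the first record, which is enough to advance a generator by one item.

```python
def records():
    """Yield three flat records, one at a time (a one-shot iterator)."""
    yield {"id": 1}
    yield {"id": 2}
    yield {"id": 3}


def looks_nested(data):
    # Hypothetical stand-in for a pre-check that inspects the input.
    # The inner list is non-empty, hence truthy, so any() stops after
    # pulling exactly one record from `data`.
    return any([isinstance(v, dict) for v in rec.values()] for rec in data)


gen = records()
looks_nested(gen)             # advances the generator by one record
remaining_from_gen = list(gen)

lst = list(records())
looks_nested(lst)             # a list can be iterated again from the start
remaining_from_list = list(lst)

print(len(remaining_from_gen), len(remaining_from_list))  # prints: 2 3
```

A list survives the check because each new loop over it starts from the beginning, whereas a generator resumes from wherever the previous consumer stopped.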
Ahh, I see. This was my first time passing a generator to json_normalize and it seemed like it worked since I had many records. Perhaps a warning or error could be raised if a generator is passed to this method. Shall I close this now?
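A guard of the kind suggested here could look roughly like the sketch below. `reject_one_shot_iterators` is a hypothetical helper written for illustration, not part of pandas; it relies only on the standard-library `collections.abc.Iterator` check.

```python
from collections.abc import Iterator


def reject_one_shot_iterators(data):
    """Raise if `data` can only be iterated once (hypothetical guard)."""
    # Generators and other iterators are instances of Iterator;
    # lists, tuples and dicts are iterable but are not iterators.
    if isinstance(data, Iterator):
        raise TypeError("data is a one-shot iterator; pass list(data) instead")
    return data


def records():
    yield {"a": 1}
    yield {"a": 2}


checked = reject_one_shot_iterators([{"a": 1}, {"a": 2}])  # passes through

try:
    reject_one_shot_iterators(records())
    raised = False
except TypeError:
    raised = True
```

Calling the guard before normalizing would turn the silent loss of a record into an explicit error.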
@WillAyd is this intended to be supported?
Losing the first record is a nasty surprise - would take a patch here for sure
@WillAyd did some debugging and found out the issue is caused by this line https://github.com/pandas-dev/pandas/blob/master/pandas/io/json/_normalize.py#L270 The for loop with
I am favouring the additional runtime option, since I think the user will only provide a generator if there are memory constraints. But LMK if you see this differently
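The options weighed in the comment above are not fully visible here, so the following is only one possible reading of a memory-friendly approach, not the patch that was eventually written. `peek_first` is a hypothetical helper that inspects the first record and then puts it back in front of the rest, so nothing is materialized into a list and nothing is lost.

```python
import itertools


def peek_first(iterable):
    """Return (first_item, iterator_over_all_items) without losing data."""
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        # Empty input: nothing to inspect, hand back an empty iterator.
        return None, iter(())
    # Put the inspected record back in front of the remaining ones.
    return first, itertools.chain([first], it)


def records():
    yield {"id": 1}
    yield {"id": 2}


first, restored = peek_first(records())
all_records = list(restored)

print(first, len(all_records))  # prints: {'id': 1} 2
```

The simpler alternative is `list(data)` up front, which costs memory proportional to the number of records but leaves the rest of the code unchanged.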
take
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
Code Sample, a copy-pastable example
Only one value is returned with this:
This returns all values though:
And so does this:
Problem description
Using pd.json_normalize() on a generator always seems to reduce the expected results by 1. I first noticed this on a REST API where a column informed me that I should expect 901 results but I kept getting 900 results each time. When I tried to append the results to a list and normalize that, I got the expected 901 results.
Expected Output
Perhaps this is an expected output. It just caused me some headaches earlier and it was not immediately obvious that I was missing one record. I would expect that my example above would result in the same 2 row DataFrame.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0