
[BUG] udf_test udf_cudf_test failed require_minimum_pandas_version check in spark 320+ #4378

Closed
pxLi opened this issue Dec 17, 2021 · 6 comments · Fixed by #4419 or #4433
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@pxLi
Member

pxLi commented Dec 17, 2021

Describe the bug
It seems the pandas library bundled with Spark 3.2.0+ is incompatible with the one installed as cudf's dependency.

[2021-12-17T11:14:48.244Z] ==================================== ERRORS ====================================
[2021-12-17T11:14:48.244Z] ______________ ERROR collecting src/main/python/udf_cudf_test.py _______________
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_cudf_test.py:21: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:27: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     import pandas
[2021-12-17T11:14:48.244Z] ../../../spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py:31: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:35: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
[2021-12-17T11:14:48.244Z] E   AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] 
[2021-12-17T11:14:48.244Z] During handling of the above exception, another exception occurred:
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_cudf_test.py:24: in <module>
[2021-12-17T11:14:48.244Z]     raise AssertionError("incorrect pandas version during required testing " + str(e))
[2021-12-17T11:14:48.244Z] E   AssertionError: incorrect pandas version during required testing partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] _________________ ERROR collecting src/main/python/udf_test.py _________________
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_test.py:21: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:27: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     import pandas
[2021-12-17T11:14:48.244Z] ../../../spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py:31: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:35: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
[2021-12-17T11:14:48.244Z] E   AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] 
[2021-12-17T11:14:48.244Z] During handling of the above exception, another exception occurred:
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_test.py:24: in <module>
[2021-12-17T11:14:48.244Z]     raise AssertionError("incorrect pandas version during required testing " + str(e))
[2021-12-17T11:14:48.244Z] E   AssertionError: incorrect pandas version during required testing partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 17, 2021
@tgravescs
Collaborator

spark bumped up the minimum pandas version to 1.0.5 with apache/spark@3657703

That change went into 3.3.0, and branch 3.2 does not have it, so I'm curious why this started failing. If cudf had changed its version requirement I would expect failures in other places as well. Spark 3.2 shipped requiring pandas version 0.23.2 (https://github.com/apache/spark/blob/v3.2.0/python/pyspark/sql/pandas/utils.py#L23).

cudf seems to require pandas>=1.0,<1.4.0dev0, which hasn't changed recently.
https://github.com/rapidsai/cudf/blob/branch-22.02/conda/environments/cudf_dev_cuda11.5.yml#L19

From the Jenkinsfile for this build it looks like we are using CUDA 11.0 and 11.2 images, which I don't think are supported anymore. I think we need to change to 11.5. @pxLi @NvTimLiu @GaryShen2008 could you take a look?

@tgravescs tgravescs added the P0 Must have for release label Dec 20, 2021
@NvTimLiu NvTimLiu self-assigned this Dec 21, 2021
@NvTimLiu
Collaborator

I'll check this issue.

@NvTimLiu
Collaborator

It seems we're not importing the real pandas module when running the cudf-udf tests: there is a pandas directory at spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas under the PYTHONPATH environment variable: https://github.com/NVIDIA/spark-rapids/blob/branch-22.02/jenkins/spark-tests.sh#L109
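A quick way to confirm this kind of shadowing (a hypothetical diagnostic, not something from the thread) is to ask importlib which file a module would be loaded from, without importing it. Run with the same PYTHONPATH as the test job, this shows whether pandas resolves to the conda site-packages copy or to the spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas directory:

```python
import importlib.util

def resolve(module_name):
    """Return the file a module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None

# Example with a stdlib package, since pandas may not be installed here;
# in the CI environment one would call resolve("pandas") instead.
origin = resolve("json")
print(origin)
```

If the printed path points into the Spark distribution rather than site-packages, the wrong pandas is first on the path.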

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 21, 2021
@sameerz sameerz added this to the Dec 13 - Jan 7 milestone Dec 21, 2021
@NvTimLiu
Collaborator

> spark bumped up the minimum pandas version to 1.0.5 with apache/spark@3657703
>
> That change went into 3.3.0, branch 3.2 does not have that change, so I'm curious why this started failing unless cudf changed their version but then I would expect it to fail other places. 3.2 shipped requiring pandas version 0.23.2 (https://github.com/apache/spark/blob/v3.2.0/python/pyspark/sql/pandas/utils.py#L23)
>
> Cudf seems to require: pandas>=1.0,<1.4.0dev0 which hasn't changed recently. https://github.com/rapidsai/cudf/blob/branch-22.02/conda/environments/cudf_dev_cuda11.5.yml#L19
>
> From the jenkinsfile for this build it looks like we are using 11.0 and 11.2 cuda images, which I don't think are supported any more. I think we need to change to the 11.5. @pxLi @NvTimLiu @GaryShen2008 could you take a look ?

Official CUDA 11.5 Docker images are not yet available at https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated, so we are still using the CUDA 11.0/11.2 runtime.

We will update to 11.5 once the official images are online. @tgravescs

@NvTimLiu
Collaborator

NvTimLiu commented Dec 22, 2021

Reason for the failure:

  • There is a pandas Python package in the directory spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas, and our spark-tests.sh#L109 includes it via PYTHONPATH.

  • This spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas directory causes the error below:

[root@0de42b9f44bd /]# python --version
Python 3.8.12
[root@0de42b9f44bd /]# export PYTHONPATH=/jars/spark-3.2.0-bin-hadoop3.2/python:/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/:/jars/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip
[root@0de42b9f44bd /]# python
>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py", line 31, in <module>
    require_minimum_pandas_version()
  File "/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/pandas/utils.py", line 35, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)

  • There is no such directory (python/pyspark/pandas) under Spark 3.1.x or earlier versions, so this issue only happens on Spark 3.2.0 or later.
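The failure mode above can be reproduced in isolation. The sketch below (hypothetical, not the CI setup) creates a directory named pandas whose __init__.py reads pandas.__version__ at import time, the way pyspark/pandas/__init__.py calls require_minimum_pandas_version(). When that directory shadows the real pandas on PYTHONPATH, the "import pandas" inside it gets the partially initialized module back from sys.modules, and __version__ is not set yet:

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Build a shadowing "pandas" package that checks its own version on import,
    # mimicking pyspark/pandas/__init__.py's require_minimum_pandas_version().
    pkg = os.path.join(d, "pandas")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("import pandas\n")
        f.write("print(pandas.__version__)\n")

    # Put the shadowing directory on PYTHONPATH, ahead of site-packages,
    # then try to import pandas in a fresh interpreter.
    env = dict(os.environ, PYTHONPATH=d)
    result = subprocess.run(
        [sys.executable, "-c", "import pandas"],
        env=env, capture_output=True, text=True,
    )

# The import fails with the same AttributeError seen in the CI logs.
print(result.returncode != 0)
print("AttributeError" in result.stderr)
```

This matches the traceback in the issue: the version check runs while the module dict of the shadowing package is still being populated.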

To fix:

  • Prepend the conda site-packages path to PYTHONPATH, so the real pandas is imported from conda instead of from the Spark 3.2.0+ binary path:
  [root@0de42b9f44bd]# export PYTHONPATH=/opt/conda/lib/python3.8/site-packages:/jars/spark-3.2.0-bin-hadoop3.2/python:/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/:/jars/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip
[root@0de42b9f44bd spark-rapids]# python
Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.3.5'
>>>
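The fix works because module resolution follows sys.path order: with both a real-looking install and a shadowing copy on the path, whichever directory comes first wins. A minimal sketch of that principle, using made-up paths rather than the real conda and Spark layouts:

```python
import importlib.util
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Two empty "pandas" packages: one standing in for the conda install,
    # one for the copy inside the Spark distribution.
    conda_site = os.path.join(root, "conda", "site-packages")
    spark_dir = os.path.join(root, "spark", "python", "pyspark")
    for base in (conda_site, spark_dir):
        os.makedirs(os.path.join(base, "pandas"))
        open(os.path.join(base, "pandas", "__init__.py"), "w").close()

    # Broken ordering: the Spark directory shadows the conda install.
    sys.path.insert(0, spark_dir)
    broken = importlib.util.find_spec("pandas").origin

    # The fix: prepend the conda site-packages path.
    sys.path.insert(0, conda_site)
    fixed = importlib.util.find_spec("pandas").origin

    sys.path.remove(spark_dir)
    sys.path.remove(conda_site)

print(broken.startswith(spark_dir))
print(fixed.startswith(conda_site))
```

Prepending (rather than appending) is what matters here: an appended conda path would still lose to the Spark directories already on PYTHONPATH.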

@NvTimLiu NvTimLiu linked a pull request Dec 23, 2021 that will close this issue
@NvTimLiu
Collaborator

Closing as #4419 has been merged.
