
Support barrier mode for mapInPandas/mapInArrow #10364

Merged
merged 1 commit into from
Feb 2, 2024

Conversation

wbo4958
Collaborator

@wbo4958 wbo4958 commented Feb 2, 2024

To fix #10344

Spark 3.5 introduced barrier-mode support for mapInPandas/mapInArrow; more detail can be found at https://issues.apache.org/jira/browse/SPARK-42896. However, spark-rapids did not pick up this feature, which resulted in unexpected behavior. For example:

spark.range(10).mapInPandas(lambda x: x, "id long", True)

The tasks for the code above run in barrier mode on the CPU, but in non-barrier mode on the GPU with spark-rapids.
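
For readers unfamiliar with the flag, here is a minimal sketch of the same call with the barrier argument passed by keyword. It assumes PySpark 3.5 or later, where `DataFrame.mapInPandas` accepts `barrier`; the names `identity_batches` and `run_barrier_example` are illustrative only and are not part of this PR.

```python
def identity_batches(batches):
    """Pass every incoming pandas DataFrame batch through unchanged."""
    for batch in batches:
        yield batch


def run_barrier_example(spark):
    """Run mapInPandas in barrier mode.

    Assumes `spark` is a live SparkSession on PySpark 3.5+;
    the `barrier` keyword does not exist in earlier releases.
    """
    df = spark.range(10).mapInPandas(identity_batches, "id long", barrier=True)
    # Sort on the driver so the result does not depend on partition order.
    return sorted(row.id for row in df.collect())
```

Calling `run_barrier_example(spark)` once on the CPU and once with the plugin enabled should return the same ids; the difference described above is in how the stage is scheduled, not in the returned data.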

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
@wbo4958 wbo4958 requested a review from firestarman February 2, 2024 01:08
@wbo4958
Collaborator Author

wbo4958 commented Feb 2, 2024

build

@@ -425,3 +426,41 @@ def test_func(spark):
lambda data: [pd.DataFrame([len(list(data))])], schema="ret:integer")

assert_gpu_and_cpu_are_equal_collect(test_func, conf=arrow_udf_conf)


@pytest.mark.skipif(is_before_spark_350(),
Collaborator

@firestarman firestarman Feb 2, 2024

NIT: Better to ignore order for result comparison.

Collaborator Author

No need to do that: since there is only 1 partition, the result order will not be messed up.
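
To illustrate the trade-off discussed in this thread, here is a rough sketch of an ordered versus order-insensitive comparison. The helper `rows_equal` is hypothetical and is not the integration test framework's comparison code; it only shows why ignoring order matters when rows can arrive in a different sequence, and why it is unnecessary when a single partition yields a fixed order.

```python
def rows_equal(cpu_rows, gpu_rows, ignore_order=False):
    """Compare two collected results, optionally ignoring row order.

    Hypothetical helper for illustration; rows must be sortable
    when ignore_order is True.
    """
    if ignore_order:
        return sorted(cpu_rows) == sorted(gpu_rows)
    return list(cpu_rows) == list(gpu_rows)
```

With one partition both runs produce rows in the same sequence, so the strict comparison (`ignore_order=False`) is sufficient.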

@wbo4958 wbo4958 merged commit cca5955 into NVIDIA:branch-24.04 Feb 2, 2024
40 of 41 checks passed
@wbo4958 wbo4958 deleted the barrier branch February 2, 2024 06:15
jlowe added a commit to jlowe/spark-rapids that referenced this pull request Feb 2, 2024
This reverts commit cca5955.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
jlowe added a commit that referenced this pull request Feb 2, 2024
…0369)

This reverts commit cca5955.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@sameerz sameerz added the task Work required that improves the product but is not user facing label Feb 4, 2024
wbo4958 added a commit to wbo4958/spark-rapids that referenced this pull request Feb 5, 2024
Signed-off-by: Bobby Wang <wbo4958@gmail.com>
jlowe pushed a commit that referenced this pull request Feb 6, 2024
* Support barrier mode for mapInPandas/mapInArrow (#10364)

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

* support databricks

* license

---------

Signed-off-by: Bobby Wang <wbo4958@gmail.com>