Add in a GpuOOM exception so running out of GPU memory is not a fatal to Spark (NVIDIA#995)

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
revans2 authored Mar 8, 2023
1 parent 3d339cc commit 47b120c
Showing 3 changed files with 34 additions and 2 deletions.
32 changes: 32 additions & 0 deletions src/main/java/com/nvidia/spark/rapids/jni/GpuOOM.java
@@ -0,0 +1,32 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.nvidia.spark.rapids.jni;

/**
* A special version of an out of memory error that indicates we ran out of GPU memory. This is
* mostly to avoid a fatal error that would force the worker process to restart. This should be
* recoverable on the GPU.
*/
public class GpuOOM extends RuntimeException {
  public GpuOOM() {
    super();
  }

  public GpuOOM(String message) {
    super(message);
  }
}
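The practical effect of extending `RuntimeException` rather than `OutOfMemoryError` can be illustrated with a small, self-contained sketch. The nested `GpuOOM` class below is a stand-in for the class added in this commit so that the example compiles on its own, and `handledAs` is a hypothetical helper that is not part of the library. It only shows which kind of `catch` clause sees each throwable; it makes no claim about how Spark itself reacts.

```java
class GpuOomCatchSketch {
  // Stand-in for com.nvidia.spark.rapids.jni.GpuOOM (assumption: same shape as in this commit).
  static class GpuOOM extends RuntimeException {
    GpuOOM(String message) {
      super(message);
    }
  }

  // Hypothetical helper: runs the body and reports which handler caught the failure.
  static String handledAs(Runnable body) {
    try {
      body.run();
      return "completed";
    } catch (Exception e) {
      // RuntimeException subclasses such as GpuOOM land here.
      return "Exception handler: " + e.getClass().getSimpleName();
    } catch (Error e) {
      // java.lang.OutOfMemoryError is an Error, so it lands here instead.
      return "Error handler: " + e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(handledAs(() -> { throw new GpuOOM("simulated GPU OOM"); }));
    System.out.println(handledAs(() -> { throw new OutOfMemoryError("simulated JVM OOM"); }));
  }
}
```

Code that catches `Exception` therefore handles a `GpuOOM` like any other task failure, while an `OutOfMemoryError` bypasses that handler entirely.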
2 changes: 1 addition & 1 deletion src/main/java/com/nvidia/spark/rapids/jni/RetryOOM.java
@@ -20,7 +20,7 @@
* A special version of an out of memory error that indicates we ran out of memory, but should
* roll back to a point when all memory for the task is spillable and then retry the operation.
*/
-public class RetryOOM extends OutOfMemoryError {
+public class RetryOOM extends GpuOOM {
   public RetryOOM() {
     super();
   }
2 changes: 1 addition & 1 deletion src/main/java/com/nvidia/spark/rapids/jni/SplitAndRetryOOM.java
@@ -21,7 +21,7 @@
* roll back to a point when all memory for the task is spillable and then retry the operation
* with the input data split to make it ideally use less GPU memory overall.
*/
-public class SplitAndRetryOOM extends OutOfMemoryError {
+public class SplitAndRetryOOM extends GpuOOM {
   public SplitAndRetryOOM() {
     super();
   }
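With both retry exceptions now sharing `GpuOOM` as a parent, a caller can react to the specific subclass first and fall back to the common type. The sketch below is one possible shape for that, under loud assumptions: the nested exception classes are stand-ins that mirror the hierarchy in this commit, and `process` and `processWithSplit` are hypothetical helpers that simulate "does not fit" with a row limit. It is not the plugin's actual retry logic, which also involves making task memory spillable.

```java
import java.util.List;

class OomRetrySketch {
  // Stand-ins mirroring the hierarchy introduced by this commit.
  static class GpuOOM extends RuntimeException {
    GpuOOM(String message) { super(message); }
  }

  static class SplitAndRetryOOM extends GpuOOM {
    SplitAndRetryOOM(String message) { super(message); }
  }

  // Hypothetical work function: pretends a batch with more than `limit` rows does not fit.
  static int process(List<Integer> batch, int limit) {
    if (batch.size() > limit) {
      throw new SplitAndRetryOOM("batch of " + batch.size() + " rows does not fit");
    }
    int sum = 0;
    for (int value : batch) {
      sum += value;
    }
    return sum;
  }

  // On SplitAndRetryOOM, halve the input and retry each half; give up with a plain
  // GpuOOM once a single row still does not fit.
  static int processWithSplit(List<Integer> batch, int limit) {
    try {
      return process(batch, limit);
    } catch (SplitAndRetryOOM e) {
      if (batch.size() <= 1) {
        throw new GpuOOM("cannot split further: " + e.getMessage());
      }
      int mid = batch.size() / 2;
      return processWithSplit(batch.subList(0, mid), limit)
          + processWithSplit(batch.subList(mid, batch.size()), limit);
    }
  }

  public static void main(String[] args) {
    // Splits [1..5] until every piece has at most 2 rows, then sums them.
    System.out.println(processWithSplit(List.of(1, 2, 3, 4, 5), 2)); // prints 15

    try {
      processWithSplit(List.of(7), 0);
    } catch (GpuOOM e) {
      // Catching the parent type covers both retry subclasses and the final failure.
      System.out.println("gave up: " + e.getMessage());
    }
  }
}
```

Because `GpuOOM` is the common parent, the outer `catch (GpuOOM e)` would also see an unhandled `SplitAndRetryOOM`, so more specific handlers need to appear before it.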