[rocm6.2_internal_testing] [SWDEV-469514] hipGraphExecDestroy requires an explicit sync #1455


Conversation


@pragupta commented Jul 9, 2024

There is a new HIP feature where hipGraph memory is no longer freed as soon as hipGraphExecDestroy is called; the memory is held until the next synchronization point in order to support async work on the GPU. See this for more details:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-user-objects

We noticed this issue when an allreduce op inside a hipGraph hung. Essentially, ncclCommAbort was waiting for all GPU activity to finish, but since the hipGraph memory was technically still in use, we had an infinite hang. So, I added an extra hipDeviceSynchronize in CUDAGraph's destructor to ensure that the memory is freed, which got the test_allreduce_in_cudagraph UT to pass.

However, when I ran this on a CUDA machine, I noticed that it did not require this extra sync to run the UT successfully. It turns out the CUDA path calls cudaGraphInstantiateWithFlags with cudaGraphInstantiateFlagAutoFreeOnLaunch, which aggressively frees memory after graph launch. Our ROCm stack supports this API, but the CUDA-to-HIP mappings were missing in PyTorch. So, I brought them in and added the necessary conditions to call this API in the HIP case as well.
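To illustrate the pattern described above, here is a minimal, self-contained HIP sketch: capture some stream work, instantiate with hipGraphInstantiateFlagAutoFreeOnLaunch, launch, and synchronize before destroying the exec. This is a sketch only, not the actual PyTorch change; the HIP_CHECK macro, the buffer, and the captured memset are illustrative assumptions.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(expr)                                              \
  do {                                                               \
    hipError_t _err = (expr);                                        \
    if (_err != hipSuccess) {                                        \
      std::fprintf(stderr, "HIP error %s at %s:%d\n",                \
                   hipGetErrorString(_err), __FILE__, __LINE__);     \
      std::exit(EXIT_FAILURE);                                       \
    }                                                                \
  } while (0)

int main() {
  hipStream_t stream;
  HIP_CHECK(hipStreamCreate(&stream));

  void* buf = nullptr;
  HIP_CHECK(hipMalloc(&buf, 1 << 20));

  // Capture some stream work into a graph.
  hipGraph_t graph;
  HIP_CHECK(hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal));
  HIP_CHECK(hipMemsetAsync(buf, 0, 1 << 20, stream));
  HIP_CHECK(hipStreamEndCapture(stream, &graph));

  // Instantiate with the flag the CUDA path passes: for graphs that own
  // memory allocations, each launch frees what the previous launch left
  // allocated instead of holding it until a later sync point.
  hipGraphExec_t exec;
  HIP_CHECK(hipGraphInstantiateWithFlags(
      &exec, graph, hipGraphInstantiateFlagAutoFreeOnLaunch));

  HIP_CHECK(hipGraphLaunch(exec, stream));

  // The frees are asynchronous, so synchronize before destroying the exec;
  // otherwise the graph's memory can still be "in use" at destruction time.
  HIP_CHECK(hipDeviceSynchronize());
  HIP_CHECK(hipGraphExecDestroy(exec));
  HIP_CHECK(hipGraphDestroy(graph));
  HIP_CHECK(hipFree(buf));
  HIP_CHECK(hipStreamDestroy(stream));
  return 0;
}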

// There are recent HIP changes where hipGraphExecDestroy doesn't immediately free memory.
// Instead, the memory is freed at the next synchronization point, which ensures that all
// hipGraphLaunch calls have finished before any memory is released. This behavior was enabled in ROCm 6.2.
#if (defined(ROCM_VERSION) && ROCM_VERSION >= 60200)
Collaborator
Is the extra sync required now?

Author
Yes. Since cudaGraphInstantiateFlagAutoFreeOnLaunch only adds async frees after each launch, we still need to ensure all async operations finish before destroying the graph.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cudagraphinstantiateflagautofreeonlaunch
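To make the ordering concrete, here is a hypothetical teardown helper (an illustration, not code from this PR; the name destroyGraphExecSafely is made up): the frees added by the flag are themselves asynchronous stream work, so a sync has to come before hipGraphExecDestroy.

#include <hip/hip_runtime.h>

// After hipGraphLaunch(exec, stream), the frees added by
// hipGraphInstantiateFlagAutoFreeOnLaunch are still asynchronous work on
// that stream. Destroying the exec while they are pending leaves the graph
// memory "in use", which is what stalled ncclCommAbort in the description.
inline hipError_t destroyGraphExecSafely(hipGraphExec_t exec) {
  // Wait for all outstanding GPU work, including the deferred frees...
  hipError_t err = hipDeviceSynchronize();
  if (err != hipSuccess) {
    return err;
  }
  // ...and only then release the executable graph.
  return hipGraphExecDestroy(exec);
}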

Collaborator

@jithunnair-amd commented Jul 9, 2024
@pragupta I'm not sure I understand why this is ROCm-only logic?

Author

@pragupta commented Jul 10, 2024
@jithunnair-amd I think CUDA will need this change as well; they just haven't run into the issue yet. If you really stress test it and the async frees don't finish before hitting the destructor, then you'd need it.

Collaborator

@jeffdaily @pruthvistony I had an offline discussion with @pragupta on this. I feel it's okay to merge this patch into our ROCm fork as a ROCm-conditional change, but for the upstream PR I'd rather see it unconditional, since our understanding is that this issue affects CUDA as well if the flow reaches the destructor before the async frees are executed. For the upstream PR, I suggested that Prachi create a new test case that generates this scenario, so we can easily justify why these extra syncs are needed. Let me know your thoughts.

Collaborator

@pruthvistony commented:
This HIP change should affect other PyTorch branches, so all failing release branches would need this change cherry-picked.

@jithunnair-amd changed the title [SWDEV-469514] hipGraphExecDestroy requires an explicit sync to [rocm6.2_internal_testing] [SWDEV-469514] hipGraphExecDestroy requires an explicit sync on Jul 12, 2024
@jithunnair-amd merged commit e752b4f into ROCm:rocm6.2_internal_testing on Jul 12, 2024
pragupta added a commit to pragupta/pytorch that referenced this pull request Jul 12, 2024: … sync (ROCm#1455) (cherry picked from commit e752b4f)
pragupta added a commit to pragupta/pytorch that referenced this pull request Jul 12, 2024: … sync (ROCm#1455) (cherry picked from commit e752b4f)
pragupta added a commit to pragupta/pytorch that referenced this pull request Jul 12, 2024: … sync (ROCm#1455) (cherry picked from commit e752b4f)
jithunnair-amd added a commit that referenced this pull request Jul 12, 2024: … sync (#1455) (#1470) (cherry picked from commit e752b4f)
jithunnair-amd added a commit that referenced this pull request Jul 12, 2024: … sync (#1455) (#1471) (cherry picked from commit e752b4f)
jithunnair-amd added a commit that referenced this pull request Jul 12, 2024: … sync (#1455) (#1472) (cherry picked from commit e752b4f)
pragupta added a commit to pragupta/pytorch that referenced this pull request Jul 12, 2024: … sync (ROCm#1455) (cherry picked from commit e752b4f)
pragupta added a commit to pragupta/pytorch that referenced this pull request Jul 12, 2024: …t sync (ROCm#1455) (cherry picked from commit e752b4f)
pruthvistony pushed a commit that referenced this pull request Jul 15, 2024: … sync (#1455) (#1473) (cherry picked from commit e752b4f)
pruthvistony pushed a commit that referenced this pull request Jul 15, 2024: …t sync (#1455) (#1474) (cherry picked from commit e752b4f)
dnikolaev-amd pushed a commit that referenced this pull request Aug 1, 2024: … sync (#1455) (#1472) (cherry picked from commit e752b4f)
jithunnair-amd added a commit that referenced this pull request Oct 23, 2024: … sync (#1455) (#1472) (cherry picked from commits e752b4f and d6b8773)
jithunnair-amd added a commit that referenced this pull request Mar 17, 2025: … sync (#1455) (#1472) (cherry picked from commit e752b4f)