
Shared system memory is not used in Tensorflow #1251

Open
tarcey opened this issue Feb 7, 2021 · 2 comments
tarcey commented Feb 7, 2021


System information

  • TensorFlow version (you are using): 2.4
  • Are you willing to contribute it (Yes/No): ?

Describe the feature and the current behavior/state.

When using large models and/or large batch sizes that would not fit into the GPU's VRAM, one might expect system RAM to be used in addition to VRAM to avoid OOM, similar to how TensorFlow works with NVIDIA's UVM. However, this is not the case, and the program crashes with an OOM error. ROCm already appears to support unified memory, but tensorflow-rocm just doesn't make use of it.
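For reference, on CUDA builds this spill-to-RAM behavior can reportedly be triggered through the legacy session options: a `per_process_gpu_memory_fraction` greater than 1.0 switches the allocator to CUDA unified memory (`cudaMallocManaged`). A minimal sketch of that configuration, assuming a CUDA build of TF 2.x (tensorflow-rocm currently has no equivalent effect):

```python
import tensorflow as tf

# On CUDA builds, a memory fraction above 1.0 is documented to enable
# unified memory, letting GPU allocations spill into system RAM instead
# of failing with OOM. On tensorflow-rocm this option does nothing.
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=2.0)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
session = tf.compat.v1.Session(config=config)
```

Having the same switch work on ROCm (presumably via `hipMallocManaged` underneath) is what this request is about.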

Will this change the current API? How?

No

Who will benefit from this feature?

Anyone, especially in situations like the one explained under 'Any other info'.

Any other info.

ROCm version: 4.0.1
GPU: Vega FE (gfx900)

When using batches with varying dimensions, as in e.g. sequential models, a few outlier batches with particularly long sequences can lead to an unexpected OOM crash after hours of training. With unified memory, such situations could be avoided without having to resort to small batch sizes and accepting underutilization of resources. The performance penalty of unified memory would only affect those few outlier batches, and the performance benefit of larger batch sizes would outweigh this cost, because the majority of batches still fit into VRAM.
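To make the outlier effect concrete, here is a back-of-the-envelope sketch. The sizes are hypothetical and it counts only a single activation tensor, but it shows how one long padded batch multiplies the memory footprint of the entire batch:

```python
def batch_activation_bytes(batch_size, seq_len, hidden, dtype_bytes=4):
    # Bytes for one float32 activation tensor of shape [batch, seq, hidden].
    return batch_size * seq_len * hidden * dtype_bytes

# Typical batch: 64 sequences padded to length 128, hidden size 1024.
typical = batch_activation_bytes(64, 128, 1024)    # 32 MiB
# Outlier batch padded to length 2048: same code path, 16x the memory.
outlier = batch_activation_bytes(64, 2048, 1024)   # 512 MiB
```

A handful of such outlier batches is enough to push a run that normally fits in VRAM over the edge; spilling just those batches to system RAM is much cheaper than shrinking the batch size for the whole run.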

@sunway513

@deven-amd can you help look at this issue?

deven-amd self-assigned this Mar 4, 2021

daiaji commented Mar 21, 2023

Is there any progress? This would be helpful for training models on consumer-grade hardware.
