Low performance on RX 580 with TF benchmarks #42
The GTX 1080 Ti got a total of 32.48 images/sec, with this error in the console output:

E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost' in binary running on DT. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.)
Thank you for reporting your benchmark results! This is a preview and we only have a limited set of operators implemented at the moment, so results like this are not totally unexpected. As operator support gets closer to what CUDA/ROCm supports, we expect performance to get better and we'll be able to focus on it a lot more. We'll definitely look into this benchmark though and see where the bottlenecks are.
First, absolutely thank you! Being able to do this from any OS that supports DirectX 12 is amazing. Second, if I can help, let me know. Summary:

Hardware: stock Acer Predator Helios 500 PH517-61-R0GX gaming laptop, AMD Ryzen 7 2700 desktop processor, AMD Radeon RX Vega 56

DirectML results (python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50):

Results with --enable_optimizations=0 (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0; roughly what this flag disables is sketched after this summary):

ROCm results (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50):
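For context on the --enable_optimizations=0 run above: that tf_cnn_benchmarks flag broadly disables graph rewrites, and the grappler remapper pass is the optimizer logging the '_CopyFromGpuToHost' error quoted earlier in the thread. A minimal sketch, assuming plain TF 1.15 session APIs (this is not part of tf_cnn_benchmarks), of turning off just that one pass:

```python
# A minimal sketch for TF 1.15 (which tensorflow-directml is based on):
# disable only the grappler "remapper" pass, the optimizer that logs the
# '_CopyFromGpuToHost' error, instead of disabling all optimizations.
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

config = tf.compat.v1.ConfigProto()
config.graph_options.rewrite_options.remapping = (
    rewriter_config_pb2.RewriterConfig.OFF)

with tf.compat.v1.Session(config=config) as sess:
    a = tf.random.uniform([256, 256])  # toy workload for illustration
    print(sess.run(tf.reduce_sum(tf.matmul(a, a))))
```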
It's great to hear that the DirectML stack is working well for you! These results are interesting, and it's good to hear that it's behaving in a stable manner, because stability and correctness are something we invest a lot of time in. As @PatriceVignola mentioned, this is a super early preview and we're still working hard on it, so you can definitely expect the performance to improve as time goes on. For example, I suspect one of the reasons why…
@MatPoliquin, @sunshinejnjn, @ashaver, we just uploaded a new package that improves the performance of TensorFlow DirectML devices across the board. The package (1.15.3.dev200626) is now on PyPI and can be installed with pip install tensorflow-directml if it's your first time installing it, or with pip install tensorflow-directml --upgrade if you installed the previous 1.15.3.dev200619 release. On a Radeon RX Vega, we see a ~63% performance increase for batch_size=16 and a ~47% performance increase for batch_size=32. These improvements are not limited to AMD cards, so we expect similar improvements for Nvidia and Intel graphics. We realize that there is still a lot of room for improvement to catch up with ROCm and CUDA, but we aim to release packages regularly and keep the community updated on our progress. All feedback and data we receive is very helpful as we work on closing the performance and functionality gap. Here are the full results for a Radeon RX Vega with a batch size of 16:
And here are the full results for a Radeon RX Vega with a batch size of 32:
Edit: Clarify package release timelines.
Just tried the new 1.15.3.dev200626 version; I actually get worse performance on the RX 580.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
This is interesting. I don't have access to an RX 580 at the moment, but we tried with 3 different AMD cards (Radeon VII, Radeon RX Vega and Radeon RX 5700 XT) and saw a 50% performance increase on average. I have a few questions to help me understand the issue:
Also, if you don't mind, could you take a trace, upload it somewhere and send us the link?
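For anyone wondering how to capture a trace like the one requested above: the maintainers may have a DirectML-specific tool in mind, but a generic TF-level timeline can be captured with standard TF 1.15 APIs. A minimal sketch, where the toy matmul stands in for a real training step:

```python
# A minimal sketch using standard TF 1.15 tracing; writes trace.json, which
# can be opened in chrome://tracing and uploaded/shared.
import tensorflow as tf
from tensorflow.python.client import timeline

a = tf.random.uniform([512, 512])  # toy op standing in for a model
out = tf.reduce_sum(tf.matmul(a, a))

run_options = tf.compat.v1.RunOptions(
    trace_level=tf.compat.v1.RunOptions.FULL_TRACE)
run_metadata = tf.compat.v1.RunMetadata()

with tf.compat.v1.Session() as sess:
    sess.run(out, options=run_options, run_metadata=run_metadata)

# Convert the collected step stats into a Chrome trace file.
tl = timeline.Timeline(run_metadata.step_stats)
with open("trace.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```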
EDIT: For some reason I did not install the 1.15.3.dev200626 version properly. I reinstalled it and now I get better performance.
Note:
I'm testing this on various devices, including Intel iGPUs, AMD iGPUs (AMD dGPUs on native Linux not tested for now), and NVIDIA 9xx/10xx/20xx systems. The iGPUs are the most interesting part. Another device is a Dell XPS 15 9550, an i7-6700HQ with Intel HD 530 running the latest Intel beta driver; it ran this at 1.8 images/sec (on the Intel iGPU). Windows 10 20H1, TF-DML build 200626. Later I'm going to test this on an Intel i5-4000 with iGPU to see if it can run.
@MatPoliquin Ah, these numbers make more sense. Thank you for double-checking! As @adtsai said, we haven't optimized our memory allocator yet, so the performance increase for larger batch sizes is less noticeable and we end up using more memory than necessary, but we're working on improving it. @sunshinejnjn What are the models of the iGPUs/dGPUs that crashed or froze while running the benchmark?
With the 2020-06-26 package (tensorflow_directml-1.15.3.dev200626-cp37-cp37m-win_amd64), my results on a Titan V (driver 451.58):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

On a Vega 56 (20.20 branch, 27.20.2001.5002), similar to others:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
EDIT:
Adding my AMD 5700 XT results (no overclock) in case it helps verify the expected outcome. Windows 10 2004, AMD drivers 20.5.1-ghs-beta:
and updated drivers 20.7.1
Windows 10 2004/AMD Driver v27.20.1017.1011/DirectML
Hi, now I get:
In June I was getting:
I get lots of messages in the console output like the one posted below. It seems I wasn't getting these in my June testing.
@oscarbg Were you not getting these logs with the previous package? I think these logs are expected when running tf_cnn_benchmarks since it doesn't know anything about DML, but it should still try to fall back to DML instead of the CPU when possible. We'll investigate the performance regression.
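To see which ops actually fall back to the CPU versus running on the DML device, TF 1.15's standard device-placement logging can help. A minimal sketch; the exact device strings printed for DirectML (e.g. "/device:DML:0") are an assumption based on tensorflow-directml's device naming:

```python
# A minimal sketch (standard TF 1.15 option): log where each op is placed,
# to check which ops run on the DML device and which fall back to the CPU.
import tensorflow as tf

config = tf.compat.v1.ConfigProto(log_device_placement=True)
with tf.compat.v1.Session(config=config) as sess:
    a = tf.random.uniform([256, 256])  # toy workload for illustration
    # Placement lines for each op are printed to stderr as they are assigned.
    print(sess.run(tf.reduce_sum(tf.matmul(a, a))))
```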
Wanted to add my results! First off, this is some amazing work - it really helps increase the accessibility of ML tools! I am running a Zephyrus G14, which has two GPUs: the integrated Radeon and a 2060 Max-Q. Running in Windows for now (will try WSL 2 soon as well). I can confirm that I can get bigger batch sizes using the Radeon (which has access to 40 GB of installed system RAM - yes, 8 GB soldered plus a 32 GB stick!) than I can with the Max-Q (6 GB VRAM only). This really opens up a lot of possibilities, but it's all just pretty slow compared to the CPU (the 4900HS). First, the CPU:

Next, the Radeon (I had to disable the Max-Q in Device Manager; I couldn't find an easy way to make the gigantic TensorFlow benchmark pick /dml:1, which is the Radeon, /dml:0 being the Max-Q - see the sketch after this comment):

Sadly slower than the CPU, with access to the same amount of memory, so no difference in batch size there. Last but not least, the NVIDIA 2060 Max-Q:

I expect CUDA performance with WSL 2 to be a lot better, but that's not the point here! All in all, as performance improves this will really change the game for ML beginners and pros alike! Excited to be able to, AC-922 style, access the entire system memory for optimized GPU computations. Thanks again @microsoft for this important work.
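On the device-selection difficulty mentioned in the comment above: outside of tf_cnn_benchmarks, a specific adapter can be pinned with a standard tf.device scope. A minimal sketch; the "/device:DML:1" string assumes tensorflow-directml's DML:N naming, with the Radeon enumerating second as described above:

```python
# A minimal sketch, assuming adapters are exposed as "/device:DML:N"
# (DML:1 = the Radeon in the setup above; adjust the index per machine).
import tensorflow as tf

with tf.device("/device:DML:1"):
    a = tf.random.uniform([1024, 1024])  # toy op pinned to one adapter
    c = tf.reduce_sum(tf.matmul(a, a))

with tf.compat.v1.Session() as sess:
    print(sess.run(c))
```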
@metemadi Thank you for the additional data points - your setup looks awesome for ML! We're currently focused on improving stability and coverage, but the next step is obviously to be way more competitive with CUDA. So far we've focused on coverage of the ai-benchmark models, but the TF benchmarks repo is something we're starting to look into.
Just pointing out that the new 202104 release is much faster.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Maybe this issue can be closed? Thanks.
Confirming the RX Vega also gets a 2x speedup vs the latest results:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
I get low performance on TF benchmarks with my RX 580:
https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
using their example command:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
I get this error and performance result:
2020-06-19 16:01:17.369204: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost'
Step Img/sec total_loss
1 images/sec: 4.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.7 +/- 0.0 (jitter = 0.1) 7.593
20 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.696
30 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.753
40 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 8.007
50 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 7.520
60 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.989
70 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 8.028
80 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.932
90 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.850
100 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.798
total images/sec: 4.90
Note:
Info:
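For anyone reproducing this report, a quick way to confirm that the RX 580 was actually registered by tensorflow-directml is to list the local devices. A minimal sketch using TF 1.15's device_lib; the "DML" device type is an assumption based on tensorflow-directml's device naming:

```python
# A minimal sketch (TF 1.15): print every device TensorFlow can see.
# DirectML devices are expected to appear with device_type "DML".
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    print(d.device_type, d.name)
```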