Low performance on RX 580 with TF benchmarks #42

Open
MatPoliquin opened this issue Jun 19, 2020 · 19 comments

Comments

@MatPoliquin

I get low performance on TF benchmarks with my RX 580:
https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

using their example command:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

I get this error and performance result:
2020-06-19 16:01:17.369204: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost'

Step Img/sec total_loss
1 images/sec: 4.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.7 +/- 0.0 (jitter = 0.1) 7.593
20 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.696
30 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.753
40 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 8.007
50 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 7.520
60 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.989
70 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 8.028
80 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.932
90 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.850
100 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.798

total images/sec: 4.90
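For comparing runs like these side by side, the per-step rates can be scraped out of the console output with a few lines of Python (a quick throwaway helper, not part of the benchmark suite):

```python
import re

def parse_images_per_sec(log_text):
    """Pull the per-step images/sec values out of tf_cnn_benchmarks output."""
    # Only the per-step lines carry a "+/-" uncertainty; the "total" line does not.
    pattern = re.compile(r"images/sec:\s*([\d.]+)\s*\+/-")
    return [float(m.group(1)) for m in pattern.finditer(log_text)]

sample = """\
1 images/sec: 4.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.7 +/- 0.0 (jitter = 0.1) 7.593
total images/sec: 4.90
"""
rates = parse_images_per_sec(sample)
print(rates)                    # [4.8, 4.7]
print(sum(rates) / len(rates))  # 4.75
```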

Note:

  • GPU and VRAM usage are at 100%, so it's not using the CPU
  • I get around 88 images/s with the latest version of ROCm (Ubuntu 20.04) on this computer

Info:

  • RX 580 8GB driver 26.20.12028.2
  • Dual Intel Xeon 2680 v2
  • 64 GB RAM
  • Windows 10 2004
  • OS build 19041.329
  • python 3.7
@MatPoliquin MatPoliquin changed the title low performance on RX 580 with TF benchmarks Low performance on RX 580 with TF benchmarks Jun 19, 2020
@sunshinejnjn

A GTX 1080 Ti got a total of 32.48 images/sec.

E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost' in binary running on DT. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
Done warm up
Step Img/sec total_loss
1 images/sec: 35.7 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 32.2 +/- 1.1 (jitter = 4.0) 7.593
20 images/sec: 32.7 +/- 0.7 (jitter = 3.3) 7.696
30 images/sec: 32.3 +/- 0.7 (jitter = 3.7) 7.753
40 images/sec: 32.3 +/- 0.6 (jitter = 4.5) 8.007
50 images/sec: 32.7 +/- 0.5 (jitter = 4.2) 7.520
60 images/sec: 32.8 +/- 0.5 (jitter = 3.9) 7.988
70 images/sec: 32.5 +/- 0.5 (jitter = 3.9) 8.028
80 images/sec: 32.6 +/- 0.4 (jitter = 3.7) 7.932
90 images/sec: 32.4 +/- 0.4 (jitter = 4.0) 7.850
100 images/sec: 32.5 +/- 0.4 (jitter = 3.9) 7.795

total images/sec: 32.48

TensorFlow-GPU 1.15.3 (official) with CUDA got 174.89 images/sec.

The system is an AMD R7 1700X with 64 GB RAM, Windows 10 20H1. Also, half of the first screen flashes a few times throughout the benchmark.

@PatriceVignola
Contributor

Thank you for reporting your benchmark results! This is a preview and we only have a limited set of operators implemented at the moment, so results like this are not totally unexpected. As operator support gets closer to what CUDA/ROCm supports, we expect performance to get better and we'll be able to focus on it a lot more. We'll definitely look into this benchmark though and see where the bottlenecks are.

@ashaver

ashaver commented Jun 26, 2020

First, absolutely thank you! Being able to do this from any OS that supports DirectX12, that is amazing. Second, if I can help, let me know.

Summary:

  • Seeing a comparison right now of about 128 images/sec on ROCm versus 21 images/sec on DirectML.
  • Also, the jitter on ROCm is two orders of magnitude larger. (Maybe that helps make a fair comparison.)
  • The optimal batch size for my hardware for ROCm is 32 and DirectML is 16. Using a batch of 32 on DirectML was 3x slower.
  • The stack for DirectML (Windows/AMD driver/DirectML) is much more stable than the ROCm stack (Linux/AMD driver/ROCm). ROCm sometimes cannot even set clock frequencies (and has never been able to control fans); see the post I made in the ROCm speed comparison. I cannot tell you how much I appreciate a stable stack. Mean time to failure with ROCm was around an hour, which precludes any significant work (especially when the only recovery from a failure is to reboot). I have not had any issues with DirectML.

Hardware: Stock laptop Acer Predator Helios 500 PH517-61-R0GX Gaming Laptop, AMD Ryzen 7 2700 Desktop Processor, AMD Radeon RX Vega 56

DirectML Results (python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50)

Step    Img/sec total_loss
1       images/sec: 20.3 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 20.6 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 20.6 +/- 0.1 (jitter = 0.2) 7.726
30      images/sec: 20.5 +/- 0.1 (jitter = 0.2) 7.360
40      images/sec: 20.6 +/- 0.0 (jitter = 0.3) 7.526
50      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.999
70      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.978
80      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.884
90      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.924
100     images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.848
----------------------------------------------------------------
total images/sec: 20.65
----------------------------------------------------------------

Results with --enable_optimizations=0 (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0):

Step    Img/sec total_loss
1       images/sec: 30.1 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 30.1 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 30.1 +/- 0.1 (jitter = 0.1) 7.726
30      images/sec: 30.2 +/- 0.1 (jitter = 0.2) 7.360
40      images/sec: 30.1 +/- 0.0 (jitter = 0.2) 7.527
50      images/sec: 30.1 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.999
70      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.978
80      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.884
90      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.925
100     images/sec: 30.1 +/- 0.0 (jitter = 0.2) 7.848
----------------------------------------------------------------
total images/sec: 27.27
----------------------------------------------------------------

ROCm Results (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50):

Step	Img/sec	total_loss
1	images/sec: 131.4 +/- 0.0 (jitter = 0.0)	8.458
10	images/sec: 130.0 +/- 0.9 (jitter = 2.9)	7.997
20	images/sec: 129.1 +/- 0.6 (jitter = 2.2)	8.260
30	images/sec: 128.6 +/- 0.5 (jitter = 2.0)	8.338
40	images/sec: 128.4 +/- 0.4 (jitter = 2.3)	8.190
50	images/sec: 128.0 +/- 0.4 (jitter = 2.7)	7.742
60	images/sec: 128.2 +/- 0.4 (jitter = 2.4)	8.061
70	images/sec: 128.3 +/- 0.3 (jitter = 2.4)	inf
80	images/sec: 128.3 +/- 0.3 (jitter = 2.5)	inf
90	images/sec: 128.2 +/- 0.3 (jitter = 2.5)	inf
100	images/sec: 128.2 +/- 0.3 (jitter = 2.5)	inf
----------------------------------------------------------------
total images/sec: 128.13
----------------------------------------------------------------

@adtsai
Contributor

adtsai commented Jun 26, 2020

It's great to hear that the DirectML stack is working well for you! These results are interesting, and it's good to hear that it's behaving in a stable manner, because stability and correctness are things we invest a lot of time in.

As @PatriceVignola mentioned, this is a super early preview and we're still working hard on it, so you can definitely expect the performance to improve as time goes on. For example, I suspect one of the reasons --batch_size=32 is so much slower on DML is that we haven't optimized our memory allocator yet: at high batch sizes we end up using more VRAM than necessary in some circumstances, which leads to a performance cliff. But rest assured, we're working on it. :)

@PatriceVignola
Contributor

PatriceVignola commented Jun 27, 2020

@MatPoliquin, @sunshinejnjn, @ashaver, we just uploaded a new package that improves the performance of TensorFlow DirectML devices across the board. The package (1.15.3.dev200626) is now on PyPI and can be installed with

pip install tensorflow-directml

if it's your first time installing it or

pip install tensorflow-directml --upgrade

if you installed the previous 1.15.3.dev200619 release.
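Since the `.devYYMMDD` suffix encodes the build date, it's easy to confirm which build is newer and whether an upgrade actually took effect (a small illustrative sketch; checking the installed package itself, e.g. via `pkg_resources`, is omitted since it requires the package to be present):

```python
def dev_build(version):
    """Split a tensorflow-directml dev version string such as
    '1.15.3.dev200626' into its base version and YYMMDD build date."""
    base, _, date = version.partition(".dev")
    return base, date

old = dev_build("1.15.3.dev200619")
new = dev_build("1.15.3.dev200626")
print(new[1] > old[1])  # True: the .dev suffixes sort by build date
```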

On a Radeon RX Vega, we see a ~63% performance increase for batch_size=16 and a ~47% performance increase for batch_size=32. These improvements are not limited to AMD cards, though, so we expect similar improvements for Nvidia and Intel graphics.

We realize that there is still a lot of room for improvement to catch up with ROCm and CUDA, but we aim to release packages regularly and keep the community updated on our progress. All feedback and data that we receive is very helpful as we work on closing the performance and functionality gap.

Here are the full results for a Radeon RX Vega with a batch size of 16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 36.6 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 35.7 +/- 0.2 (jitter = 0.0) 7.854
20      images/sec: 35.6 +/- 0.1 (jitter = 0.0) 7.726
30      images/sec: 35.6 +/- 0.1 (jitter = 0.0) 7.360
40      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.526
50      images/sec: 35.6 +/- 0.1 (jitter = 0.0) 8.171
60      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.999
70      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.978
80      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.884
90      images/sec: 35.5 +/- 0.1 (jitter = 1.7) 7.924
100     images/sec: 35.5 +/- 0.1 (jitter = 1.7) 7.848
----------------------------------------------------------------
total images/sec: 35.48
----------------------------------------------------------------

And here are the full results for a Radeon RX Vega with a batch size of 32:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 8.8 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 9.3 +/- 0.1 (jitter = 0.7)  7.593
20      images/sec: 9.3 +/- 0.1 (jitter = 0.4)  7.696
30      images/sec: 9.3 +/- 0.1 (jitter = 0.5)  7.753
40      images/sec: 9.3 +/- 0.1 (jitter = 0.4)  8.007
50      images/sec: 9.3 +/- 0.1 (jitter = 0.4)  7.520
60      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.990
70      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  8.028
80      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.931
90      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.851
100     images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.797
----------------------------------------------------------------
total images/sec: 9.26
----------------------------------------------------------------

Edit: Clarify package release timelines.

@MatPoliquin
Author

MatPoliquin commented Jun 27, 2020

Just tried the new 1.15.3.dev200626 version; I actually get worse performance on my RX 580.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 3.9 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.593
20      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.520
60      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.988
70      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  8.029
80      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.932
90      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.850
100     images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.799
----------------------------------------------------------------
total images/sec: 4.04
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 10.3 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 10.5 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 10.5 +/- 0.1 (jitter = 0.3) 7.726
30      images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.360
40      images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.527
50      images/sec: 10.7 +/- 0.1 (jitter = 0.3) 8.171
60      images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.999
70      images/sec: 10.7 +/- 0.0 (jitter = 0.4) 7.978
80      images/sec: 10.8 +/- 0.0 (jitter = 0.4) 7.884
90      images/sec: 10.8 +/- 0.0 (jitter = 0.5) 7.924
100     images/sec: 10.9 +/- 0.0 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 10.88
----------------------------------------------------------------

@PatriceVignola
Contributor

PatriceVignola commented Jun 27, 2020

This is interesting. I don't have access to an RX 580 at the moment, but we tried with 3 different AMD cards (Radeon VII, Radeon RX Vega and Radeon RX 5700 XT) and saw a 50% performance increase on average. I have a few questions to help me understand the issue:

  1. Are you running the benchmark on WSL or on Windows?
  2. Do you have other graphics cards on your machine?
  3. If you run by disabling grappler optimizations (python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0), does it get better or worse?

Also, if you don't mind, could you take a trace, upload it somewhere and send us the link?

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --trace_file=trace.json
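For anyone inspecting their own trace before uploading it: the file is in Chrome trace-event format, so the heaviest ops can be tallied with a few lines of Python (a rough sketch; field names are assumed from the Chrome tracing format, not from anything DirectML-specific):

```python
import json
from collections import defaultdict

def top_ops(trace, n=5):
    """Sum wall time per op name from a Chrome-trace-format dict
    (the format tf_cnn_benchmarks writes with --trace_file)."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":  # "complete" events carry a duration
            totals[event.get("name", "?")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]

# Example with a tiny synthetic trace (load a real one with json.load):
trace = {"traceEvents": [
    {"ph": "X", "name": "Conv2D", "dur": 900},
    {"ph": "X", "name": "MatMul", "dur": 300},
    {"ph": "X", "name": "Conv2D", "dur": 100},
]}
print(top_ops(trace))  # [('Conv2D', 1000.0), ('MatMul', 300.0)]
```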

@MatPoliquin
Author

MatPoliquin commented Jun 28, 2020

EDIT: for some reason I did not install the 1.15.3.dev200626 version properly; I reinstalled it and now I get better performance.

  1. Windows
  2. only one GPU
  3. here is the result:
Step    Img/sec total_loss
1       images/sec: 4.9 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 4.8 +/- 0.0 (jitter = 0.1)  7.593
20      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.520
60      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.988
70      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  8.029
80      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.932
90      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.850
100     images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.799
----------------------------------------------------------------
total images/sec: 4.98
----------------------------------------------------------------
  4. Here is the zipped trace.json file:
    trace.zip

Note:
The performance increase is more noticeable with --batch_size=16:

Step    Img/sec total_loss
1       images/sec: 20.2 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.854
20      images/sec: 20.1 +/- 0.1 (jitter = 0.2) 7.726
30      images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.360
40      images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.527
50      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.999
70      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.978
80      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.884
90      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.924
100     images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.847
----------------------------------------------------------------
total images/sec: 20.18
----------------------------------------------------------------

@sunshinejnjn

sunshinejnjn commented Jun 28, 2020

I'm testing this on various devices, including Intel iGPUs, AMD iGPUs (AMD dGPUs on native Linux not tested for now), and Nvidia 9xx/10xx/20xx systems. The iGPUs are the most interesting part.
An AMD Ryzen 4500U with Vega 6 failed to run the benchmark with build 200615. The system froze when running it; it seemed to be a GPU reset, with a beep after a minute or so, and then it reported some errors as output.
With build 200626 the situation is similar: no beep/reset, but it's still unable to run.
System version: Windows 10 2004 (20H1)
Driver version: AMD 27.20.1017.1011 (dated 2020-05-25, the newest AMD GPU driver at present, Adrenalin 2020 Edition 20.5.1)

Another device is a Dell XPS 15 9550, i7-6700HQ with Intel HD 530, running the latest Intel beta driver. It ran this at 1.8 images/sec (on the Intel iGPU). Windows 10 2004 (20H1), TF-DML build 200626.

Later, I'm going to test this on an Intel i5-4000 with iGPU to see if it can run.

@PatriceVignola
Contributor

@MatPoliquin Ah, these numbers make more sense. Thank you for double-checking! Like @adtsai said, we haven't optimized our memory allocator yet, so the performance increase for larger batch sizes is less noticeable and we end up using more memory than necessary, but we're working on improving it.

@sunshinejnjn What are the models of the iGPUs/dGPUs that crashed or froze while running the benchmark?

@oscarbg

oscarbg commented Jun 28, 2020

with 26-6-2020 package (tensorflow_directml-1.15.3.dev200626-cp37-cp37m-win_amd64)

my results on Titan V (451.58):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 95.8 +/- 0.0 (jitter = 0.0) 8.169
10      images/sec: 94.8 +/- 0.4 (jitter = 1.1) 7.593
20      images/sec: 95.1 +/- 0.2 (jitter = 0.9) 7.696
30      images/sec: 94.8 +/- 0.2 (jitter = 1.1) 7.753
40      images/sec: 94.9 +/- 0.2 (jitter = 0.7) 8.006
50      images/sec: 94.7 +/- 0.1 (jitter = 1.0) 7.520
60      images/sec: 94.6 +/- 0.1 (jitter = 0.9) 7.989
70      images/sec: 94.5 +/- 0.1 (jitter = 0.8) 8.028
80      images/sec: 94.5 +/- 0.1 (jitter = 0.8) 7.930
90      images/sec: 94.4 +/- 0.1 (jitter = 0.8) 7.849
100     images/sec: 94.3 +/- 0.1 (jitter = 0.9) 7.795
----------------------------------------------------------------
total images/sec: 94.29
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 80.1 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 80.0 +/- 0.3 (jitter = 0.6) 7.854
20      images/sec: 80.1 +/- 0.1 (jitter = 0.3) 7.726
30      images/sec: 80.0 +/- 0.1 (jitter = 0.3) 7.360
40      images/sec: 80.0 +/- 0.1 (jitter = 0.4) 7.527
50      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 8.171
60      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.999
70      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.978
80      images/sec: 79.9 +/- 0.1 (jitter = 0.5) 7.884
90      images/sec: 79.8 +/- 0.1 (jitter = 0.5) 7.924
100     images/sec: 79.7 +/- 0.1 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 79.72
----------------------------------------------------------------

on vega56 (20.20 branch 27.20.2001.5002) similar to others

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 36.9 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 37.0 +/- 0.0 (jitter = 0.1) 7.854
20      images/sec: 36.9 +/- 0.0 (jitter = 0.1) 7.726
30      images/sec: 36.9 +/- 0.0 (jitter = 0.1) 7.360
40      images/sec: 36.8 +/- 0.0 (jitter = 0.1) 7.526
50      images/sec: 36.8 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 36.7 +/- 0.1 (jitter = 0.2) 7.999
70      images/sec: 36.4 +/- 0.1 (jitter = 0.3) 7.978
80      images/sec: 36.2 +/- 0.1 (jitter = 0.3) 7.884
90      images/sec: 36.0 +/- 0.1 (jitter = 0.4) 7.924
100     images/sec: 35.9 +/- 0.1 (jitter = 0.6) 7.848
----------------------------------------------------------------
total images/sec: 35.91
----------------------------------------------------------------

EDIT:
adding CUDA Titan V scores;
CUDA seems to be roughly a 3x improvement over the current DirectML numbers.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step    Img/sec total_loss
1       images/sec: 285.6 +/- 0.0 (jitter = 0.0)        7.765
10      images/sec: 289.8 +/- 1.4 (jitter = 3.9)        8.049
20      images/sec: 289.9 +/- 0.8 (jitter = 1.9)        7.808
30      images/sec: 289.3 +/- 0.7 (jitter = 3.7)        7.976
40      images/sec: 289.8 +/- 0.6 (jitter = 3.8)        7.591
50      images/sec: 289.8 +/- 0.5 (jitter = 3.7)        7.549
60      images/sec: 289.5 +/- 0.4 (jitter = 3.7)        7.819
70      images/sec: 289.4 +/- 0.4 (jitter = 3.8)        7.821
80      images/sec: 289.5 +/- 0.4 (jitter = 3.8)        7.849
90      images/sec: 289.3 +/- 0.4 (jitter = 3.8)        8.027
100     images/sec: 289.4 +/- 0.4 (jitter = 3.8)        8.030
----------------------------------------------------------------
total images/sec: 289.27
----------------------------------------------------------------



python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 235.5 +/- 0.0 (jitter = 0.0)        8.034
10      images/sec: 237.0 +/- 1.2 (jitter = 5.0)        7.686
20      images/sec: 236.5 +/- 0.9 (jitter = 5.0)        7.657
30      images/sec: 236.7 +/- 0.7 (jitter = 5.0)        8.194
40      images/sec: 237.0 +/- 0.6 (jitter = 5.1)        7.897
50      images/sec: 236.9 +/- 0.5 (jitter = 5.0)        7.999
60      images/sec: 236.9 +/- 0.5 (jitter = 4.9)        7.912
70      images/sec: 236.9 +/- 0.4 (jitter = 4.9)        8.180
80      images/sec: 236.9 +/- 0.4 (jitter = 4.9)        8.351
90      images/sec: 236.8 +/- 0.4 (jitter = 4.9)        8.115
100     images/sec: 237.1 +/- 0.4 (jitter = 5.0)        7.822
----------------------------------------------------------------
total images/sec: 237.04
----------------------------------------------------------------

@sofiageo

sofiageo commented Jul 3, 2020

Adding my AMD 5700 XT results (no overclock) in case it helps verify the expected outcome.

Windows 10 2004 - AMD Drivers 20.5.1-ghs-beta

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 36.3 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 37.6 +/- 0.4 (jitter = 1.4) 7.854
20      images/sec: 38.2 +/- 0.3 (jitter = 2.0) 7.726
30      images/sec: 38.5 +/- 0.2 (jitter = 2.0) 7.360
40      images/sec: 38.4 +/- 0.2 (jitter = 2.0) 7.526
50      images/sec: 38.3 +/- 0.2 (jitter = 2.0) 8.171
60      images/sec: 38.1 +/- 0.2 (jitter = 2.0) 7.999
70      images/sec: 38.0 +/- 0.2 (jitter = 2.0) 7.978
80      images/sec: 37.9 +/- 0.1 (jitter = 2.0) 7.884
90      images/sec: 38.0 +/- 0.1 (jitter = 2.0) 7.924
100     images/sec: 37.9 +/- 0.1 (jitter = 2.0) 7.848
----------------------------------------------------------------
total images/sec: 37.94
----------------------------------------------------------------

and updated drivers 20.7.1

Step    Img/sec total_loss
1       images/sec: 39.4 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 39.3 +/- 0.1 (jitter = 0.1) 7.854
20      images/sec: 38.9 +/- 0.1 (jitter = 0.6) 7.726
30      images/sec: 38.7 +/- 0.1 (jitter = 0.6) 7.360
40      images/sec: 38.6 +/- 0.1 (jitter = 0.6) 7.526
50      images/sec: 38.7 +/- 0.1 (jitter = 0.6) 8.171
60      images/sec: 38.8 +/- 0.1 (jitter = 0.6) 7.999
70      images/sec: 38.8 +/- 0.1 (jitter = 0.5) 7.978
80      images/sec: 38.8 +/- 0.1 (jitter = 0.4) 7.884
90      images/sec: 38.8 +/- 0.0 (jitter = 0.4) 7.924
100     images/sec: 38.8 +/- 0.0 (jitter = 0.4) 7.848
----------------------------------------------------------------
total images/sec: 38.80
----------------------------------------------------------------

@limyz

limyz commented Jul 6, 2020

Windows 10 2004/AMD Driver v27.20.1017.1011/DirectML

  • AMD R9 290
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

1       images/sec: 4.9 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 4.9 +/- 0.0 (jitter = 0.0)  7.593
20      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.520
60      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.990
70      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  8.028
80      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.931
90      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.851
100     images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.797
----------------------------------------------------------------
total images/sec: 4.96
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0 --trace_file=trace.json

Step    Img/sec total_loss
1       images/sec: 5.3 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.593
20      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 5.2 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.519
60      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.989
70      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  8.028
80      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.933
90      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.851
100     images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.795
----------------------------------------------------------------
total images/sec: 5.27
----------------------------------------------------------------

trace_r290.zip

  • AMD Ryzen 4700U with Vega 7 graphics
python .\tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50 --variable_update=parameter_server

Step    Img/sec total_loss
1       images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.993
10      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.854
20      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.726
30      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.360
40      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.526
50      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  8.171
60      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.999
70      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.978
80      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.884
90      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.924
100     images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.848
----------------------------------------------------------------
total images/sec: 5.77
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50 --enable_optimizations=0 --trace_file=trace.json

Step    Img/sec total_loss
1       images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.993
10      images/sec: 6.2 +/- 0.1 (jitter = 0.0)  7.854
20      images/sec: 6.2 +/- 0.0 (jitter = 0.0)  7.726
30      images/sec: 6.2 +/- 0.0 (jitter = 0.0)  7.360
40      images/sec: 6.2 +/- 0.0 (jitter = 0.0)  7.527
50      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  8.171
60      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.999
70      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.978
80      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.884
90      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.925
100     images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.848
----------------------------------------------------------------
total images/sec: 6.27
----------------------------------------------------------------

trace_vega7.zip

@jstoecker jstoecker transferred this issue from microsoft/DirectML Sep 17, 2020
@oscarbg

oscarbg commented Sep 20, 2020

Hi,
performance is worse with the new update (tensorflow-directml 1.15.3.dev200911).
For example, on a Titan V (460.15):
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

now I get:

Step    Img/sec total_loss
1       images/sec: 55.7 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 56.2 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 56.3 +/- 0.2 (jitter = 0.4) 7.726
30      images/sec: 56.2 +/- 0.1 (jitter = 0.4) 7.360
40      images/sec: 56.1 +/- 0.1 (jitter = 0.5) 7.527
50      images/sec: 56.1 +/- 0.1 (jitter = 0.5) 8.171
60      images/sec: 55.5 +/- 0.3 (jitter = 0.5) 7.999
70      images/sec: 55.5 +/- 0.2 (jitter = 0.7) 7.978
80      images/sec: 55.5 +/- 0.2 (jitter = 0.7) 7.884
90      images/sec: 55.4 +/- 0.2 (jitter = 0.8) 7.924
100     images/sec: 55.4 +/- 0.2 (jitter = 0.8) 7.847
----------------------------------------------------------------
total images/sec: 55.39
----------------------------------------------------------------

In June I was getting:

Step    Img/sec total_loss
1       images/sec: 80.1 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 80.0 +/- 0.3 (jitter = 0.6) 7.854
20      images/sec: 80.1 +/- 0.1 (jitter = 0.3) 7.726
30      images/sec: 80.0 +/- 0.1 (jitter = 0.3) 7.360
40      images/sec: 80.0 +/- 0.1 (jitter = 0.4) 7.527
50      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 8.171
60      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.999
70      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.978
80      images/sec: 79.9 +/- 0.1 (jitter = 0.5) 7.884
90      images/sec: 79.8 +/- 0.1 (jitter = 0.5) 7.924
100     images/sec: 79.7 +/- 0.1 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 79.72
----------------------------------------------------------------

I get lots of messages like the one posted below in the console output.
This seems to point to the performance issue, as it now appears to be doing some of the work on the CPU instead of DML:

2020-09-20 21:29:44.368137: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:DML:0
  /job:localhost/replica:0/task:0/device:DML:1
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[DML, CPU] possible_devices_=[]
Assign: DML CPU
Const: DML CPU
VariableV2: DML CPU
Identity: DML CPU
ApplyGradientDescent: DML CPU
IsVariableInitialized: DML CPU

I don't think I was getting this in the June testing.
Full log of the run:
fullogdmlcpu.txt
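Since the full log is long, the warnings and the op types they mention can be tallied with a short script before digging in (a throwaway helper, assuming the log format quoted above):

```python
import re

def colocation_summary(log_text):
    """Count 'Failed to place the graph' warnings in a TF log and list the
    op types named in colocation debug groups (lines like 'Assign: DML CPU')."""
    warnings = log_text.count("Failed to place the graph")
    op_types = re.findall(r"^(\w+): DML CPU$", log_text, flags=re.MULTILINE)
    return warnings, op_types

sample = """\
W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources.
Assign: DML CPU
Const: DML CPU
VariableV2: DML CPU
"""
print(colocation_summary(sample))  # (1, ['Assign', 'Const', 'VariableV2'])
```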

@PatriceVignola
Contributor

@oscarbg Were you not getting these logs with the previous package? I think these logs are expected when running tf_cnn_benchmarks, since it doesn't know anything about DML, but it should still try to fall back to DML instead of the CPU when possible. We'll investigate the performance regression.

@metemadi

metemadi commented Nov 3, 2020

Wanted to add my results! First off, this is some amazing work; it really helps increase the accessibility of ML tools! I am running a Zephyrus G14, which has two GPUs: the integrated Radeon and a 2060 Max-Q. Running in Windows for now (will try WSL 2 soon as well). I can confirm that I can get bigger batch sizes using the Radeon (which has access to 40 GB of installed system RAM; yes, 8 GB soldered plus a 32 GB stick) than I can with the Max-Q (6 GB VRAM only). This really opens up a lot of possibilities... but it's all pretty slow compared to the CPU (the 4900HS).

First, the CPU:

python tf_cnn_benchmarks.py --batch_size=16 --model=resnet50 --enable_optimizations=0 --device='cpu' --data_format='NHWC' --num_batches=30
Step    Img/sec total_loss
1       images/sec: 3.8 +/- 0.0 (jitter = 0.0)  7.780
10      images/sec: 3.7 +/- 0.0 (jitter = 0.0)  7.877
20      images/sec: 3.7 +/- 0.0 (jitter = 0.1)  7.744
30      images/sec: 3.7 +/- 0.0 (jitter = 0.1)  7.672
----------------------------------------------------------------
total images/sec: 3.69
----------------------------------------------------------------
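As a rough sanity check on these numbers: images/sec in tf_cnn_benchmarks is throughput, so batch size divided by images/sec gives the wall-clock time per training step. A minimal sketch (the helper name is mine, not part of the benchmark):

```python
def step_seconds(batch_size: int, images_per_sec: float) -> float:
    """Invert reported throughput to get seconds per training step."""
    return batch_size / images_per_sec

# CPU run above: batch size 16 at 3.69 images/sec
print(round(step_seconds(16, 3.69), 2))  # ≈ 4.34 seconds per step
```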

Next, the Radeon (I had to disable the Max-Q in Device Manager, since I couldn't find an easy way to make the gigantic TensorFlow benchmark pick /dml:1, which is the Radeon; /dml:0 is the Max-Q):

python tf_cnn_benchmarks.py --batch_size=16 --model=resnet50 --enable_optimizations=0 --data_format='NHWC' --num_batches=30
Step    Img/sec total_loss
1       images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.780
10      images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.877
20      images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.744
30      images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.672
----------------------------------------------------------------
total images/sec: 3.20
----------------------------------------------------------------
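A possible alternative to disabling the Max-Q in Device Manager: tensorflow-directml is supposed to honor a `DML_VISIBLE_DEVICES` environment variable that filters which adapters are exposed, analogous to `CUDA_VISIBLE_DEVICES`. I haven't verified this on the G14, so treat it as an assumption (bash/WSL syntax shown; on cmd use `set` instead of `export`):

```shell
# Assumption: tensorflow-directml honors DML_VISIBLE_DEVICES. Exposing only
# adapter 1 should make the Radeon show up as /dml:0 without touching
# Device Manager.
export DML_VISIBLE_DEVICES=1
echo "$DML_VISIBLE_DEVICES"
# then run the benchmark as usual, e.g.:
# python tf_cnn_benchmarks.py --batch_size=16 --model=resnet50 --enable_optimizations=0 --data_format='NHWC' --num_batches=30
```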

Sadly, it's slower than the CPU, and since it has access to the same amount of memory, there's no difference in maximum batch size either. Last but not least, the NVIDIA 2060 Max-Q:

#same exact command as above
Done warm up
Step    Img/sec total_loss
1       images/sec: 17.6 +/- 0.0 (jitter = 0.0) 7.780
10      images/sec: 17.6 +/- 0.0 (jitter = 0.1) 7.877
20      images/sec: 17.5 +/- 0.0 (jitter = 0.1) 7.744
30      images/sec: 17.5 +/- 0.0 (jitter = 0.1) 7.672
----------------------------------------------------------------
total images/sec: 17.51
----------------------------------------------------------------

I expect CUDA performance under WSL2 to be a lot better, but that's not the point here! All in all, as performance improves this will really change the game for ML beginners and pros alike! I'm excited to be able to access the entire system memory for GPU computation, AC922-style. Thanks again @microsoft for this important work.
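Putting the three runs side by side, a quick sketch of the relative throughput (figures copied from the outputs above; the variable names are mine):

```python
# total images/sec from the three runs above, all on the same machine
cpu_4900hs = 3.69   # CPU run
radeon_dml = 3.20   # integrated Radeon via DirectML
maxq_dml   = 17.51  # 2060 Max-Q via DirectML

print(round(maxq_dml / cpu_4900hs, 1))   # Max-Q vs CPU: ~4.7x faster
print(round(radeon_dml / cpu_4900hs, 2)) # Radeon vs CPU: ~0.87x, i.e. slower
```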

@PatriceVignola
Contributor

@metemadi Thank you for the additional data points - your setup looks awesome for ML!

We're currently focused on improving stability and coverage, but the next step is obviously to be way more competitive with CUDA. So far we've been focused on coverage from the ai-benchmark models, but the TF benchmarks repo is something we're starting to look into.

@oscarbg

oscarbg commented May 14, 2021

Just pointing out that the new 2021.04 release is much faster.
On a Titan V it's nearly 2x faster than last year's (June) release.
With that, CUDA is now only about 50% faster.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50


Step    Img/sec total_loss
1       images/sec: 162.1 +/- 0.0 (jitter = 0.0)        8.169
10      images/sec: 177.8 +/- 23.1 (jitter = 2.0)       7.593
20      images/sec: 177.5 +/- 11.7 (jitter = 2.3)       7.696
30      images/sec: 177.2 +/- 11.3 (jitter = 1.7)       7.753
40      images/sec: 177.1 +/- 9.0 (jitter = 1.5)        8.007
50      images/sec: 176.7 +/- 7.2 (jitter = 1.4)        7.520
60      images/sec: 176.7 +/- 6.0 (jitter = 1.3)        7.988
70      images/sec: 176.3 +/- 5.2 (jitter = 1.6)        8.027
80      images/sec: 176.4 +/- 4.6 (jitter = 1.5)        7.931
90      images/sec: 176.3 +/- 4.1 (jitter = 1.6)        7.851
100     images/sec: 176.2 +/- 3.7 (jitter = 1.9)        7.794
----------------------------------------------------------------
total images/sec: 176.09
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 166.5 +/- 0.0 (jitter = 0.0)        7.993
10      images/sec: 161.9 +/- 1.3 (jitter = 3.3)        7.854
20      images/sec: 161.6 +/- 1.1 (jitter = 3.0)        7.726
30      images/sec: 161.7 +/- 6.9 (jitter = 2.1)        7.360
40      images/sec: 161.6 +/- 5.2 (jitter = 2.9)        7.527
50      images/sec: 161.5 +/- 4.2 (jitter = 2.8)        8.171
60      images/sec: 161.4 +/- 3.5 (jitter = 2.4)        7.999
70      images/sec: 161.4 +/- 3.0 (jitter = 2.5)        7.978
80      images/sec: 161.3 +/- 2.7 (jitter = 2.9)        7.883
90      images/sec: 161.3 +/- 3.2 (jitter = 2.9)        7.924
100     images/sec: 161.1 +/- 2.9 (jitter = 2.9)        7.847
----------------------------------------------------------------
total images/sec: 161.00
----------------------------------------------------------------
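Taking the batch-32 total at face value, the "CUDA is only 50% faster" remark implies a CUDA baseline of roughly 264 images/sec on the same card. This is my arithmetic from the figures above, not a measured number:

```python
dml_total = 176.09              # Titan V, batch 32, 2021.04 tensorflow-directml
implied_cuda = dml_total * 1.5  # "CUDA is only 50% faster"
print(round(implied_cuda, 1))   # ≈ 264.1 images/sec (inferred, not measured)
```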

Maybe this issue can be closed?

Thanks.

@oscarbg

oscarbg commented May 14, 2021

I can confirm that the RX Vega also gets a 2x speedup compared with my earlier results:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 76.2 +/- 0.0 (jitter = 0.0) 8.169
10      images/sec: 76.0 +/- 0.2 (jitter = 0.4) 7.593
20      images/sec: 75.9 +/- 0.1 (jitter = 0.4) 7.696
30      images/sec: 75.8 +/- 0.1 (jitter = 0.5) 7.753
40      images/sec: 75.7 +/- 0.1 (jitter = 0.5) 8.007
50      images/sec: 75.6 +/- 0.1 (jitter = 0.6) 7.520
60      images/sec: 75.5 +/- 0.1 (jitter = 0.5) 7.988
70      images/sec: 75.5 +/- 0.1 (jitter = 0.5) 8.027
80      images/sec: 75.4 +/- 0.1 (jitter = 0.5) 7.932
90      images/sec: 75.3 +/- 0.1 (jitter = 0.7) 7.850
100     images/sec: 75.1 +/- 0.1 (jitter = 0.8) 7.797
----------------------------------------------------------------
total images/sec: 75.11
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 69.7 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 70.4 +/- 0.2 (jitter = 0.3) 7.854
20      images/sec: 70.3 +/- 0.1 (jitter = 0.5) 7.726
30      images/sec: 68.1 +/- 2.2 (jitter = 0.5) 7.360
40      images/sec: 68.4 +/- 1.9 (jitter = 0.6) 7.527
50      images/sec: 68.7 +/- 1.5 (jitter = 0.6) 8.171
60      images/sec: 68.9 +/- 1.3 (jitter = 0.5) 7.999
70      images/sec: 69.0 +/- 1.1 (jitter = 0.5) 7.978
80      images/sec: 69.0 +/- 1.0 (jitter = 0.5) 7.884
90      images/sec: 69.1 +/- 0.9 (jitter = 0.5) 7.924
100     images/sec: 69.1 +/- 0.8 (jitter = 0.5) 7.848
----------------------------------------------------------------
total images/sec: 69.12
----------------------------------------------------------------
