This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-107] Add Fused Vanilla RNN and dropout for CPU #11399

Merged: 1 commit merged into apache:master on Jun 26, 2018

Conversation

lihaofd
Contributor

@lihaofd lihaofd commented Jun 26, 2018

Description

This PR adds a fused vanilla RNN (tanh/relu) operator and dropout support for GRU/LSTM/vRNN on CPU.
@pengzhao-intel, @TaoLv

Feature changes

New features

  • Single-layer/multi-layer and unidirectional/bidirectional vanilla RNN (tanh/relu), including both forward and backward computation.
  • Dropout support for GRU/LSTM/vRNN.

Unit-test changes

  • Add new test cases in tests/python/unittests/test_operator.py.
  • Update test cases in example/rnn/bucketing/cudnn_rnn_bucketing.py.
  • Check consistency with the original RNNCell implementation.

Performance

We tested the performance of FusedRNN against the non-fused RNNCell on a local Skylake-8180 (2 sockets, 56 cores), using MKL as the BLAS library.
The test input sizes come from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800).

Layer=1 bidirectional = False

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Tanh, CPU) | 492.61 | 198.02 |
| this PR - FusedRNN (Tanh, CPU) | 952.38 | 318.98 |
| speedup | 1.93x | 1.61x |

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Relu, CPU) | 277.78 | 104.17 |
| this PR - FusedRNN (Relu, CPU) | 740.74 | 177.00 |
| speedup | 2.67x | 1.70x |

Layer=5 bidirectional = True

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Tanh, CPU) | 38.91 | 22.73 |
| rnn.RNNCell (Tanh, cuda) | 47.85 | 26.95 |
| rnn.RNNCell (Tanh, cudnn) | 208.33 | 81.63 |
| this PR - FusedRNN (Tanh, CPU) | 104.17 | 34.01 |
| speedup: this PR vs. RNNCell (Tanh, CPU) | 267.7% | 149.7% |
| speedup: this PR vs. RNNCell (Tanh, cuda) | 217.7% | 126.2% |
| speedup: this PR vs. RNNCell (Tanh, cudnn) | 50% | 41.7% |

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Relu, CPU) | 40.73 | 22.6 |
| rnn.RNNCell (Relu, cuda) | 52.91 | 26.81 |
| rnn.RNNCell (Relu, cudnn) | 206.83 | 82.64 |
| this PR - FusedRNN (Relu, CPU) | 134.23 | 35.97 |
| speedup: this PR vs. RNNCell (Relu, CPU) | 329.5% | 159.2% |
| speedup: this PR vs. RNNCell (Relu, cuda) | 253.7% | 134.2% |
| speedup: this PR vs. RNNCell (Relu, cudnn) | 64.9% | 43.5% |

Convergence Curves

We tested the convergence of FusedGRU/LSTM (dropout = 0.5) on a CPU (Skylake-8180, 2 sockets, 56 cores) and a GPU (P100) using example/rnn/bucketing/cudnn_rnn_bucketing.py.
Test configuration: layers = 3, batch_size = 32, num-embed = 800, num-hidden = 800, num-epochs = 20.
(Figures: convergence curves for GRU with dropout and LSTM with dropout.)

@szha: resolves #10870, #10872

@lihaofd lihaofd requested a review from szha as a code owner June 26, 2018 00:41
@szha szha self-assigned this Jun 26, 2018
@TaoLv
Member

TaoLv commented Jun 26, 2018

Please remove [WIP] from the title and add the JIRA number to it. https://issues.apache.org/jira/browse/MXNET-107

@lihaofd lihaofd changed the title [WIP] Add Fused Vanilla RNN and dropout [MXNET-107] Add Fused Vanilla RNN and dropout Jun 26, 2018
@lihaofd lihaofd changed the title [MXNET-107] Add Fused Vanilla RNN and dropout [MXNET-107] Add Fused Vanilla RNN and dropout for CPU Jun 26, 2018
@piiswrong piiswrong merged commit 0538ad9 into apache:master Jun 26, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
Development

Successfully merging this pull request may close these issues.

RNN operator should support rnn_tanh and rnn_relu mode on CPU
4 participants