This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-107] Add Fused Vanilla RNN and dropout for CPU #11399

Merged: 1 commit merged into apache:master on Jun 26, 2018

Conversation

lihaofd
Contributor

@lihaofd lihaofd commented Jun 26, 2018

Description

This PR adds a fused vanilla RNN (tanh/relu) operator and dropout support for GRU/LSTM/vRNN on CPU.
@pengzhao-intel, @TaoLv

Feature changes

New features

  • Single-layer/multi-layer and unidirectional/bidirectional vanilla RNN (tanh/relu), including both forward and backward computation.
  • Dropout support for GRU/LSTM/vRNN.

Unit-test changes

  • Add new test cases in tests/python/unittests/test_operator.py.
  • Update test cases in example/rnn/bucketing/cudnn_rnn_bucketing.py.
  • Check consistency with the original RNNCell implementation.

Performance

We tested the performance of FusedRNN against the non-fused RNNCell on a local Skylake-8180 (2 sockets, 56 cores), using MKL as the BLAS library.
The test input sizes come from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800).

Layer=1 bidirectional = False

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Tanh, CPU) | 492.61 | 198.02 |
| this PR - FusedRNN (Tanh, CPU) | 952.38 | 318.98 |
| speedup | 1.93x | 1.61x |

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Relu, CPU) | 277.78 | 104.17 |
| this PR - FusedRNN (Relu, CPU) | 740.74 | 177.00 |
| speedup | 2.67x | 1.70x |

Layer=5 bidirectional = True

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Tanh, CPU) | 38.91 | 22.73 |
| rnn.RNNCell (Tanh, cuda) | 47.85 | 26.95 |
| rnn.RNNCell (Tanh, cudnn) | 208.33 | 81.63 |
| this PR - FusedRNN (Tanh, CPU) | 104.17 | 34.01 |
| speedup: this PR vs. RNNCell (Tanh, CPU) | 267.7% | 149.7% |
| speedup: this PR vs. RNNCell (Tanh, cuda) | 217.7% | 126.2% |
| speedup: this PR vs. RNNCell (Tanh, cudnn) | 50% | 41.7% |

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell (non-fused, Relu, CPU) | 40.73 | 22.6 |
| rnn.RNNCell (Relu, cuda) | 52.91 | 26.81 |
| rnn.RNNCell (Relu, cudnn) | 206.83 | 82.64 |
| this PR - FusedRNN (Relu, CPU) | 134.23 | 35.97 |
| speedup: this PR vs. RNNCell (Relu, CPU) | 329.5% | 159.2% |
| speedup: this PR vs. RNNCell (Relu, cuda) | 253.7% | 134.2% |
| speedup: this PR vs. RNNCell (Relu, cudnn) | 64.9% | 43.5% |

Convergence Curves

We tested the convergence of FusedGRU/LSTM (dropout = 0.5) on a CPU (Skylake-8180, 2 sockets, 56 cores) and a GPU (P100) using example/rnn/bucketing/cudnn_rnn_bucketing.py.
Test configuration: layers = 3, batch_size = 32, num-embed = 800, num-hidden = 800, num-epochs = 20.
(Figures: convergence curves for GRU with dropout and LSTM with dropout.)

@szha: resolves #10870, #10872

@lihaofd lihaofd requested a review from szha as a code owner June 26, 2018 00:41
@szha szha self-assigned this Jun 26, 2018
@TaoLv
Member

TaoLv commented Jun 26, 2018

Please remove [WIP] from the title and add the JIRA number to it. https://issues.apache.org/jira/browse/MXNET-107

@lihaofd lihaofd changed the title [WIP] Add Fused Vanilla RNN and dropout [MXNET-107] Add Fused Vanilla RNN and dropout Jun 26, 2018
@lihaofd lihaofd changed the title [MXNET-107] Add Fused Vanilla RNN and dropout [MXNET-107] Add Fused Vanilla RNN and dropout for CPU Jun 26, 2018
@piiswrong piiswrong merged commit 0538ad9 into apache:master Jun 26, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
Development

Successfully merging this pull request may close these issues.

RNN operator should support rnn_tanh and rnn_relu mode on CPU
4 participants