CUDA support #57

Open
dllu opened this issue Feb 7, 2021 · 4 comments

dllu commented Feb 7, 2021

I am interested in getting onnxruntime-rs running with CUDA-based inference. (I'm also interested in getting AMD MIGraphX inference working, but that's a whole other can of worms.)

Anyway, in onnxruntime-rs/onnxruntime-sys/examples/c_api_sample.rs there is:

c_api_sample.rs:52:    // E.g. for CUDA include cuda_provider_factory.h and uncomment the following line:
c_api_sample.rs:53:    // OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, 0);

But uncommenting the line doesn't work, since the symbols OrtSessionOptionsAppendExecutionProvider_CUDA and sessionOptions are not available.

Also, CUDA doesn't generally seem to be working: inference still runs on the CPU even though I compiled with

ORT_LIB_LOCATION=/usr/local/
ORT_STRATEGY=system
ORT_USE_CUDA=1

with onnxruntime compiled with ./build.sh --use_cuda --cudnn_home /usr/ --cuda_home /opt/cuda/ --config RelWithDebInfo --parallel --build_shared_lib and installed in /usr/local.


dllu commented Feb 7, 2021

Hmmm, I was able to compile it by adding cuda_provider_factory.h to wrapper.h and changing

-    // OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, 0);
+    unsafe {
+        OrtSessionOptionsAppendExecutionProvider_CUDA(session_options_ptr, 0);
+    }
onnxruntime-sys » cargo run --release --example c_api_sample
    Finished release [optimized] target(s) in 0.03s
     Running `/home/dllu/builds/onnxruntime-rs/target/release/examples/c_api_sample`
Using Onnxruntime C API
2021-02-07 14:22:25.905008087 [I:onnxruntime:, inference_session.cc:225 operator()] Flush-to-zero and denormal-as-zero are off
2021-02-07 14:22:25.905026622 [I:onnxruntime:, inference_session.cc:232 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2021-02-07 14:22:26.065756648 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065778619 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.065789219 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065794719 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.065800780 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for CUDA_CPU with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065805940 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.065829023 [I:onnxruntime:, inference_session.cc:1083 Initialize] Initializing session.
2021-02-07 14:22:26.065835946 [I:onnxruntime:, inference_session.cc:1108 Initialize] Adding default CPU execution provider.
2021-02-07 14:22:26.065841687 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065847277 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.068332739 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2021-02-07 14:22:26.068805883 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2021-02-07 14:22:26.069032486 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2021-02-07 14:22:26.069129407 [V:onnxruntime:, inference_session.cc:877 TransformGraph] Node placements
2021-02-07 14:22:26.069134326 [V:onnxruntime:, inference_session.cc:879 TransformGraph] All nodes have been placed on [CUDAExecutionProvider].
2021-02-07 14:22:26.069315986 [V:onnxruntime:, session_state.cc:76 CreateGraphInfo] SaveMLValueNameIndexMapping
2021-02-07 14:22:26.069354908 [V:onnxruntime:, session_state.cc:122 CreateGraphInfo] Done saving OrtValue mappings.
2021-02-07 14:22:26.398456273 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.398479366 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.398711981 [I:onnxruntime:, session_state_utils.cc:100 SaveInitializedTensors] Saving initialized tensors.
2021-02-07 14:22:26.398872991 [I:onnxruntime:, session_state_utils.cc:170 SaveInitializedTensors] [Memory] SessionStateInitializer statically allocates 4942848 bytes for Cuda

2021-02-07 14:22:26.401419217 [I:onnxruntime:, session_state_utils.cc:212 SaveInitializedTensors] Done saving initialized tensors
2021-02-07 14:22:26.401758791 [I:onnxruntime:, inference_session.cc:1258 Initialize] Session successfully initialized.
Number of inputs = 1
Input 0 : name=data_0
Input 0 : type=1
Input 0 : num_dims=4
Input 0 : dim 0=1
Input 0 : dim 1=3
Input 0 : dim 2=224
Input 0 : dim 3=224
2021-02-07 14:22:26.401885488 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:11 rounded_bytes:602112
2021-02-07 14:22:26.401899093 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 1048576 bytes.
2021-02-07 14:22:26.401904694 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 5991424
2021-02-07 14:22:26.401912919 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc264e46c00 to 0x7fc264f46c00
2021-02-07 14:22:26.401968072 [I:onnxruntime:, sequential_executor.cc:157 Execute] Begin execution
2021-02-07 14:22:26.401978962 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:13 rounded_bytes:3154176
2021-02-07 14:22:26.402159369 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 4194304 bytes.
2021-02-07 14:22:26.402164990 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 10185728
2021-02-07 14:22:26.402171021 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc265c00000 to 0x7fc266000000
2021-02-07 14:22:26.425433288 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:17 rounded_bytes:33554432
2021-02-07 14:22:26.425716587 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 33554432 bytes.
2021-02-07 14:22:26.425722548 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 43740160
2021-02-07 14:22:26.425726035 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc21a000000 to 0x7fc21c000000
2021-02-07 14:22:26.546222827 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:17 rounded_bytes:33554432
2021-02-07 14:22:26.546430695 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 33554432 bytes.
2021-02-07 14:22:26.546439421 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 77294592
2021-02-07 14:22:26.546443970 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc218000000 to 0x7fc21a000000
2021-02-07 14:22:26.563948623 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for CudaPinned. bin_num:0 rounded_bytes:256
2021-02-07 14:22:26.564005620 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 1048576 bytes.
2021-02-07 14:22:26.564011200 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 1048576
2021-02-07 14:22:26.564017101 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc265a00200 to 0x7fc265b00200
2021-02-07 14:22:26.580391023 [I:onnxruntime:, sequential_executor.cc:475 Execute] [Memory] ExecutionFrame dynamically allocates 10003456 bytes for Cuda

2021-02-07 14:22:26.580407113 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cpu. bin_num:4 rounded_bytes:4096
2021-02-07 14:22:26.580417713 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 1048576 bytes.
2021-02-07 14:22:26.580422953 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 1048576
2021-02-07 14:22:26.580428403 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x556cf8b023c0 to 0x556cf8c023c0
Score for class [0] =  0.000045440655
Score for class [1] =  0.0038458658
Score for class [2] =  0.00012494654
Score for class [3] =  0.0011804515
Score for class [4] =  0.0013169352
Done!


dllu commented Feb 7, 2021

Oh yeah, actually I was able to get CUDA-based inference working with just

diff --git a/onnxruntime-sys/wrapper.h b/onnxruntime-sys/wrapper.h
index e63d352..c7c0cde 100644
--- a/onnxruntime-sys/wrapper.h
+++ b/onnxruntime-sys/wrapper.h
@@ -1 +1,2 @@
-#include "onnxruntime_c_api.h"
+#include "onnxruntime/core/providers/cuda/cuda_provider_factory.h"
+#include "onnxruntime/core/session/onnxruntime_c_api.h"
diff --git a/onnxruntime/src/session.rs b/onnxruntime/src/session.rs
index c3b6e88..53d2b0b 100644
--- a/onnxruntime/src/session.rs
+++ b/onnxruntime/src/session.rs
@@ -125,6 +125,14 @@ impl<'a> SessionBuilder<'a> {
         Ok(self)
     }

+    /// Use CUDA
+    pub fn use_cuda(self) -> Result<SessionBuilder<'a>> {
+        unsafe {
+            sys::OrtSessionOptionsAppendExecutionProvider_CUDA(self.session_options_ptr, 0);
+        }
+        Ok(self)
+    }
+
     /// Set the session's allocator
     ///
     /// Defaults to [`AllocatorType::Arena`](../enum.AllocatorType.html#variant.Arena)

and then regenerating the bindings and building with the ORT_USE_CUDA environment variable and so on. On my machine, a Titan Xp with CUDA is about 8 to 10 times faster than the CPU (an AMD Ryzen 9 3900X).

I'm not sure how to make these changes work for people who don't use CUDA, though. Maybe it needs some kind of cfg thing; a sketch follows.
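One way this could look (purely a sketch; the `cuda` feature name and its wiring are assumptions, not something the crate has today):

# onnxruntime/Cargo.toml: hypothetical feature, forwarded to the sys crate
[features]
cuda = ["onnxruntime-sys/cuda"]

with the new method compiled in only when the feature is enabled:

    /// Use CUDA (gated behind the assumed `cuda` feature; without the gate,
    /// the symbol would fail to link against a CPU-only ONNX Runtime build)
    #[cfg(feature = "cuda")]
    pub fn use_cuda(self) -> Result<SessionBuilder<'a>> {
        unsafe {
            sys::OrtSessionOptionsAppendExecutionProvider_CUDA(self.session_options_ptr, 0);
        }
        Ok(self)
    }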

hooray

dllu changed the title from "Uncommenting suggested line in c_api_sample.rs doesn't work" to "CUDA support" on Feb 8, 2021
@nbigaouette (Owner)

Great! I'm glad you were able to make it work.

As you found out, uncommenting the lines in the example will not work; it's a copy-paste from the original C example that I left in when I ported it. See https://github.com/microsoft/onnxruntime/blob/v1.4.0/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp#L41-L43

From your last patch it seems there is no need for a cfg, since the API exists in the runtime.

I'm not sure about how the function is being called, though. If you look at how the other functions are called, there is a difference: in with_number_threads(), for example, the function is accessed through g_ort() and the returned status is checked for errors.
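For reference, that pattern looks roughly like this (sketched from memory; helper names such as status_to_result and the OrtError::SessionOptions variant may not match the code exactly):

    pub fn with_number_threads(self, num_threads: i16) -> Result<SessionBuilder<'a>> {
        // Call through the OrtApi struct returned by g_ort()...
        let status = unsafe {
            g_ort().SetIntraOpNumThreads.unwrap()(self.session_options_ptr, num_threads as i32)
        };
        // ...and turn the returned *OrtStatus into a Result instead of ignoring it.
        status_to_result(status).map_err(OrtError::SessionOptions)?;
        Ok(self)
    }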

I don't have access to an NVIDIA system for now, so it's hard for me to test this...


dllu commented Feb 9, 2021

Unlike SetIntraOpNumThreads, the function OrtSessionOptionsAppendExecutionProvider_CUDA exists in the global scope, so you don't need to call it through g_ort(). However, we should probably check its return value for errors.

Relevant documentation from onnxruntime_c_api.h:

  /**
    * To use additional providers, you must build ORT with the extra providers enabled. Then call one of these
    * functions to enable them in the session:
    *   OrtSessionOptionsAppendExecutionProvider_CPU
    *   OrtSessionOptionsAppendExecutionProvider_CUDA
    *   OrtSessionOptionsAppendExecutionProvider_<remaining providers...>
    * The order they are called indicates the preference order as well. In other words call this method
    * on your most preferred execution provider first followed by the less preferred ones.
    * If none are called Ort will use its internal CPU execution provider.
    */

The cfg is still needed because the build may fail to link if ONNX Runtime wasn't compiled with the CUDA execution provider.
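Putting the two together, use_cuda() might end up looking something like this (same caveats as the earlier sketch: the `cuda` feature and the status_to_result / OrtError::SessionOptions helpers are assumed names):

    /// Enable the CUDA execution provider on device 0.
    #[cfg(feature = "cuda")]
    pub fn use_cuda(self) -> Result<SessionBuilder<'a>> {
        // Free function exported by the shared library, not a member of the
        // OrtApi struct, so it is called directly rather than through g_ort().
        let status = unsafe {
            sys::OrtSessionOptionsAppendExecutionProvider_CUDA(self.session_options_ptr, 0)
        };
        // A non-null *OrtStatus means the call failed.
        status_to_result(status).map_err(OrtError::SessionOptions)?;
        Ok(self)
    }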
