Skip to content

Onnx kill switch #779

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 35 commits into from
Jun 16, 2021
Merged

Onnx kill switch #779

merged 35 commits into from
Jun 16, 2021

Conversation

alonre24
Copy link
Collaborator

@alonre24 alonre24 commented Jun 6, 2021

This PR introduces the "ONNX kill switch" - a mechanism for terminating onnx sessions whose runtime exceeds a certain threshold. It is done by registering a callback to Redis cron event loop (runs by default every 100 ms) upon loading onnx backend. This callback will use the "kill" API that ONNX has for every session that its runtime exceeds the timeout. Note that:

  • The ONNX_TIMEOUT is a load time config, default is 5000 ms.
  • We use a global structure that saves the onnx run session in an array, where every background thread always updates a designated entry within the array (to minimize locking)
  • Whenever a new model is stored in RedisAI and it is set to run on a new device, the global onnx sessions array is extended (as the number of working threads in the system grows).
  • This PR includes refactoring of the RunQueueInfo API
  • This PR includes refactoring of the load time config logic (so we will be more strict when we get an error)
  • This PR includes refactoring of the backend loading (mainly cosmetic changes)

@alonre24 alonre24 requested review from lantiga and DvirDukhan June 6, 2021 15:14
@codecov
Copy link

codecov bot commented Jun 6, 2021

Codecov Report

Merging #779 (285b5be) into master (dfc0963) will increase coverage by 0.77%.
The diff coverage is 73.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #779      +/-   ##
==========================================
+ Coverage   79.22%   79.99%   +0.77%     
==========================================
  Files          50       52       +2     
  Lines        7662     7767     +105     
==========================================
+ Hits         6070     6213     +143     
+ Misses       1592     1554      -38     
Impacted Files Coverage Δ
src/backends/libtflite_c/tflite_c.cpp 57.60% <0.00%> (-2.50%) ⬇️
src/backends/libtorch_c/torch_c.cpp 62.26% <0.00%> (-1.18%) ⬇️
src/backends/tensorflow.c 69.04% <0.00%> (-1.55%) ⬇️
src/backends/tflite.c 66.01% <0.00%> (ø)
src/backends/torch.c 81.71% <0.00%> (ø)
src/execution/utils.c 67.85% <ø> (ø)
src/serialization/AOF/rai_aof_rewrite.c 0.00% <0.00%> (ø)
src/backends/backends.c 75.36% <55.96%> (+12.20%) ⬆️
src/execution/run_queue_info.c 65.90% <65.90%> (ø)
src/execution/parsing/deprecated.c 81.86% <71.42%> (+1.19%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update debcf77...285b5be. Read the comment docs.

Comment on lines +29 to +36
if (strcasecmp((key), "TF") == 0) {
ret = RAI_LoadBackend(ctx, RAI_BACKEND_TENSORFLOW, (val));
} else if (strcasecmp((key), "TFLITE") == 0) {
ret = RAI_LoadBackend(ctx, RAI_BACKEND_TFLITE, (val));
} else if (strcasecmp((key), "TORCH") == 0) {
ret = RAI_LoadBackend(ctx, RAI_BACKEND_TORCH, (val));
} else if (strcasecmp((key), "ONNX") == 0) {
ret = RAI_LoadBackend(ctx, RAI_BACKEND_ONNXRUNTIME, (val));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this loading section here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called from ai.config LOADBACKEND <backend> <backend_path> command.

alonre24 added 2 commits June 10, 2021 17:36
…om backend len (supported only for onnx now) in INFO MODULES command.
alonre24 added 2 commits June 13, 2021 16:22
- Add a state flag to every entry in the onnx run sessions array and update it atomically, to avoid situations where main threads and bg thread both access the runOptions field.
- Refactor the info_modules section, and change AI.INFO command so it must receive a module/script key. The other info will be accessible as part of the info modules command.
…eated from start, number of threads in this case should be larger.
@alonre24 alonre24 added ci-test and removed ci-test labels Jun 13, 2021
@alonre24 alonre24 added ci-test and removed ci-test labels Jun 13, 2021
@alonre24 alonre24 added ci-test and removed ci-test labels Jun 15, 2021
Copy link
Collaborator

@DvirDukhan DvirDukhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good job!

@alonre24 alonre24 merged commit 1fdc22b into master Jun 16, 2021
@alonre24 alonre24 deleted the ONNX_kill_switch branch June 16, 2021 06:37
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants