Reuse memory in TENSORSET #540
Conversation
Codecov Report
@@ Coverage Diff @@
## master #540 +/- ##
==========================================
+ Coverage 75.74% 75.76% +0.02%
==========================================
Files 25 25
Lines 5384 5401 +17
==========================================
+ Hits 4078 4092 +14
- Misses 1306 1309 +3
Continue to review full report at Codecov.
Looks good; one small comment:
@@ -814,7 +868,14 @@ int RAI_parseTensorSetArgs(RedisModuleCtx *ctx, RedisModuleString **argv, int ar
size_t datalen;
const char *data;
DLDataType datatype = RAI_TensorDataTypeFromString(typestr);
*t = RAI_TensorCreateWithDLDataType(datatype, dims, ndims, tensorAllocMode);
if (datafmt == REDISAI_DATA_BLOB) {
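The change threads an allocation-mode argument into tensor creation so the parsed tensor can borrow the blob bytes already held in `argv` instead of copying them into a fresh allocation. A minimal, self-contained sketch of that borrow-vs-copy idea (the names `alloc_mode_t`, `tensor_t`, and `tensor_set_blob` are illustrative, not the actual RedisAI API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Two ways to attach blob data to a tensor: copy it into a new
 * allocation, or borrow the caller's buffer (e.g. the argv string)
 * so no memcpy and no new page faults are incurred. */
typedef enum { ALLOC_COPY, ALLOC_BORROW } alloc_mode_t;

typedef struct {
    char *data;
    size_t len;
    int owns_data; /* free data on destroy only if we copied it */
} tensor_t;

static void tensor_set_blob(tensor_t *t, char *blob, size_t len, alloc_mode_t mode) {
    t->len = len;
    if (mode == ALLOC_COPY) {
        t->data = malloc(len);
        memcpy(t->data, blob, len);
        t->owns_data = 1;
    } else {
        /* Reuse the memory that already holds the command argument. */
        t->data = blob;
        t->owns_data = 0;
    }
}
```

The borrowed buffer must of course outlive the tensor, which is why this only works when the tensor takes over (or pins) the argv string.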
Since you are already checking it here, the switch in line 889 is redundant: it checks only a single case. I think you can move its content into the else block in line 875.
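The suggested refactor, sketched with illustrative names (not the exact RedisAI code): once `datafmt` has been tested in the `if`, a `switch` in the `else` branch that handles only one case adds nothing and can be inlined away:

```c
#include <string.h>

/* Illustrative enum; the real code tests REDISAI_DATA_BLOB vs values. */
typedef enum { DATA_BLOB, DATA_VALUES } data_fmt_t;

static const char *parse_data(data_fmt_t datafmt) {
    if (datafmt == DATA_BLOB) {
        return "parsed-blob";
    } else {
        /* Previously: switch (datafmt) { case DATA_VALUES: ... } —
         * redundant, since datafmt can only be DATA_VALUES here. */
        return "parsed-values";
    }
}
```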
Done
Force-pushed e37f506 to 52c4235 (compare)
I quickly re-ran the benchmark that produced the best result for the autobatching variation (60 clients, autobatching 30). Notice that at ~400 inferences/sec we expect a bandwidth of 224 * 224 * 3 * 4 bytes * 400 / (1024^3) ≈ 0.22 GB/sec, as shown below by the table.

Looking at the numbers, the memory bandwidth reduction was around 0.3 GB/sec (roughly matching the expected value), with an improvement of about 5% in both overall ops/sec and inference latency.

Detail of memory bandwidth via the minor page faults/sec counter of e06c663 (the drops mark a new test iteration; 3 iterations). Detail of memory bandwidth via the minor page faults/sec counter of 52c4235 (the drops mark a new test iteration; 3 iterations).

Bottom line: this generally improves inference performance (even when it is masked by MODELRUN + TENSORGET), so I would merge this ASAP and then work on further reducing overall memory bandwidth, for example by pushing forward the investigation of reusing tensor-allocated memory on backends (e.g. TensorFlow with DLPack inputs).
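As a sanity check on the arithmetic above, a tiny helper (hypothetical, for illustration only) computing the expected blob bandwidth for 224x224x3 float32 tensors at a given inference rate:

```c
#include <assert.h>

/* Expected memory bandwidth of copying one h*w*c tensor of
 * elem_bytes-byte elements, `rate` times per second, in GiB/s.
 * For 224x224x3 float32 at 400 inferences/sec this lands at ~0.22. */
static double blob_bandwidth_gib_per_sec(int h, int w, int c,
                                         int elem_bytes, double rate) {
    double bytes_per_sec = (double)h * w * c * elem_bytes * rate;
    return bytes_per_sec / (1024.0 * 1024.0 * 1024.0);
}
```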
Addresses #515 by reusing memory allocated in `argv` in `TENSORSET`.