[Improve] Speed up data preprocessor. #1064
Conversation
Codecov Report — Base: 0.02% // Head: 91.32% // Increases project coverage by +91.29%.

```
@@            Coverage Diff             @@
##           dev-1.x    #1064     +/-  ##
============================================
+ Coverage     0.02%   91.32%   +91.29%
============================================
  Files          121      128        +7
  Lines         8217     9509     +1292
  Branches      1368     1498      +130
============================================
+ Hits             2     8684     +8682
+ Misses        8215      639     -7576
- Partials         0      186      +186
```

Flags with carried forward coverage won't be shown.
This line can be removed to keep it consistent with the other config files.
```python
data['inputs'] = inputs
return data
data_samples = data.get('data_samples', None)
```
To help users understand our codebase, maybe we need to update the documentation by adding a description of the `data_samples` used in MMClassification (with text or a link to the related MMEngine document); this could be added to the documentation TODO list.
We can create an issue and add it to Oct. TODO list
Commits updated from 25b549d to 08dd17c.
LGTM
* [Improve] Speed up data preprocessor.
* Add `ClsDataSample` serialization override functions.
* Add unit tests.
* Modify configs to fit new mixup args.
* Fix `num_classes` of the ImageNet-21k config.
* Update docs.
Motivation
The original data preprocessor consumes much time because it casts the label tensors of the samples to the device one by one.
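As a sketch of why per-sample casting is slow (using NumPy stand-ins here, since the real preprocessor moves torch tensors to CUDA): converting each label individually triggers one operation per sample, while stacking first needs only a single conversion. The sample structure below is hypothetical.

```python
import numpy as np

# Hypothetical samples; in MMClassification each would carry a ClsDataSample.
samples = [{"gt_label": [i]} for i in range(4)]

def cast_per_sample(samples):
    # Original approach: one conversion (one device transfer) per sample.
    return [np.asarray(s["gt_label"], dtype=np.int64) for s in samples]

def cast_batched(samples):
    # Faster approach: stack the labels first, then do a single
    # conversion (a single .cuda() call in the real preprocessor).
    return np.stack([s["gt_label"] for s in samples]).astype(np.int64)

print(cast_batched(samples).shape)  # (4, 1)
```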
Modification

* In `ClsDataPreprocessor`, only cast the `gt_label` of data samples to CUDA instead of the whole data sample, and stack/concat the tensors before casting.
* `Mixup` now directly processes the batch inputs and batch scores (one-hot format labels), and won't accept normal labels or data samples.
* Via `ForkingPickler`, convert all tensors to NumPy arrays during serialization and convert them back during deserialization. This decreases the consumption of file descriptors in the dataloader.

Here is the speed comparison on MobileNetV2 (batch size 64):
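A minimal sketch of the serialization trick (with a stand-in class, since registering `torch.Tensor` itself would require torch): `ForkingPickler` lets you register a custom reduce function, so objects travel between dataloader workers as plain NumPy arrays instead of shared-memory handles that keep file descriptors open.

```python
import pickle
import numpy as np
from multiprocessing.reduction import ForkingPickler

class FakeTensor:
    """Stand-in for torch.Tensor in this sketch."""
    def __init__(self, data):
        self.data = np.asarray(data)

def _rebuild(arr):
    # Deserialization: turn the NumPy array back into a "tensor".
    return FakeTensor(arr)

def _reduce(t):
    # Serialization: ship a plain NumPy array instead of a
    # shared-memory handle, so no extra file descriptor is opened.
    return (_rebuild, (t.data,))

ForkingPickler.register(FakeTensor, _reduce)

blob = ForkingPickler.dumps(FakeTensor([1, 2, 3]))
restored = pickle.loads(blob)
print(restored.data.tolist())  # [1, 2, 3]
```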
BC-breaking (Optional)

The `Mixup`, `CutMix`, and `ResizeMix` classes won't accept the `num_classes` argument anymore; it becomes an argument of `ClsDataPreprocessor`. In the config files, the changes are as below:
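The original config diff is not preserved here; below is a hedged sketch of the kind of change described (field names follow common MMClassification config conventions, and the `alpha`/`num_classes` values are hypothetical):

```python
# Before: Mixup carried its own num_classes argument.
old_style = dict(
    train_cfg=dict(augments=dict(type='Mixup', alpha=0.2, num_classes=1000)),
)

# After: num_classes moves to ClsDataPreprocessor; Mixup drops it.
new_style = dict(
    data_preprocessor=dict(type='ClsDataPreprocessor', num_classes=1000),
    train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
)
```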
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
Checklist
Before PR:
After PR: