-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[Improvement] speed up confusion matrix calculation #465
Conversation
Codecov Report
@@ Coverage Diff @@
## master #465 +/- ##
=======================================
Coverage 84.67% 84.67%
=======================================
Files 118 118
Lines 8347 8348 +1
Branches 1366 1365 -1
=======================================
+ Hits 7068 7069 +1
Misses 932 932
Partials 347 347
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Any numbers on benchmark? |
classes = 10
length = 100
times = 10000
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
new_confusion_matrix(pred, labels) # using np.bincount
end = time.time()
print(end - start)
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
confusion_matrix(pred, labels) # original one
end = time.time()
print(end - start) 0.5623 vs 1.1669 |
classes = 100
length = 10
times = 10000
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
new_confusion_matrix(pred, labels)
end = time.time()
print(end - start)
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
confusion_matrix(pred, labels)
end = time.time()
print(end - start) 0.5215 vs 0.5446 |
The modification can truly speed up the calculation |
index_pred = label_map[plabel] | ||
confusion_mat[index_real][index_pred] += 1 | ||
max_label = label_set[-1] | ||
label_map = np.zeros(max_label + 1, dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to use a dictionary instead of np array here, as the original one:
label_map = {label: i for i, label in enumerate(label_set)}
so that 4 lines -> 1 line
any performance concern here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
label_map = {label: i for i, label in enumerate(label_set)}
y_pred_mapped = [label_map[i] for i in y_pred]
y_real_mapped = [label_map[i] for i in y_real]
This is much slower even than the original one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numbers would be appreciated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, the for loop is much slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal by kenny VS The original one before this PR:
for classes = 10, length = 100, times = 10000
, 1.2643 vs 1.1462
for classes = 10, length = 100, times = 10000
, 0.6333 vs 0.5554
even worse
This PR uses
np.bincount
to speed up confusion matrix calculation