Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix clickhouse client race during batch commit #4071

Merged
merged 1 commit into from
Aug 3, 2022

Conversation

wsquan171
Copy link
Contributor

@wsquan171 wsquan171 commented Aug 2, 2022

This change fixes an unsafe deque access in flow aggregator clickhouse client. Reads of the deque including Len() and At() are not protected with mutex during batchCommitAll(), which could run concurrently with another routine writing items to deque with CacheSet().

To avoid holding the mutex for extended period of time, the following behavior is adopted:

  1. (With mutex) Pop front #Len() items off deque and store in buffer
  2. (Without mutex) Transform records in buffer and batch export them with SQL driver. This step is where I/O happens.
  3. (With mutex) If SQL commit rolled back, PushFront records in buffer back to deque. Oldest records will be discarded if deque full.

Also bumped github.com/gammazero/deque to v0.1.2.

@wsquan171
Copy link
Contributor Author

/test-all

@codecov
Copy link

codecov bot commented Aug 2, 2022

Codecov Report

Merging #4071 (5354f99) into main (b66ffda) will increase coverage by 3.16%.
The diff coverage is 94.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4071      +/-   ##
==========================================
+ Coverage   64.21%   67.38%   +3.16%     
==========================================
  Files         294      293       -1     
  Lines       44555    44288     -267     
==========================================
+ Hits        28612    29844    +1232     
+ Misses      13612    12143    -1469     
+ Partials     2331     2301      -30     
Flag Coverage Δ
e2e-tests 40.63% <ø> (?)
integration-tests 35.19% <ø> (?)
kind-e2e-tests 50.20% <63.15%> (-0.52%) ⬇️
unit-tests 44.34% <94.73%> (+0.07%) ⬆️
Impacted Files Coverage Δ
...lowaggregator/clickhouseclient/clickhouseclient.go 83.37% <94.73%> (+0.39%) ⬆️
pkg/ipfix/ipfix_intermediate.go 0.00% <0.00%> (-90.91%) ⬇️
pkg/ipfix/ipfix_collector.go 0.00% <0.00%> (-83.34%) ⬇️
pkg/flowaggregator/certificate.go 0.00% <0.00%> (-75.53%) ⬇️
pkg/config/flowaggregator/default.go 0.00% <0.00%> (-43.75%) ⬇️
pkg/agent/cniserver/ipam/antrea_ipam.go 52.81% <0.00%> (-21.22%) ⬇️
pkg/controller/networkpolicy/tier.go 50.00% <0.00%> (-5.00%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 69.71% <0.00%> (-4.23%) ⬇️
...agent/flowexporter/connections/deny_connections.go 81.72% <0.00%> (-3.23%) ⬇️
pkg/agent/memberlist/cluster.go 71.24% <0.00%> (-2.88%) ⬇️
... and 62 more

@wsquan171
Copy link
Contributor Author

/test-e2e
/test-integration
/test-conformance
/test-networkpolicy

@wsquan171 wsquan171 marked this pull request as ready for review August 2, 2022 23:34
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for taking care of this

@@ -431,11 +433,18 @@ func (ch *ClickHouseExportProcess) batchCommitAll() (int, error) {
}

// populate items from deque
recordsToExport := make([]*ClickHouseFlowRow, 0, currSize)
ch.mutex.Lock()
for i := 0; i < currSize; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's worth adding a comment explaining why currSize is guaranteed to still be "valid", i.e. at most equal to the actual queue size (it's possible for the queue to have grown since the earlier check, but not possible for the queue to have shrunk). If this is not true, then the code is not correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's correct because CacheSet can never shrink the queue by itself. The following loop:

for ch.deque.Len() >= ch.queueSize {
        ch.deque.PopFront()
}

will never remove more than 1 item, and we push a new item right after with ch.deque.PushBack(chRow), so overall the size of the queue either increases by 1 or stays the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can rather read again within mutex context to be 100% safe and save the detailed explanation (which depends on implementation and could be forgotten to be updated). adding a comment for why it could have changed instead of explaining why it's still correct.

@@ -487,24 +496,33 @@ func (ch *ClickHouseExportProcess) batchCommitAll() (int, error) {

if err != nil {
klog.ErrorS(err, "Error when adding record")
ch.pushFrontRecords(recordsToExport)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new implementation makes sense, given that the likelihood of an error is small

pkg/flowaggregator/clickhouseclient/clickhouseclient.go Outdated Show resolved Hide resolved
This change fixes an unsafe deque access in flow aggregator clickhouse
client. Reads of the deque including Len() and At() are not protected
with mutex during batchCommitAll(), which could run concurrently with
another routine writing items to deque with CacheSet().

To avoid holding the mutex for extended period of time, the following
behavior is adopted:
1. (With mutex) Pop front #Len() items off deque and store in buffer
2. (Without mutex) Transform records in buffer and batch export them
   with SQL driver. This step is where I/O happens.
3. (With mutex) If SQL commit rolled back, PushFront records in buffer
   back to deque. Oldest records will be discarded if deque full.

Also bumped github.com/gammazero/deque to v0.1.2.

Signed-off-by: Shawn Wang <wshaoquan@vmware.com>
@wsquan171
Copy link
Contributor Author

/test-e2e
/test-integration
/test-conformance
/test-networkpolicy

@wsquan171
Copy link
Contributor Author

/test-integration

@antoninbas antoninbas merged commit 13bdcd3 into antrea-io:main Aug 3, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants