-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
distributed provisioning: allowed topologies + immediate binding #612
Conversation
9183218
to
cfabf52
Compare
/test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-master |
@pohly: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
The whole |
/assign @verult |
pkg/controller/controller.go
Outdated
@@ -1293,6 +1293,32 @@ func (p *csiProvisioner) checkNode(ctx context.Context, claim *v1.PersistentVolu | |||
return false, nil | |||
} | |||
|
|||
// If the storage class as AllowedTopologies set, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: has
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cfabf52
to
d3f761e
Compare
pkg/controller/controller.go
Outdated
// If the storage class has AllowedTopologies set, then | ||
// it must match our own. We can find out by trying to | ||
// create accessibility requirements. If that fails, | ||
// we should become the owner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not become the owner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duh. Of course. Fixed.
claim.Name, | ||
sc.AllowedTopologies, | ||
node, | ||
p.strictTopology, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If strictTopology
is false, does GenerateAccessibilityRequirements()
return an error if node
isn't part of AllowedTopologies
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should. The "strict topology" case is just a simpler version of that existing, older check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sg. I haven't been able to find the old logic but since we have test coverage this should be great. I'll find the logic eventually 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually still couldn't find the logic for this when strictTopology
is false. If you happen to know off-hand could you point me to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic must exist somewhere because if it doesn't, GenerateAccessibilityRequirements
would return a topology and no error. checkNode()
then proceeds and tries to becomeOwner()
, in which case either
- it succeeds and selected node is updated, or
- it fails and returns an error, and the caller,
ShouldProvision()
, proceeds to returntrue
.
In both cases the distributed immediate, allowed topologies not okay
test case should've failed (expectedSelectedNode: ""
and expectNoProvision: true
)
d3f761e
to
e8dfeb3
Compare
} | ||
|
||
if expectSelectedNode != "" { | ||
if tc.volOpts.PVC != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you quickly explain why this check was added? (just for my understanding, not necessarily as a code comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the sake of completeness. All other pointers were also checked for non-nil.
expectErr: true, | ||
expectState: controller.ProvisioningNoChange, | ||
skipCreateVolume: true, | ||
expectNoProvision: true, // not owner and will not change that either |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prior to the fix, the test will fail at expectNoProvision
, while expectErr
, expectState
, and skipCreateVolume
will pass, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would have passed expectNoProvision
, but for the wrong reason (not owner yet). Without the fix, it fails expectSelectedNode
because the provisioner would set the current node to itself. The correct behavior is to not provision and not set the node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. The behavior of selected node is a little non-obvious to me, and I had trouble making sense of what behavior should be expected from the test case parameters. Given enough time I should be able to figure it out but I couldn't quite grasp it just by scanning the test case.
Just a nit - not sure what could improve it but maybe a comment explaining the scenario could help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe consider adding a comment that selecting a node == becoming owner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment.
/lgtm |
When determining who should provision a PVC with immediate binding, each instance must check the AllowedTopologies before trying to become the owner of the PVC, because if the allowed topology does not include the current node, then all provisioning attempts will fail with "error generating accessibility requirements". To test this, the provisioning tests get refactored (ShouldProvision gets tested separately) and extended (new test cases with AllowedTopologies). While at it, the redundant printing of the test case name gets removed. It's redundant because sub tests are now used, so each error message is automatically associated with its test.
ShouldProvision must not try to become the owner if capacity is insufficient. That was implemented already, just not tested.
e8dfeb3
to
f262e0f
Compare
I do, and did 😁 Please re-approve. |
/lgtm |
/assign @msau42 |
p.csiNodeLister, | ||
p.nodeLister); err != nil { | ||
if logger.Enabled() { | ||
logger.Infof("%s: ignoring PVC %s/%s, allowed topologies is not compatible: %v", caller, claim.Namespace, claim.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call can error for other reasons besides incompatible allowedTopologies. Will this correctly recover in that case? Do we need a more specific error code to look for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which failures? When deployed on a node, fake CSINode and Node listers are used (
external-provisioner/cmd/csi-provisioner/csi-provisioner.go
Lines 304 to 340 in 213cd3d
// Avoid watching in favor of fake, static objects. This is particularly relevant for | |
// Node objects, which can generate significant traffic. | |
csiNode := &storagev1.CSINode{ | |
ObjectMeta: metav1.ObjectMeta{ | |
Name: nodeDeployment.NodeName, | |
}, | |
Spec: storagev1.CSINodeSpec{ | |
Drivers: []storagev1.CSINodeDriver{ | |
{ | |
Name: provisionerName, | |
NodeID: nodeDeployment.NodeInfo.NodeId, | |
}, | |
}, | |
}, | |
} | |
node := &v1.Node{ | |
ObjectMeta: metav1.ObjectMeta{ | |
Name: nodeDeployment.NodeName, | |
}, | |
} | |
if nodeDeployment.NodeInfo.AccessibleTopology != nil { | |
for key := range nodeDeployment.NodeInfo.AccessibleTopology.Segments { | |
csiNode.Spec.Drivers[0].TopologyKeys = append(csiNode.Spec.Drivers[0].TopologyKeys, key) | |
} | |
node.Labels = nodeDeployment.NodeInfo.AccessibleTopology.Segments | |
} | |
klog.Infof("using local topology with Node = %+v and CSINode = %+v", node, csiNode) | |
// We make those fake objects available to the topology code via informers which | |
// never change. | |
stoppedFactory := informers.NewSharedInformerFactory(clientset, 1000*time.Hour) | |
csiNodes := stoppedFactory.Storage().V1().CSINodes() | |
nodes := stoppedFactory.Core().V1().Nodes() | |
csiNodes.Informer().GetStore().Add(csiNode) | |
nodes.Informer().GetStore().Add(node) | |
csiNodeLister = csiNodes.Lister() | |
nodeLister = nodes.Lister() |
I think all that remains is an unsuitable selected node, and therefore we don't need a more detailed error.
After having described it like this, the code could also be replaced with a direct comparison of the selected node. I'm undecided whether that is better: if GenerateAccessibilityRequirements
then fails after the node became owner (against all expectations), then the volume is stuck. More code would be needed to handle that, like detecting this particular problem earlier and/or recovering earlier.
Given that immediate binding isn't recommended when using distributed provisioning, I think I prefer to merge this simpler solution even if it doesn't cover some edge cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. It wasn't obvious to me from a quick review of the code that this is the case. I think from a general readability and maintenance standpoint, it's better to code more defensively in case the function changes in the future and we can no longer make this assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a specific error and check for it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate error code, or a new condition, or perhaps a more targeted function like you mentioned would all work. Out of all those, I like the separate method approach the most because there's no confusion about what parts of the function are/are not relevant depending on the codepath.
Anyway, I think we can leave this to a future refactoring. For now, this works with the current assumptions.
/approve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: msau42, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When using immediate binding with allowed topologies set in the storage class together with distributed provisioning, node instances must ignore PVCs that do not fit their node.
Which issue(s) this PR fixes:
Fixes #611
Special notes for your reviewer:
Includes some drive-by test enhancements.
Does this PR introduce a user-facing change?:
/cc @chrishenzie