Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

csi: nil-check allocs for claim methods #7760

Merged
merged 2 commits into from
Apr 21, 2020
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Apr 20, 2020

For #7712

Although #7708 has a longer-term fix, this patch should prevent a reported panic.

Copy link
Contributor

@langmartin langmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, it looks like errors are handled down through the client, this seems good until the other fix.

@notnoop
Copy link
Contributor

notnoop commented Apr 20, 2020

Mind if you elaborate on how this fixes the issue.

Reading the stack trace in issue #7712, I suspect something else is happening and that it's due to CSIVolumeDenormalize not checking for not found allocs properly and potentially insert a nil alloc reference in vol.{ReadAllocs|WriteAllocs}.

Here is my analysis:

The exception happens in csi.go:399:

// Check the blocking allocations to see if they belong to this job
for _, a := range v.WriteAllocs {
	if a.Namespace != alloc.Namespace || a.JobID != alloc.JobID {

I believe that alloc in this function is not-nil. applyCSIVolumeClaim checks if alloc is nil in fsm.go#L1172 and returns an error otherwise. So in line 1181, alloc != nil, and the rest of methods involved pass the alloc as-id, in state_store.go:2049, csi.go:362, and as such alloc must be non-nil in csi.go:399.

Alternatively, my guess is that v.WriteAllocs contain a nil alloc entry. Looking at CSIVolumeDenormalize, I notice that it doesn't handle nils returned by s.AllocByID(..), which result into a nil in WriteAllocs map, and dereferencing it in ClaimWrite result the panic. I suspect this happens due to allocs being GCed elsewhere such that the alloc is no longer found?

Incidentally, I think the error check in https://github.com/hashicorp/nomad/blob/v0.11.0/nomad/fsm.go#L1174-L1176 is dead code. We already check for error nilness earlier.

@tgross
Copy link
Member Author

tgross commented Apr 21, 2020

Mind if you elaborate on how this fixes the issue.

On reflection, it's probably not the best approach... the dangers of hasty patches.

Reading the stack trace in issue #7712, I suspect something else is happening and that it's due to CSIVolumeDenormalize not checking for not found allocs properly and potentially insert a nil alloc reference in vol.{ReadAllocs|WriteAllocs}.

You're right there... I think was an unfortunate design artifact from the original pass at the GC process that we never really cleaned up. At this point collectClaimsToGCImpl ignores nil-Allocs so I don't think there's any reason to allow nil-Allocs to get passed around anymore. Let me take a quick pass at fixing that and testing it end-to-end before I say so for sure.

Incidentally, I think the error check in https://github.com/hashicorp/nomad/blob/v0.11.0/nomad/fsm.go#L1174-L1176 is dead code. We already check for error nilness earlier.

Agreed.

Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Ship it!

@tgross
Copy link
Member Author

tgross commented Apr 21, 2020

@notnoop I've given this a smoke test in e2e as follows:

  • run the EBS plugins
  • register a volume
  • run a job that uses that volume, and check it has written into the volume
  • stop the job, verify the volume is detached from the host
  • run the job again, and check that the volume contains the writes from both versions
  • stop the job again, verify the volume is detached from the host
  • disable eligibility for the node where that job ran (just to make sure there's no funny business with client state)
  • run the job one more time (now landing on a different node), and check that the volume contains the writes from all 3 versions

@tgross tgross merged commit 9f5156a into master Apr 21, 2020
@tgross tgross deleted the b_csi_nil_check_claim branch April 21, 2020 12:32
@github-actions
Copy link

github-actions bot commented Jan 9, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 9, 2023
# for free to subscribe to this conversation on GitHub. Already have an account? #.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants