Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Make table deletion a schema change so that truncation doesn't race with other people still using the table #5907

Closed
wants to merge 5 commits into from

Conversation

andreimatei
Copy link
Contributor

This change is Reviewable

// called multiple times if retries occur: make sure it does not have side
// effects.
// Publish updates a table descriptor. It also maintains the invariant that
// there's at most two versions of the descriptor out in the wild at any time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are

@vivekmenezes
Copy link
Contributor

Review status: 0 of 31 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


sql/structured.proto, line 259 [r5] (raw file):
I see that you have added the deleted bit in here as a mutation. Is there a reason why the deleted bit is not a part of the table descriptor itself? The reason I ask is because I do not like the descriptor is missing comment you have above. I'd much prefer deletion not be a mutation, because in my mind it's not a mutation to the descriptor. We just want to delete the descriptor.

Can you just delete all mutations and mark the table descriptor as deleted? That will disable all the mutation related work going on the table, and get it ready for deletion.


sql/parser/expr.go, line 536 [r5] (raw file):
And if the qualified name is not a table name?


Comments from Reviewable

@andreimatei
Copy link
Contributor Author

@vivekmenezes I've removed the mutation. I've also added some other utility commits. PTAL when you can.


Review status: 0 of 35 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


sql/structured.proto, line 259 [r5] (raw file):
done


Comments from Reviewable

@vivekmenezes
Copy link
Contributor

Thanks for making this change. I do like it but there is one more improvement worth making here or creating an issue over.

Once the table descriptor has been marked deleted and there are no more leases on the previous version, it would be nice to rename the table to GC (or equivalent) so that going forward all schema change commands and other commands like creating a table with the same name behave nicely. If you have a very large table you need not have to wait until the table is completely GC-ed before you can reuse the name.

The above change will also make the thinking around everything else a lot easier. You don't have to worry about new schema change mutations being added for instance to the old descriptor.


Review status: 0 of 35 files reviewed at latest revision, 30 unresolved discussions, some commit checks failed.


roachpb/errors.go, line 556 [r13] (raw file):
return "descriptor deleted"


roachpb/errors.proto, line 193 [r13] (raw file):
is there value to having the id in DescriptorDeletedError as well?


sql/alter_table.go, line 182 [r13] (raw file):
See Ben's recent change. This is going to clash with that.


sql/descriptor.go, line 133 [r13] (raw file):
why named return values? If it's for the comment above just say return false, etc


sql/drop.go, line 172 [r13] (raw file):
for _, name := range n.Names {


sql/drop.go, line 228 [r13] (raw file):
You really don't need a new mutationID here. See what's done in Rename Index


sql/lease.go, line 493 [r13] (raw file):
Why do we need this?


sql/lease.go, line 571 [r13] (raw file):
Not if it's a transaction that spans many requests.


sql/lease.go, line 667 [r13] (raw file):
Perhaps we should have the calls on one table descriptor cleanup the other tables or something like that.


sql/lease_test.go, line 348 [r13] (raw file):
better to return an error


sql/lease_test.go, line 393 [r13] (raw file):
Exec()


sql/lease_test.go, line 405 [r13] (raw file):
Let's acquire a lease on the table descriptor here before we drop the table and release it so we have the cached version.


sql/lease_test.go, line 407 [r13] (raw file):
Exec()


sql/lease_test.go, line 418 [r13] (raw file):
place on line above


sql/plan.go, line 389 [r13] (raw file):
No good reason for this tmp


sql/schema_changer.go, line 65 [r13] (raw file):
It will be nice to not use a planner since we already dropped it from backfill


sql/schema_changer.go, line 90 [r13] (raw file):
please add this line to the getTableDescriptorAtVersion() function after rebasing


sql/schema_changer.go, line 142 [r13] (raw file):
I think it's best to not add the error all over the place. Look at my comment below.


sql/schema_changer.go, line 246 [r13] (raw file):
Wanna add a comment like:

From this point forward it is guaranteed that no node will change the underlying data on the table with the exception of the backfill. It is important to grab a schema change lease on the table now to prevent some backfill operation from adding more data. Schema change commands can continue executing on the table descriptor. That's fine, but we don't want any backfill being triggered. I suppose that's protected by the fact that the backfill checks whether the descriptor has been deleted.


sql/schema_changer.go, line 652 [r13] (raw file):
No. The schema change will not be present on the table descriptor in the kv store.


sql/structured.proto, line 257 [r13] (raw file):
It is illegal to transition deleted = true to deleted = false


sql/parser/expr.go, line 540 [r13] (raw file):

0


Comments from Reviewable

@vivekmenezes
Copy link
Contributor

The more i think about this, lets make drop table super fast and decouple
the data deletion from it by gcing the table data in the async schema
changer.

I'm happy to chat about this before you proceed

On Fri, Apr 8, 2016, 1:29 PM Andrei Matei notifications@github.com wrote:

@vivekmenezes https://github.com/vivekmenezes I've removed the

mutation. I've also added some other utility commits. PTAL when you can.

Review status: 0 of 35 files reviewed at latest revision, 8 unresolved

discussions, some commit checks failed.

sql/structured.proto, line 259 [r5]
https://reviewable.io:443/reviews/cockroachdb/cockroach/5907#-KEl94X96Y8F6Mzfjpzr:-KEquSfsBkhj2IUBarcN:677588487
(raw file

optional bool deleted = 18 [(gogoproto.nullable) = false];
):

done

Comments from Reviewable
https://reviewable.io:443/reviews/cockroachdb/cockroach/5907#-:-KEquWeGO9d235f8M1q5:1624310144


You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
#5907 (comment)

@petermattis
Copy link
Collaborator

We even have an issue filed to optimize DROP/TRUNCATE:
#2003.

On Fri, Apr 8, 2016 at 4:01 PM, vivekmenezes notifications@github.com
wrote:

The more i think about this, lets make drop table super fast and decouple
the data deletion from it by gcing the table data in the async schema
changer.

I'm happy to chat about this before you proceed

On Fri, Apr 8, 2016, 1:29 PM Andrei Matei notifications@github.com
wrote:

@vivekmenezes https://github.com/vivekmenezes I've removed the

mutation. I've also added some other utility commits. PTAL when you can.

Review status: 0 of 35 files reviewed at latest revision, 8 unresolved

discussions, some commit checks failed.

*sql/structured.proto, line 259 [r5]
<
https://reviewable.io:443/reviews/cockroachdb/cockroach/5907#-KEl94X96Y8F6Mzfjpzr:-KEquSfsBkhj2IUBarcN:677588487

(raw file
<

optional bool deleted = 18 [(gogoproto.nullable) = false];

):*

done

*Comments from Reviewable
<
https://reviewable.io:443/reviews/cockroachdb/cockroach/5907#-:-KEquWeGO9d235f8M1q5:1624310144
*


You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
<
https://github.com/cockroachdb/cockroach/pull/5907#issuecomment-207526851>


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#5907 (comment)

@andreimatei
Copy link
Contributor Author

Review status: 0 of 35 files reviewed at latest revision, 30 unresolved discussions, some commit checks failed.


roachpb/errors.go, line 556 [r13] (raw file):
Done.


roachpb/errors.proto, line 193 [r13] (raw file):
i don't think so. That's an internal error, we never print it or return it to the user.


sql/descriptor.go, line 133 [r13] (raw file):
Done.


sql/drop.go, line 172 [r13] (raw file):
Done.


sql/drop.go, line 228 [r13] (raw file):
Done.


sql/lease.go, line 217 [r2] (raw file):
Done.


sql/lease.go, line 220 [r2] (raw file):
Done.


sql/lease.go, line 493 [r13] (raw file):
well it's needed because that's this acquire is how leases are refreshed. If we just return here, the releaseNonLatest below doesn't run and so we never release leases for deleted tables.


sql/lease_test.go, line 348 [r13] (raw file):
meh, it's a test


sql/lease_test.go, line 393 [r13] (raw file):
Done.


sql/lease_test.go, line 405 [r13] (raw file):
That's not so easy, since there's no releasing happening in this test (I blocked the schema changers from running above because otherwise I can't reliably perform an acquire while the descriptor is in the deleted state, as opposed to actually deleted). This is also the reason why we do acquire with a bogus node id below.


sql/lease_test.go, line 407 [r13] (raw file):
Done.


sql/lease_test.go, line 418 [r13] (raw file):
doesn't fit


sql/plan.go, line 389 [r13] (raw file):
Done.


sql/schema_changer.go, line 65 [r13] (raw file):
Done.


sql/schema_changer.go, line 90 [r13] (raw file):
Done.


sql/schema_changer.go, line 142 [r13] (raw file):
sorry, which comment is that?
This error is useful here for informing the async schema changers runner that it can drop the schema changer for this deletion. Otherwise, nothing would remove it from the list of schema changers to run.


sql/schema_changer.go, line 246 [r13] (raw file):
expanded the comment above a bit.
I don't think it's worth complicating the comment further, talking about backfill. Indeed, backfill doesn't run if the table is found to have deleted set.


sql/schema_changer.go, line 652 [r13] (raw file):
ack


sql/structured.proto, line 257 [r13] (raw file):
Done.


sql/table.go, line 46 [r2] (raw file):
Done.


sql/table.go, line 47 [r2] (raw file):
meh. This function is only used in a handful of places that want perr


sql/table.go, line 217 [r2] (raw file):
"Ex" is extended :). Win32 API convention, I think.

Why do you like p.getDescriptor() (i.e. having the caller pass in a proto to be populated versus returning a (pointer to) proto)? To save the proto allocation? In fact, I think we should change that interface.


sql/table.go, line 263 [r2] (raw file):
Done.


sql/parser/expr.go, line 536 [r5] (raw file):
if it's not a table name, you get something bogus. Garbage in, garbage out. Added a comment.
Or are you suggesting putting it some checks like in NormalizeTableName()? I think that'd be overkill.


sql/parser/expr.go, line 540 [r13] (raw file):
Done.


Comments from Reviewable

@andreimatei
Copy link
Contributor Author

@vivekmenezes as we discussed, let's do async truncate in a future PR


Review status: 0 of 35 files reviewed at latest revision, 30 unresolved discussions.


Comments from Reviewable

@vivekmenezes
Copy link
Contributor

Review status: 0 of 35 files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


sql/backfill.go, line 104 [r22] (raw file):
How can this happen. Shouldn't getTableDescFromID() return an error?


sql/config.go, line 67 [r22] (raw file):
probably makes sense for it to return an error?


sql/lease.go, line 493 [r13] (raw file):
Instead of doing this here, can we capture all this special logic in refreshMaybe (refreshLeases) so that the code path called from the async path is the only one releasing the old leases on the table.


sql/lease.go, line 130 [r22] (raw file):
move this above the version check above


sql/lease_test.go, line 411 [r22] (raw file):
Can you use the same logic used in TestRaceWithBackfill which runs the schema change in a goroutine and waits for a particular version after which all leases will be denied. I'd also add some rows into the table so that the truncation takes some time.


sql/schema_changer.go, line 142 [r13] (raw file):
Actually leave this out and create a new issue for me. I'd rather you merge this in in a state where the truncation logic starts running after all the mutations have been processed. So the deletion on its own for now will not delete the outstanding mutations. In a separate change I'm making a number of changes and I can see how I can add in the mutation deletion logic better without requiring these changes. Create an issue for me "speed up table drop by disabling schema changers before truncating table"


sql/schema_changer.go, line 191 [r22] (raw file):
For now truncate and drop the table if there are no outstanding mutations on the table.


sql/schema_changer.go, line 549 [r22] (raw file):
you also need to enqueue for table deletion. Remember Version++ and table deletion happen in two separate transactions, a schema changer can run version++ and then fail resulting in the table never getting deleted. Add a comment here to that effect.


sql/structured.go, line 642 [r22] (raw file):
I actually do not like the use a boolean here. I think there is value to having two separate methods

setUpVersion() // only sets upversion

newMutationID() // gets new mutation id and sets upversion.

Add a TODO to eliminate newMutationID() in favor of appendMutation() or something more sensible.


Comments from Reviewable

@andreimatei
Copy link
Contributor Author

Review status: 0 of 35 files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


sql/backfill.go, line 104 [r22] (raw file):
let's leave it retuning nil. It's convenient for SchemaChangeManager.IsDone


sql/config.go, line 67 [r22] (raw file):
GetZoneConfig above also returns nil. I think that's better than an error (easier to test for).


sql/lease.go, line 493 [r13] (raw file):
OK, moved to refreshMaybe


sql/lease.go, line 130 [r22] (raw file):
Done.


sql/lease_test.go, line 411 [r22] (raw file):
but in this test we don't want to run any schema changes. We just mark the table for deletion and verify that random nodes can't get a lease on it (so there's no Publish call).


sql/schema_changer.go, line 142 [r13] (raw file):
as we discussed, this is still needed


sql/schema_changer.go, line 191 [r22] (raw file):
I tried to do it but it leads to pretty ugly code, duplicating some call to sc.truncateAndDropTable() both in the short-circuit below for invalidMutationID and at the end of the method.
Besides, we can in fact run the deletion even if there are queued mutations, so I'd leave it the way it is for now.


sql/schema_changer.go, line 549 [r22] (raw file):
done, good catch


sql/structured.go, line 642 [r22] (raw file):
Done.


Comments from Reviewable

@vivekmenezes
Copy link
Contributor

I've taken another pass at this. Most of the comments are in the lease and schema change area, and very little is in the rest of the refactoring. It would make me very happy if you can take lease and schema changer out of this change and do the basic refactoring stuff in a separate PR, or consolidate this entire PR into three commits

  1. refactoring, 2. leases, 3. schema change changes.

Thanks!


Review status: 0 of 35 files reviewed at latest revision, 49 unresolved discussions, some commit checks failed.


sql/drop_test.go, line 284 [r30] (raw file):
Can you add some more data into the table and then make this test similar to TestRaceWithBackfill where DROP TABLE is called in a goroutine and the below SELECT is called immediately after the version has incremented. Thanks


sql/lease.go, line 492 [r30] (raw file):
As I said earlier I don't believe you need to purge the old leases here. Just do them in purgeOldLeases called from the async path.


sql/lease.go, line 496 [r30] (raw file):
Why did you remove the call to releaseNonlatest() below? I suppose you want to depend on the async path. I think it's okay to take it out.


sql/lease.go, line 522 [r30] (raw file):
Add a comment that this method depends on there being at least one active lease


sql/lease.go, line 590 [r30] (raw file):
I think we should rename this method to purgeOldLeases()


sql/lease.go, line 598 [r30] (raw file):
No strong reason to define this variable


sql/lease.go, line 613 [r30] (raw file):
I'm very confused by this comment. Why can't you use Acquire?


sql/lease.go, line 626 [r30] (raw file):
I'd swallow the DescriptorDeletedError here


sql/lease.go, line 636 [r30] (raw file):
I don't think we need this extra call to release


sql/schema_changer.go, line 171 [r30] (raw file):
Do you really need this needRelease stuff. Can ReleaseLease return an error instead.

Perhaps what you want is to not run the truncate and delete code through exec() and instead call its own method deleteTable(). See purgeMutations().


sql/schema_changer.go, line 178 [r30] (raw file):
I see that you changed the signature of Publish() to return the table descriptor. Is there a strong reason to do that?


sql/schema_changer.go, line 598 [r30] (raw file):
Interesting. Are you sure about that?


Comments from Reviewable

On DROP TABLE, we now synchronously mark the table as DELETED and set
up_version. The schema changer will then publish this change. This
ensures (in conjunction with the next commit) that leases will not be
handed out any more.  Then, we asynchronously truncate and delete the
table, now that nobody's writing on it any more.
Fixes cockroachdb#5483
Fixes cockroachdb#5484

Also eliminates a race in leasing:
LeaseManager.RefreshLease() was reading tableState.active without
locking.
Before this we have DisableSyncSchemaChanges, but that guy can only be
called before the server is created (otherwise it races with schema
changers). I need to be able to let schema changes run in the beginning
of the test, and block them later on. This will be used in the next
commit.
@andreimatei
Copy link
Contributor Author

@vivekmenezes the other refactoring PR has gone in. Mind taking another look?
You can now even "combine commits for review" and it's not so bad :). Thanks!


Review status: 0 of 26 files reviewed at latest revision, 49 unresolved discussions, some commit checks pending.


sql/drop_test.go, line 284 [r30] (raw file):
but what's the point? Isn't it a good thing that we test the sync schema changer? This is the way we want the workflow to generally work - user DROPs and then she immediately fails to use the table, in the same session.


sql/lease.go, line 492 [r30] (raw file):
OK you're right, lease releasing can be done exclusively in the gossip handler


sql/lease.go, line 496 [r30] (raw file):
right, now that async path is the only one doing releases


sql/lease.go, line 522 [r30] (raw file):
does it depend on that?


sql/lease.go, line 590 [r30] (raw file):
Done.


sql/lease.go, line 598 [r30] (raw file):
isn't the inline function bad enough? You want to declare it in an if?


sql/lease.go, line 613 [r30] (raw file):
comment gone


sql/lease.go, line 626 [r30] (raw file):
Done.


sql/lease.go, line 636 [r30] (raw file):
I don't think it's "extra", I think it's the only call that releases the lease taken on line 613 above


sql/schema_changer.go, line 171 [r30] (raw file):
it's needed because we want to short-circuit sc.ReleaseLease() below. Even if deletion was its own function (and in fact it pretty much is all contained in truncateAndDropTable), we'd still need to short-circuit that call. Unless we call truncateAndDropTable before the call to sc.AcquireLease above, but then we'd have to copy over the MaybeIncrementVersion call too...


sql/schema_changer.go, line 178 [r30] (raw file):
discussed offline - seems like a natural API


sql/schema_changer.go, line 598 [r30] (raw file):
well, I guess we could trigger a gossip update by calling txn.SetSystemConfigTrigger in the txn that deletes the table descriptor, but that seems fragile and async and complex to reason about. What I'm doing here seems to me like a clean solution for cleaning up these schema changers.

In the meantime, I was forced to set `systemConfigTrigger` on that txn (because of a check we have in some tests that every KV transaction that touches system keys has that flag set), but I would still keep this code.

Comments from Reviewable

@vivekmenezes
Copy link
Contributor

I'd like you to split this into four PRs:

"sql: add delete bit to table descriptor; schema change commands error on seeing the delete bit"

and another PR called

"sql: added TestingSchemaChangerCollection for ..."

and another

"sql: ensure that a lease is not taken on a deleted table"

and

"sql: delete table safely after all leases have expired"

I know this is more work but 1. you'll see these PRs getting checked in, and 2. I wont be sweating each time I have to review this.

Thanks


Review status: 0 of 26 files reviewed at latest revision, 45 unresolved discussions, some commit checks failed.


Comments from Reviewable

@andreimatei andreimatei deleted the schema-change branch April 27, 2018 18:48
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants