
sql: fix perf problems with DROP TABLE/INDEX #7499

Closed
petermattis opened this issue Jun 28, 2016 · 21 comments
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior.

@petermattis
Collaborator

DROP {TABLE,INDEX} both issue a single transactional DelRange operation to delete all of the table/index data. In large tables, this creates an extremely large transaction which has various performance problems (that should be fixed separately).

Since DROP {TABLE,INDEX} have already updated the table descriptor and other metadata to make the table/index unavailable, we should be able to use the chunked-style implementation used for backfilling indexes/columns upon creation. Even better, it isn't immediately obvious why we need to use a transaction at all here. @tschottdorf would it be reasonable to perform a single non-transactional DelRange for all of a table's data?
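The chunked-style approach mentioned above can be sketched roughly as follows. This is a toy simulation, not CockroachDB code: an in-memory map stands in for the KV store, and `chunkedDelete` is a hypothetical name; real code would scan and delete batches of keys against the KV layer, one transaction per chunk, instead of one giant transactional DelRange.

```go
package main

import (
	"fmt"
	"sort"
)

// chunkedDelete simulates chunked deletion of a table's key span:
// rather than one huge transactional DelRange, delete at most
// chunkSize keys per "transaction". Returns the number of batches.
func chunkedDelete(kv map[string]string, start, end string, chunkSize int) int {
	batches := 0
	for {
		// Scan the keys currently in [start, end).
		var keys []string
		for k := range kv {
			if k >= start && k < end {
				keys = append(keys, k)
			}
		}
		if len(keys) == 0 {
			return batches
		}
		sort.Strings(keys)
		if len(keys) > chunkSize {
			keys = keys[:chunkSize]
		}
		// Delete this chunk in its own (simulated) transaction.
		for _, k := range keys {
			delete(kv, k)
		}
		batches++
	}
}

func main() {
	kv := map[string]string{}
	for i := 0; i < 10; i++ {
		kv[fmt.Sprintf("/table/1/%03d", i)] = "v"
	}
	n := chunkedDelete(kv, "/table/1/", "/table/1/~", 3)
	fmt.Println(n, len(kv)) // 4 batches (3+3+3+1), 0 keys remain
}
```

Each batch commits independently, so no single transaction grows with table size — the property the backfiller relies on.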

@tbg
Member

tbg commented Jun 28, 2016

Yes, that would be reasonable but is not trivial: we currently auto-wrap range-spanning operations in a transaction at the DistSender level (because we generally have to). It's worth allowing for something like that, though. The "trivial" solution is a flag on BatchRequest (or a special consistency mode; we could even make INCONSISTENT mean that range ops are allowed to span ranges without requiring a txn). Those operations could also be heavily parallelized.
Once we have that, the nice part is that we don't have to do any upstream chunking at all.

It's also good to keep in mind that with leader-proposed Raft, these are precisely the worst case: they essentially send snapshot-sized commands through Raft, so it would be good to avoid them. Also, we don't need MVCC at all. Instead of DeleteRange, maybe we should invest in a ClearRange command which would perform the deletion as a side effect instead.

The two paragraphs above are mostly orthogonal, so forgetting about ClearRange for now and looking at the other thing is a good way to start.

@andreimatei
Contributor

andreimatei commented Jun 28, 2016

I was thinking we'd wait for a distributed implementation of DROP at the SQL level; it should be trivial to parallelize, having each range clear itself non-transactionally (or one txn per range). This would avoid having to introduce any special KV support.

In the meantime, we can implement the truncation in a streaming fashion, with a series of DELETE ... LIMIT statements.

@petermattis
Collaborator Author

Related, TRUNCATE <table> performs a DelRange on all of the table data. I wonder if we should turn TRUNCATE into a schema change operation.

FYI, see #2003 which is related to (or a dup of) this issue.

@petermattis
Collaborator Author

@tschottdorf Can we really avoid using MVCC when dropping index/table data? How would that interact with time-travel queries?

@tbg
Member

tbg commented Jun 28, 2016

Ah, forgot about those. Yes, we need to keep that for time-travel queries.

What's also potentially interesting is that we wouldn't really need to actively delete the data if we could instead set a TTL and let the GC queue toss out the data over time, as permitted by time-travel considerations. So if TTLs are anywhere on our roadmap, that's worth a thought.

@tbg
Member

tbg commented Jun 28, 2016

Regardless of how we go about it, we should also figure out why the intents aren't proactively resolved. In an ideal world, the intent range is attached to EndTransaction by TxnCoordSender and used even when the aborting EndTransaction isn't valid any more. This is better since a single ResolveIntentRange can clear out the whole swath of keyspace we're otherwise running into one by one. Worth investigating what exactly the process is.

@petermattis
Collaborator Author

Allowing dropped table and index data to be deleted asynchronously by the GC queue is certainly attractive, though (as mentioned in #2003) we have to consider how that will eventually interact with accounting.

Not exactly sure what you're envisioning for TTLs. I think we could enhance the GC queue so that it could determine when a range of data was deleted. For example, if TableDescriptor had a deleted_at field, the GC queue could see if the table associated with the range it is considering has been deleted. This would essentially be moving per-key tombstones to a per-table tombstone.

@tbg
Member

tbg commented Jun 28, 2016

I'm not envisioning anything for TTLs right now, but it has been requested a bunch of times and if we were to entertain it, then there would be an elegant way to remove old tables as they go out of scope by means of the GCQueue (which would be nice because that's also the safeguard of time travel queries).
Hauling the GC queue up to a SQL level seems like a project I'm not sure is worth it.

I think we'll do fine with what we have, at least for some time, once the deletion is less completely stupid than it is now.

Since the current situation is pretty bad, we should do something about it. I don't know if waiting for distributed SQL is an option or even the right thing to do. At some point we'll have to figure out what the responsibilities of DistSender vs DistSQL are. I'd rather we do that sooner than later.

@petermattis
Collaborator Author

No need to haul the GC queue up to the SQL level in order to give it knowledge of keys. For example, we could add hooks to the GC queue which the SQL level could hook into: given a span of keys, return a list of spans with corresponding deletion timestamps.
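The proposed hook could have a shape along these lines. This is a sketch under stated assumptions — `Span`, `DeletedSpan`, and `DeletionHook` are made-up names for illustration, not existing interfaces: the GC queue hands the hook the span a replica covers, and gets back the sub-spans known (at the SQL level) to be deleted, with their deletion timestamps.

```go
package main

import "fmt"

// Span is a [Key, EndKey) range of the keyspace.
type Span struct{ Key, EndKey string }

// DeletedSpan pairs a span with the timestamp at which its data was
// logically deleted (e.g. when the owning table was dropped).
type DeletedSpan struct {
	Span
	DeletedAtNanos int64
}

// DeletionHook is the hypothetical callback the GC queue would invoke:
// given a span of keys, return the sub-spans with corresponding
// deletion timestamps.
type DeletionHook func(span Span) []DeletedSpan

func main() {
	// A toy hook that knows table 51 (span [/51,/52)) was dropped at t=100.
	hook := DeletionHook(func(s Span) []DeletedSpan {
		if s.Key <= "/51" && s.EndKey >= "/52" {
			return []DeletedSpan{{Span{"/51", "/52"}, 100}}
		}
		return nil
	})
	for _, ds := range hook(Span{"/50", "/60"}) {
		fmt.Printf("gc span [%s,%s) deleted at %d\n", ds.Key, ds.EndKey, ds.DeletedAtNanos)
	}
}
```

The KV layer never learns what a table is; it only sees opaque spans and timestamps, which keeps the dependency pointing the right way.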

@vivekmenezes
Contributor

@petermattis not sure what you had in mind. Another approach would be to create a new system table holding all the key spans that have been deleted.

@vivekmenezes
Contributor

Talked to Peter: the API, given a span of keys, will return spans with corresponding deletion timestamps. The implementation will identify all table IDs that fit within the span provided, and check whether those tables, or the indexes within them, have been deleted.

@tbg
Member

tbg commented Jun 28, 2016

That's possible, as is a special RPC that marks the whole range for deletion (in which case the GC will pick it up at high priority). I'd avoid upcalling into SQL from KV because that would create a dependency we should minimize (on top of being a little opaque). Let's not get sidetracked on this GC issue.

@vivekmenezes
Contributor

Given that we need a fix for this soon, I'm tempted to move forward by adding a limit to DelRange(). I think it's a useful addition to the DelRange API even outside of this issue. Any concerns with moving forward on that approach?

@petermattis
Collaborator Author

@vivekmenezes No concerns. I thought you were already moving forward with that approach.

@vivekmenezes
Contributor

I've not been paying any attention to this issue and was expecting to get to it in two weeks. It's hurting users and I'm going to drop what I'm doing (adding checkpointing to schema changes) in favor of this issue.

@tbg
Member

tbg commented Jul 7, 2016

SGTM. Unfortunately there isn't a trivial way to do this within storage.


@vivekmenezes
Contributor

Can I do it the same way Scan() is implemented?

@petermattis
Collaborator Author

You mean add a maxRows parameter to DelRange? Yes, I think that should work.
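A maxRows-limited DelRange, mirroring Scan's limit-and-resume behavior, could work roughly like this. Again a toy in-memory simulation with hypothetical names (`delRangeMax`), not the actual KV API: the call deletes at most maxRows keys and hands back a resume key so the caller can loop until the span is empty.

```go
package main

import (
	"fmt"
	"sort"
)

// delRangeMax deletes at most maxRows keys in [start, end) from kv and
// returns the number deleted plus the key to resume from ("" when done),
// analogous to how Scan bounds its result set with maxRows.
func delRangeMax(kv map[string]string, start, end string, maxRows int) (int, string) {
	var keys []string
	for k := range kv {
		if k >= start && k < end {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	if len(keys) <= maxRows {
		for _, k := range keys {
			delete(kv, k)
		}
		return len(keys), "" // done: no resume key
	}
	for _, k := range keys[:maxRows] {
		delete(kv, k)
	}
	return maxRows, keys[maxRows] // caller resumes at this key
}

func main() {
	kv := map[string]string{"a": "1", "b": "2", "c": "3", "d": "4"}
	deleted, start := 0, "a"
	for {
		n, resume := delRangeMax(kv, start, "z", 3)
		deleted += n
		if resume == "" {
			break
		}
		start = resume
	}
	fmt.Println(deleted, len(kv)) // 4 0
}
```

Each bounded call can run in its own small transaction, which is the whole point: no single txn ever spans the full table.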

@andreimatei
Contributor

This was also on my list, but I'm glad Vivek's taking it! :)
I was thinking of leaving KV alone, and doing the limit in SQL, in whatever way the DeletedNode deals with a LIMIT clause (I guess a scan followed by point deletes). Might be some trickiness because the table descriptor cannot be read in the usual way any more in the schema changer (since it's been marked for deletion), but I'm sure we can hack around that somehow.

I'm not sure how DelRange would work with a limit... how would DistSender split it across the different KV ranges?

@petermattis petermattis modified the milestone: Q3 Jul 11, 2016
@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Jul 11, 2016
@vivekmenezes
Contributor

Ran a drop table on a 3M+ row table on a GCE cluster with 3 nodes

root@104.196.170.123:26257> show tables;
+--------+
| Table  |
+--------+
| blocks |
+--------+
(1 row)
root@104.196.170.123:26257> select count(*) from blocks;
+----------+
| count(*) |
+----------+
|  3458985 |
+----------+
(1 row)
root@104.196.170.123:26257> drop table blocks;
DROP TABLE
root@104.196.170.123:26257> SELECT count(*) FROM blocks;
pq: table "blocks" does not exist

while the table was being dropped in a separate window I ran

root@104.196.170.123:26257> show tables;
+--------+
| Table  |
+--------+
| blocks |
+--------+
(1 row)
root@104.196.170.123:26257> select count(*) from blocks;
pq: table is being deleted

@vivekmenezes
Contributor

#9318 tracks the issue of the dropped table still showing up in show tables. I think it should still show up there, with some deletion-progress indicator.

I'm closing this issue because it's fixed.
