Normalize the local_snapshots database schema #197
Comments
This is sort of a blocker for #106.
Here's a downside of the normalized form. Currently a single statement selecting a single row from local_snapshots returns the whole snapshot, parents included. To replicate this with the modified schema above will require multiple statements. I think this is probably fine. It will look something like this:

Get the intrinsic snapshot information.
Get the direct parents.
Iterate that statement, using every parent_id found as the next child_id.
Get the intrinsic snapshot information for all of the parents.
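
As a minimal sketch of those steps, assuming a hypothetical two-table layout (a local_snapshots table with id and path columns, and a local_snapshot_parents table with child_id and parent_id columns; the exact columns are assumptions, not necessarily the schema proposed in this issue), the multi-statement lookup could be written with Python's sqlite3 module roughly like this:

```python
import sqlite3

# Hypothetical normalized schema. Only the names local_snapshots,
# local_snapshot_parents, child_id and parent_id come from the discussion;
# everything else here is an assumption made for illustration.
SCHEMA = """
CREATE TABLE local_snapshots (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL
);
CREATE TABLE local_snapshot_parents (
    child_id  INTEGER NOT NULL REFERENCES local_snapshots(id),
    parent_id INTEGER NOT NULL REFERENCES local_snapshots(id)
);
"""


def load_with_ancestors(conn, snapshot_id):
    """Return ({id: path}, [(child_id, parent_id), ...]) for one snapshot
    and every ancestor reachable from it."""
    cur = conn.cursor()
    edges = []
    seen = {snapshot_id}
    frontier = [snapshot_id]
    while frontier:
        child_id = frontier.pop()
        # Get the direct parents of this snapshot.
        cur.execute(
            "SELECT parent_id FROM local_snapshot_parents WHERE child_id = ?",
            (child_id,),
        )
        for (parent_id,) in cur.fetchall():
            edges.append((child_id, parent_id))
            # Each newly found parent_id becomes a child_id on a later pass.
            if parent_id not in seen:
                seen.add(parent_id)
                frontier.append(parent_id)
    # Get the intrinsic information for the snapshot and all of its ancestors.
    info = {}
    for sid in seen:
        cur.execute("SELECT id, path FROM local_snapshots WHERE id = ?", (sid,))
        row = cur.fetchone()
        if row is not None:
            info[row[0]] = row[1]
    return info, edges
```

This issues one query per snapshot for the parents and one per snapshot for the intrinsic data, which is the cost the following paragraph talks about reducing.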
The number of queries can be cut down with various tricks, such as putting the intermediate results into a temporary table or using IN against a literal set. Alternatively, some other property of the data might help. If the snapshot path is constant across the snapshot's lifetime (is it? how do renames work?) then you can take this shortcut:
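
A rough sketch of both ideas, assuming a hypothetical local_snapshots table with id and path columns (these column names are assumptions): one helper selects every snapshot for a path in a single statement, and another batches a set of ids into one IN query. The IN helper binds parameters rather than splicing a literal set into the SQL, which is a substitution made here for safety.

```python
import sqlite3


def snapshots_for_path(conn, path):
    """Return (id, path) rows for every snapshot stored under one path,
    using a single statement."""
    cur = conn.execute(
        "SELECT id, path FROM local_snapshots WHERE path = ? ORDER BY id",
        (path,),
    )
    return cur.fetchall()


def snapshots_by_ids(conn, ids):
    """Return (id, path) rows for a batch of ids with one IN query
    instead of one query per id."""
    ids = list(ids)
    if not ids:
        # "IN ()" is not portable SQL, so short-circuit the empty case.
        return []
    placeholders = ", ".join("?" for _ in ids)
    cur = conn.execute(
        "SELECT id, path FROM local_snapshots "
        "WHERE id IN (%s) ORDER BY id" % placeholders,
        ids,
    )
    return cur.fetchall()
```

SQLite caps the number of bound parameters per statement, so a very large batch would need to be split; with the snapshot counts discussed here that is unlikely to matter.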
Perhaps after first looking up the path from a snapshot id, if that's what you happen to have (but maybe having a path is going to be more common? I don't know). This query gives all of the intrinsic snapshot metadata for all snapshots with that path. Since they're for the same path, they must all have some parent/child relationship with each other, right? Then you can determine those relationships with a relatively simple join on local_snapshot_parents, which I won't try to implement here.

Likely there's another option where we flatten the graph into a field of the [...]

This is probably an irrelevant optimization. The other schemes should all work and be fast enough for the number of local snapshots we're realistically going to have. This might become more important when we start tracking remote snapshots in the database, though.
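
For reference, the join on local_snapshot_parents mentioned above might look roughly like the following. This is only a sketch under assumptions: it presumes a local_snapshots table with id and path columns and a local_snapshot_parents table with child_id and parent_id columns, and the function name is hypothetical.

```python
import sqlite3


def edges_for_path(conn, path):
    """Return every (child_id, parent_id) edge whose child snapshot is
    stored under the given path, using one joined statement."""
    cur = conn.execute(
        """
        SELECT p.child_id, p.parent_id
        FROM local_snapshot_parents AS p
        JOIN local_snapshots AS c ON c.id = p.child_id
        WHERE c.path = ?
        ORDER BY p.child_id, p.parent_id
        """,
        (path,),
    )
    return cur.fetchall()
```

Combined with a single query for the intrinsic rows of that path, this would reconstruct the whole graph for a path in two statements.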
I don't think there's any point optimizing any of this until we at least write down guesses of how many local snapshots there might be (or, better yet, gather numbers from real use). I would expect fewer than 100. Can we just recursively construct LocalSnapshots?

Regarding renames: do we have any requirements on that? Can we just do what git does, approximately? That is, there is no "rename"; you just add a new Snapshot with a different name, and if anything cares, it can work out whether it was a rename by matching the content-capability.
To be clear about the above, I'm suggesting: "there is no rename, but there is delete and create". GridSync could choose to present a file that disappears from [...]
I'm pretty happy to stop thinking about optimized schemas and queries at this point. I probably don't even want to really write the join-requiring version I mentioned above. I do want to normalize the schema, though. For what it's worth, "normal" means something specific and technical. To be precise, I want something that at least qualifies as first normal form, which mostly means that each column contains one value.
The reason I don't want to do this is that I know it is wrong in the long run, and dealing with it in the short run is almost as much work as fixing the schema. If we keep it, every snapshot interaction is going to have to deal with the recursive construction, and that's a lot harder than a flat construction.

The specific motivation for working on this now is that if we want to present a list of snapshots via the HTTP API, we should probably support pagination in that list (as discussed elsewhere, the list needs to include remote snapshots as well, so it won't be a short list for very long). Supporting pagination with a recursive construction is approximately as hard as flattening the in-db representation.

The next feature after that might be specifying a sort order (snapshots ordered by creation timestamp? snapshots ordered topologically? snapshots ordered by author? etc). All of these are harder to do on top of the recursive construction (except in something like O(n^2 log n) time) than flattening the in-db representation.
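
To illustrate why a flat representation makes pagination and ordering cheap, here is a minimal sketch. It assumes a hypothetical local_snapshots table with id, path, and created columns; the created column in particular is an assumption, since how metadata is stored is left open in this issue.

```python
import sqlite3


def list_snapshots_page(conn, limit, offset=0):
    """Return one page of (id, path, created) rows, ordered by creation
    timestamp with id as a tie-breaker so the order is stable."""
    cur = conn.execute(
        "SELECT id, path, created FROM local_snapshots "
        "ORDER BY created, id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cur.fetchall()
```

Changing the sort order (by author, for example) would be a matter of changing the ORDER BY clause, provided the relevant value lives in its own column. A topological order would still need the parent edges.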
Looks like you're right. There are no rename user stories in the design, so we can punt on this and fix the representation later if need be.
Also, with a proper schema, we can query on "Snapshots that have content-cap [...]
Currently we have this schema:
snapshot_blob holds all of the rest of the snapshot structure. Since this is all hidden inside a BLOB, it's effectively unqueryable at the SQL level. Also, because of the way the blob is constructed, most rows of the table contain redundant information (since the parent of each snapshot is serialized along with it).

Here's a different possible schema:
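
As an illustration only, a normalized layout along these lines could be sketched as follows. The table names local_snapshots and local_snapshot_parents and the child_id/parent_id columns appear in the discussion; the path and content_path columns, the primary keys, and the store_snapshot helper are assumptions made for this example.

```python
import sqlite3

# Hypothetical normalized schema: one row per snapshot and one row per
# parent/child edge, so no snapshot is serialized more than once.
SCHEMA = """
CREATE TABLE local_snapshots (
    id           INTEGER PRIMARY KEY,
    path         TEXT NOT NULL,
    content_path TEXT
);
CREATE TABLE local_snapshot_parents (
    child_id  INTEGER NOT NULL REFERENCES local_snapshots(id),
    parent_id INTEGER NOT NULL REFERENCES local_snapshots(id),
    PRIMARY KEY (child_id, parent_id)
);
"""


def store_snapshot(conn, path, content_path, parent_ids=()):
    """Insert one snapshot plus its parent edges and return the new id."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO local_snapshots (path, content_path) VALUES (?, ?)",
            (path, content_path),
        )
        snapshot_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO local_snapshot_parents (child_id, parent_id) "
            "VALUES (?, ?)",
            [(snapshot_id, parent_id) for parent_id in parent_ids],
        )
    return snapshot_id
```

Each column holds a single value, and a parent is referenced by id rather than copied into the child's row.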
This leaves out metadata. Metadata could be represented in a few ways:
(snapshot.created vs snapshot.metadata[u"ctime"]).