Normalize the `local_snapshots` database schema #197

exarkun · 2020-07-10T13:55:38Z

Currently we have this schema:

CREATE TABLE local_snapshots
(
 path               TEXT PRIMARY KEY,             -- UTF-8 relative filepath that the snapshot represents
 snapshot_blob      BLOB                          -- a JSON blob representing the snapshot instance
);

snapshot_blob holds all of the rest of the snapshot structure. Since this is all hidden inside a BLOB it's effectively unqueryable at the SQL level. Also, because of the way the blob is constructed, most rows of the table contain redundant information (since the parent of each snapshot is serialized along with it).

Here's a different possible schema:

CREATE TABLE local_snapshots
(
 id                    INTEGER PRIMARY KEY -- unique snapshot identifier
 path               TEXT,             -- relative filepath that the snapshot represents
 author            TEXT,             -- the author name
 content_path TEXT,             -- stashed content filesystem path 
);

CREATE TABLE local_snapshot_parents
(
 child_id    INTEGER -- the local_snapshots.id of the child in the relationship
 parent_id INTEGER -- the local_snapshots.id of the parent in the relationship
);

This leaves out metadata. Metadata could be represented in a few ways:

As an opaque blob in the local_snapshots table. This has the same drawbacks as the previous opaque blob but at least it's restricted to metadata.
As first-class columns in the local_snapshots table. This is most useful from the SQL perspective. If metadata is relatively fixed (changing from release to release, perhaps, but not from snapshot to snapshot or user to user) then this is pretty good. Things like "created date" fit well here.
As elements in a separate local_metadata table with two columns - "key" and "value". This is a kind of compromise. It's queryable though only awkwardly but still allows fully dynamic metadata.
Some hybrid of the two - for example, keeping everything we want to query on as columns in the local_snapshots table and shoving the rest into a JSON blob. These two kinds of representation could be fused/split to obscure the storage representation choice from the Python application layer - unless we thought preserving it would improve the Python code (which it might... consider snapshot.created vs snapshot.metadata[u"ctime"]).

The text was updated successfully, but these errors were encountered:

exarkun · 2020-07-10T13:56:09Z

This is sort of a blocker for #106

exarkun · 2020-07-10T17:20:36Z

Here's a downside of the normalized form. Currently a single statement selecting a single row from local_snapshots returns complete information to construct the Python representation of that snapshot - including parents and their parents and so on as far as necessary.

To replicate this with the modified schema above will require multiple statements. I think this is probably fine. It will look something like this:

Get the intrinsic snapshot information:

SELECT * FROM local_snapshots WHERE id = ?

Get the direct parents:

SELECT parent_id FROM local_snapshot_parents WHERE child_id = ?

Iterate that statement, using every parent_id found as the next child_id

Get the intrinsic snapshot information for all of the parents:

SELECT * FROM local_snapshots WHERE id = ?

The number of queries can be cut down with various tricks ... putting the intermediate results into a temporary table or using IN against a literal set.

Alternatively, some other property of the data might help. If the snapshot path is constant across the snapshot's lifetime (is it? how do renames work?) then you can take this shortcut:

SELECT * FROM local_snapshots WHERE path = ?

Perhaps after first looking up the path from a snapshot id, if that's what you happen to have (but maybe having a path is going to be more common? I don't know.)

This query gives all of the intrinsic snapshot metadata for all snapshots. Since they're for the same path, they must all have some parent/child relationship with each other, right?

Then you can determine those relationships with a relatively simple join on local_snapshot_parents which I won't try to implement here.

Likely there's another option where we flatten the graph into a field of the local_snapshots or local_snapshot_parents table but if we want to support N-M parent-child relationships then I think that flattening is more complex than I want to think about (I guess you'd need a row for every path from every root to all its descendants and then some way to match against the flattened text ...).

This is probably an irrelevant optimization. The other schemes should all work and be fast enough for the number of local snapshots we're realistically going to have. This might become more important when we start tracking remote snapshots in the database, though.

meejah · 2020-07-13T18:27:25Z

I don't think there's any point optimizing any of this until we at least write down guesses of how many local snapshots there might be (or, better yet, gather numbers from real use). I would expect less than 100.

Can we just recursively construct LocalSnapshots?

Regarding renames: do we have any requirements on that? Can we just do what git does, approximately? (That is, there is no "rename" you just add a new Snapshot with a different name ... and if anything cares, it can work out if it was a rename by matching the content-capability).

meejah · 2020-07-13T18:49:52Z

To be clear about the above, I'm suggesting: "there is no rename, but there is delete and create". GridSync could choose to present a file that disappears from /foo/bar and re-appears as /quux/ding as a rename (if e.g. the content-capability is the same). In case you rename and than make changes before a snapshot happens: 🤷

exarkun · 2020-07-13T20:35:32Z

I don't think there's any point optimizing any of this until we at least write down guesses of how many local snapshots there might be (or, better yet, gather numbers from real use). I would expect less than 100.

I'm pretty happy to stop thinking about optimized schemas and queries at this point. I probably don't even want to really write the join-requiring version I mentioned above. I do want to normalize the schema though. For what it's worth, "normal" means something specific and technical. So, really, to be precise, I want something that at least qualifies as 1st normal form which mostly means that each column contains one value.

Can we just recursively construct LocalSnapshots?

The reason I don't want to do this is that it I know it is wrong in the long run and dealing with it in the short run is almost as much work as fixing the schema. If we keep it, every snapshot interaction is going to have to deal with the recursive construction and that's a lot harder than a flat construction. The specific motivation for working on this now is that if we want to present a list of snapshots via the HTTP API, we should probably support pagination in that list (as discussed elsewhere, the list needs to include remote snapshots as well, so it won't be a short list for very long). Supporting pagination with a recursive construction is approximately as hard as flattening the in-db representation. The next feature after that might be specifying a sort order (snapshots ordered by creation timestamp? snapshots ordered topologically? snapshots ordered by author? etc). All of these are harder (except in something like O(n^2 log n) time) to do than flattening the in-db representation.

Regarding renames: do we have any requirements on that? Can we just do what git does, approximately? (That is, there is no "rename" you just add a new Snapshot with a different name ... and if anything cares, it can work out if it was a rename by matching the content-capability).

Looks like you're right. There are no rename user stories in the design. So we can punt on this and fix the representation later if needs be.

meejah · 2020-07-13T21:47:09Z

So we can punt on this and fix the representation later if needs be.

Also, with a proper schema, we can query on "Snapshots that have content-cap X" 😉 (well, remote snapshots that match, anyway...)

exarkun added this to the Open Technology Fund - Phase 2 - Version Control milestone Jul 10, 2020

exarkun mentioned this issue Aug 26, 2020

normalize snapshot schema #271

Merged

exarkun closed this as completed in #271 Aug 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Normalize the `local_snapshots` database schema #197

Normalize the `local_snapshots` database schema #197

exarkun commented Jul 10, 2020

exarkun commented Jul 10, 2020

exarkun commented Jul 10, 2020

meejah commented Jul 13, 2020

meejah commented Jul 13, 2020

exarkun commented Jul 13, 2020

meejah commented Jul 13, 2020

Normalize the local_snapshots database schema #197

Normalize the local_snapshots database schema #197

Comments

exarkun commented Jul 10, 2020

exarkun commented Jul 10, 2020

exarkun commented Jul 10, 2020

meejah commented Jul 13, 2020

meejah commented Jul 13, 2020

exarkun commented Jul 13, 2020

meejah commented Jul 13, 2020

Normalize the `local_snapshots` database schema #197

Normalize the `local_snapshots` database schema #197