Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Decide on DB schema + implement #25

Closed
olizilla opened this issue Nov 17, 2022 · 9 comments · Fixed by #48
Closed

Decide on DB schema + implement #25

olizilla opened this issue Nov 17, 2022 · 9 comments · Fixed by #48
Assignees
Labels
P0 Critical: Tackled by core team ASAP
Milestone

Comments

@olizilla
Copy link
Contributor

Currently, uploaderDID tracks the resource that an upload is invoked with. When we are using spaces, the with is the space DID. so UploaderDID is the space DID which feels wrong.

see: https://github.com/web3-storage/upload-api/blob/75f0b5327d3dc2cface553bf60f9e9ebc17598c9/api/test/service/store.test.js#L85-L86

@alanshaw alanshaw added the P0 Critical: Tackled by core team ASAP label Nov 22, 2022
@alanshaw alanshaw changed the title uploaderDID should track the DID that issued the invocation Decide on DB schema + implement Nov 22, 2022
@olizilla
Copy link
Contributor Author

If we focus the schema change to just include the properties we need to implement the capabilities, and updating names to be consistent with our new terminology then we end up with:

store table

ucan terminology flavour - deliberately stick to the language of the invocation we are capturing

issuer resource link size origin? ucanCID insertedAt
did:key:space did:key:space bagy...1 101 bagy...0 baf...ucan 2022-11-23T...
string string string number string string string (ISO 8601)
agent DID space DID stored CAR CID CAR size bytes prev shard CAR CID invocation CID wen db record

domain mapped flavour - name columns after how we expect them to be commonly used.

agentDID spaceDID carCID size origin? ucanCID insertedAt
did:key:space did:key:space bagy...1 101 bagy...0 baf...ucan 2022-11-23T...
string string string number string string string (ISO 8601)
agent DID space DID stored CAR CID CAR size bytes prev shard CAR CID invocation CID wen db record

notes

ucanCID is the CID for the complete, validated UCAN invocation that was sent to us and caused this row to be inserted. The current proposal is to store the bytes of that UCAN in s3 keyed by it's (raw) CID.

open questions

  • which flavour do you like
  • store size as number or string
  • what other queries outside of those needed to implement the capabilities do we want to support... see @gobengo comment...

Did some looking around on how we might try to aggregate the entries of upload-api 'store' entries into a per-customer sum of total amount stored that we can use to calculate billing. Especially given that upload-api currently uses DynamoDB and that DynamoDB querying doesn't inherently support aggregation queries like sum , it's not as easy as if we used a different datastore.
– https://filecoinproject.slack.com/archives/C02BZPRS9HP/p1669074957770049

@alanshaw
Copy link
Member

  • domain mapped
  • number - no single upload will be bigger than max safe JS number
  • queries:
    • flag for CAR receivedAt?
    • deletedAt?

@gobengo
Copy link
Contributor

gobengo commented Nov 23, 2022

wrt the thing I mentioned about billing. From what I grok about the way the store protocol works, I think the aggregate we'll probably need is

  • batch - "how many bytes are stored by space X"? If possible, it would be nice to be able to answer that across many spaces at once (e.g. select sum(bytes) from uploads group by space)
  • optional real-time - an event on every store/add or store/remove so we can update the usage-based billing service in near-real-time (which will give us more ability to avoid under/over-charging).
    • dynamodb streams can definitely do this pretty easily

FWIW here is the 'usage reports' api stripe has, which we'll want to to update if we use stripe as the usage-based billing provider.

@alanshaw alanshaw added this to the w3up phase 1 milestone Nov 24, 2022
@olizilla
Copy link
Contributor Author

olizilla commented Nov 24, 2022

The proposed dynamodb schema:

/** @type import('@serverless-stack/resources').TableProps */
export const storeTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    car: 'string',          // `bagy...1`
    size: 'number',         // `101`
    origin: 'string',       // `bagy...0` (prev CAR CID. optional)
    agent: 'string',        // `did:key:agent` (issuer of ucan)
    ucan: 'string',         // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + car must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'car' },
}

/** @type import('@serverless-stack/resources').TableProps */
export const uploadTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    sk: 'string',           // `root#shard` + space must be unique for dynamo index constraint
    root: 'string',         // `baf...x`
    shard: 'string',        // `bagy...1
    agent: 'string',        // `did:key:agent` (issuer of ucan)
    ucan: 'string',         // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + sk must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'sk' },
}
  • uploaderDID becomes space
  • dropped DID & CID suffixes
  • align property names with the upload-client, our capability properties and the access-api data model
  • ordered such the partitionKey and sortKey come first (not that it matters) and our accounting metadata agent (who did it), ucan (CID to link to bytes of the actual invocation), and insertedAt come last.
  • use insertedAt as all we are really capturing here is the time at which we inserted the doc into the db.
  • we can add more fields. Its a document db. The only fields we have to specify here are ones referenced in an index, but more fields are given here to give the flavour.

References

@Gozala
Copy link
Contributor

Gozala commented Nov 24, 2022

Few notes:

  1. I think we need at least one more field for provider DID. Initially we don’t plan on allowing more than one storage provider, but we do want that long term. So we’d want to track it.
  2. I’m not sure agentDID is better than issuer because there’s no way to enforce that invariant and users can issue invocations with spaceDID instead.
    • I’m also not sure why do we even want to store this outside of UCAN itself. Are there any queries that would require it ?
  3. I think ucanCID / ucan shod be called invocation instead, because that would describe semantics, while UCAN describes formatting.
  4. Reason I chose not to use car as field in capabilities is because it again describes format and not semantics. We may also add support for different formats in the future so calling things by format isn’t ideal IMO. I would use same name as in capability, if that name is bad we probably should rename that in capability as well.
  5. Do we store invocations that were proved invalid ? Or denied because storage limits?
  6. We should also store response CIDs somewhere.
  7. We may get valid requests to store, but user may never actually upload to S3 don’t know how and where to track this, but we should somewhere. Also pre-signed URLs may expire and probably we should note when / if it does

@olizilla
Copy link
Contributor Author

I think we need at least one more field for provider DID...

Yep! I'm gonna leave that out from this iteration, and add it in once we tackle providers. We can add later without needing a migration.

I’m not sure agentDID is better than issuer because there’s no way to enforce that invariant

I agree. I'm happy to call it issuer

I’m also not sure why do we even want to store this (the issuer) outside of UCAN itself

I was assuming some need to show "which identity actually invoked the thing" but we could drop it for now.

I think ucanCID / ucan should be called invocation instead

I can deal with calling it invocation. ucan is so short and neat tho.

Reason I chose not to use car as field in capabilities is because it again describes format and not semantics. We may also add support for different formats in the future so calling things by format isn’t ideal IMO. I would use same name as in capability, if that name is bad we probably should rename that in capability as well.

I think folks are struggling with link as the term for a CIDlike thing. It's very very generic, where CID was clear to folks what they were dealing with. A case could be made for renaming it, but I don't know if there is appetite for more renames.

However, I'm generally in favour of having the db labels match the labels in the invocation wherever sensible. I'll rename it link here and we can see if folks want to discuss if link should change in the capability, and I'll adopt the outcome of that, if/when it happens.

Do we store invocations that were proved invalid ? Or denied because storage limits?

not currently but i think we should track that somewhere. Do you have a use-case in mind? My assumption we'd want this for at least tracking metrics, and debugging. Would an aggregated count here be sufficient or would we want dump them in a bucket, or track them in a full on db table?

We should also store response CIDs somewhere.

Can do! Is the use-case here analytics and debugging? or something more?

We may get valid requests to store, but user may never actually upload to S3 don’t know how and where to track this, but we should somewhere. Also pre-signed URLs may expire and probably we should note when / if it does

ACK. Yes, tracking "when does the ObjectCreated event occur in s3" is on my radar. Tracking the expire time for the URL is a good idea!

@olizilla
Copy link
Contributor Author

That gives us an updated dynamodb schema like so:

/** @type TableProps */
export const storeTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    link: 'string',         // `bagy...1`
    size: 'number',         // `101`
    origin: 'string',       // `bagy...0` (prev CAR CID. optional)
    issuer: 'string',       // `did:key:agent` (issuer of ucan)
    invocation: 'string',   // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + car must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'link' },
}

/** @type TableProps */
export const uploadTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    sk: 'string',           // `root#shard` + space must be unique for dynamo index constraint
    root: 'string',         // `baf...x`
    shard: 'string',        // `bagy...1
    issuer: 'string',       // `did:key:agent` (issuer of ucan)
    invocation: 'string',   // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + sk must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'sk' },
}

@Gozala
Copy link
Contributor

Gozala commented Nov 25, 2022

I think folks are struggling with link as the term for a CIDlike thing. It's very very generic, where CID was clear to folks what they were dealing with. A case could be made for renaming it, but I don't know if there is appetite for more renames.

However, I'm generally in favour of having the db labels match the labels in the invocation wherever sensible. I'll rename it link here and we can see if folks want to discuss if link should change in the capability, and I'll adopt the outcome of that, if/when it happens.

I agree link is not a great name & in fact fails to describe semantics. Maybe content is better ?

we also have an issue on file suggesting that small cars should be inlined instead, in which case link (and cid) would be even less meaningful.

Yep! I'm gonna leave that out from this iteration, and add it in once we tackle providers. We can add later without needing a migration.

Sounds good! Just to be clear though, it will affect primary index & we bill who hired a provider.

I can deal with calling it invocation. ucan is so short and neat tho.

I think alternatively it could be called source which is bit shorter & still meaningful.

In regards to other info I brought up. I don’t have specific use cases, it’s just all these invocations have certain state that we don’t seem to capture. It would be difficult to tell what the state of the individual invocation is, or how did invocation progressed through the system.

we also wanted to include some info in the receipt so introspection seems important

olizilla added a commit that referenced this issue Nov 28, 2022
Update the dynamoDB table schema to the following:

```js
/** @type TableProps */
export const storeTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    link: 'string',         // `bagy...1`
    size: 'number',         // `101`
    origin: 'string',       // `bagy...0` (prev CAR CID. optional)
    issuer: 'string',       // `did:key:agent` (issuer of ucan)
    invocation: 'string',   // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + link must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'link' },
}

/** @type TableProps */
export const uploadTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    sk: 'string',           // `root#shard` + space must be unique for dynamo index constraint
    root: 'string',         // `baf...x`
    shard: 'string',        // `bagy...1
    issuer: 'string',       // `did:key:agent` (issuer of ucan)
    invocation: 'string',   // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + sk must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'sk' },
}
```


- `uploaderDID` becomes `space`
- `payloadCID` becomes `link` to match the capability property
- `dataCID` becomes `root` to match the terminology in the
upload-client.
- `carCID` becomes `shard` to match the terminology in the
upload-client.
- adds `issuer` - the DID of the agent that issued the invocation
- adds `invocation` - the CID for the invocation ucan
- use `insertedAt` as all we are really capturing here is the time at
which we inserted the doc into the db.
- align property names with the `upload-client`, our capability
properties and the `access-api` data model
- ordered such the partitionKey and sortKey come first (not that it
matters) and our accounting metadata last
- Fixed up types 🎉

...we can add more fields. Its a document db. The only fields we have to
specify here are ones referenced in an index, but more fields are given
here to give the flavour.

### References

- https://www.dynamodbguide.com/key-concepts
- w3 ucan caps
https://github.com/web3-storage/w3protocol/blob/aef55e718eb8756ed95beb237780ef7125b3aace/packages/access/src/capabilities/upload.js#L37
- w3-access api db schema
https://github.com/web3-storage/w3protocol/blob/cc0f4f44c06a5884c6af50cf739dad6164d8e468/packages/access-api/migrations/0000_create_spaces_table.sql
- upload-client naming
https://github.com/web3-storage/w3protocol/blob/aef55e718eb8756ed95beb237780ef7125b3aace/packages/upload-client/README.md#uploadadd
- dynamodb data types
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes
- current w3up schema
https://github.com/web3-storage/w3up/wiki/(current)-Ingestion-Database-Schema
- previous proposed schema update
https://github.com/web3-storage/w3up/wiki/Desired-Ingestion-Schema

fixes #25 

License: MIT
Signed-off-by: Oli Evans <oli@protocol.ai>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
P0 Critical: Tackled by core team ASAP
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants