
Remove unique constraints on File digests #14717

Open

Description

@dstufft

We currently have unique constraints on the md5, sha256, and blake2_256 digests for files.

As far as I can tell, the reason for this is that legacy PyPI had a unique constraint on md5, and whenever we added additional digests we simply copied what md5 did and included a unique constraint on them as well.

However, I think this constraint is actually pretty low value for us: as far as I can tell, none of our security guarantees relies on a hash being unique across all of PyPI.

The closest thing I can find that could fall into that category is that we use the blake2_256 hash when constructing the path for storing files; however, that path also includes the actual filename, which is itself guaranteed to be unique. This means that even if two files have the exact same digest, they'll still end up stored at different paths because they'll have different filenames.
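
For illustration, a minimal sketch of that kind of path construction (the helper name is mine, and the exact sharding Warehouse uses may differ):

```python
def make_storage_path(blake2_256_digest: str, filename: str) -> str:
    # Shard on the digest, but keep the (unique) filename as the final
    # path component, so two files with identical digests but different
    # filenames still land at different paths.
    return "/".join(
        [
            blake2_256_digest[:2],
            blake2_256_digest[2:4],
            blake2_256_digest[4:],
            filename,
        ]
    )
```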

We do gain some "minor" benefit from this, in that people who want to upload the same artifact twice under different filenames are forced to mutate the file to do so. The most common case is repacking wheels for different platforms, which should produce different digests because the WHEEL metadata file should be updated; if it isn't, the upload currently fails as a duplicate file.

This benefit is really minor though, because we're not actually validating that they're doing it correctly, just that they've changed... something. If we really want to ensure the WHEEL metadata is correct, then we should just validate that directly.
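
Warehouse doesn't do that today, but as a rough sketch (the function name and details here are my own invention), direct validation could mean checking that the tags declared in the WHEEL file actually cover the tags the filename claims:

```python
import zipfile
from email.parser import Parser

from packaging.tags import parse_tag
from packaging.utils import parse_wheel_filename


def wheel_tags_match_filename(wheel_path: str, filename: str) -> bool:
    # The tags the filename claims; compressed tag sets like
    # "py2.py3-none-any" expand to {py2-none-any, py3-none-any}.
    *_, filename_tags = parse_wheel_filename(filename)

    with zipfile.ZipFile(wheel_path) as zf:
        # Locate the WHEEL metadata file inside the .dist-info directory.
        wheel_file = next(
            n for n in zf.namelist() if n.endswith(".dist-info/WHEEL")
        )
        # WHEEL is an RFC 822-style file with one or more "Tag:" lines.
        metadata = Parser().parsestr(zf.read(wheel_file).decode())

    declared_tags = set()
    for line in metadata.get_all("Tag") or []:
        declared_tags |= parse_tag(line)

    # Every tag advertised by the filename must be declared in WHEEL.
    return set(filename_tags) <= declared_tags
```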

Having unique constraints on these values does have some minor downsides: we have to validate that the file isn't a duplicate both by filename and by digest, and we can't actually do that in 100% of cases unless we delay the duplicate check until after we've buffered the file to local disk and computed our own hashes for it.

We could move this check earlier, except that not all clients provide all digests; to protect against duplicate file errors we have to check every digest, so we have to compute them ourselves.
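
Computing those digests means a full pass over the file contents. A minimal sketch (the function name is mine, not Warehouse's):

```python
import hashlib


def compute_digests(fileobj, chunk_size=8192):
    # blake2b with a 32-byte digest is the "blake2_256" used elsewhere.
    hashers = {
        "md5": hashlib.md5(),
        "sha256": hashlib.sha256(),
        "blake2_256": hashlib.blake2b(digest_size=32),
    }
    # Stream the file in chunks so large uploads don't need to fit in memory.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        for h in hashers.values():
            h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```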

We're also paying the cost of maintaining indexes on those columns when we otherwise don't need to.
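
For concreteness, the change itself would be a migration along these lines. This is only a sketch: the constraint names are assumptions and would need to match the actual names in the database.

```python
from alembic import op


def upgrade():
    # In PostgreSQL, dropping a unique constraint also drops the unique
    # index that backs it, so this removes the maintenance cost too.
    op.drop_constraint("release_files_md5_digest_key", "release_files", type_="unique")
    op.drop_constraint("release_files_sha256_digest_key", "release_files", type_="unique")
    op.drop_constraint("release_files_blake2_256_digest_key", "release_files", type_="unique")
```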

The one place I can find where we treat the digest as an identifier is when we're syncing the file distribution metadata to BigQuery: we treat the md5 digest as a unique identifier and only sync files whose md5 digest doesn't match one that's already in BigQuery. I believe we can trivially update that query to treat (filename, digest) as the unique identifier instead of just the digest, which I think better matches reality anyway.
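
As a rough sketch of what that change could look like (the table name and the surrounding code are assumptions, not the actual sync task):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Fetch the (filename, md5_digest) pairs already synced. The dataset/table
# here is an assumption; the real sync task's query will differ.
rows = client.query(
    "SELECT filename, md5_digest FROM `the-psf.pypi.distribution_metadata`"
).result()
already_synced = {(row["filename"], row["md5_digest"]) for row in rows}

# Key on the composite (filename, digest) instead of the digest alone, so
# two files that happen to share a digest but differ in filename both sync.
to_sync = [
    f
    for f in unsynced_candidates  # hypothetical iterable of File rows
    if (f.filename, f.md5_digest) not in already_synced
]
```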

Opening this issue up to give folks a place to weigh in if they think this is a bad (or good) idea.

Metadata


Assignees

No one assigned

Labels

data quality, security

Projects

No projects

Milestone

No milestone
