
Remove unique constraints on File digests #14717

Open

Description

@dstufft

We currently have unique constraints on the md5, sha256, and blake2_256 digests for files.

As far as I can tell, the reason for this is that legacy PyPI had a unique constraint on md5, and whenever we added additional digests we simply copied what md5 did and included a unique constraint on them as well.

However, I think this constraint is actually pretty low value for us: as far as I can tell, none of our security guarantees relies on a hash being unique across all of PyPI.

The closest thing I can find that could fall into that category is that we use the blake2_256 hash when constructing the path for storing files; however, that path also includes the actual filename, which is itself guaranteed to be unique. This means that even if two files have the exact same digest, they'll still end up stored at different paths because they'll have different filenames.
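
For illustration, a minimal sketch of that kind of path construction (the helper name is mine, and the exact sharding Warehouse uses may differ):

```python
def make_storage_path(blake2_256_digest: str, filename: str) -> str:
    # Shard on the digest, but keep the (unique) filename as the final
    # path component, so two files with identical digests but different
    # filenames still land at different paths.
    return "/".join(
        [
            blake2_256_digest[:2],
            blake2_256_digest[2:4],
            blake2_256_digest[4:],
            filename,
        ]
    )
```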

We do gain some "minor" benefit from this, in that people who want to upload the same artifact twice under different filenames are forced to mutate the file to do so. The most common case is repacking wheels for different platforms, which should produce different digests because the WHEEL metadata file should be updated; if it isn't, the upload currently fails as a duplicate file.

This benefit is really minor though, because we're not actually validating that they're doing it correctly, just that they've changed... something. If we really want to ensure the WHEEL metadata is correct, then we should just validate that directly.
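
Warehouse doesn't do that today, but as a rough sketch (the function name and details here are my own invention), direct validation could mean checking that the tags declared in the WHEEL file actually cover the tags the filename claims:

```python
import zipfile
from email.parser import Parser

from packaging.tags import parse_tag
from packaging.utils import parse_wheel_filename


def wheel_tags_match_filename(wheel_path: str, filename: str) -> bool:
    # The tags the filename claims; compressed tag sets like
    # "py2.py3-none-any" expand to {py2-none-any, py3-none-any}.
    *_, filename_tags = parse_wheel_filename(filename)

    with zipfile.ZipFile(wheel_path) as zf:
        # Locate the WHEEL metadata file inside the .dist-info directory.
        wheel_file = next(
            n for n in zf.namelist() if n.endswith(".dist-info/WHEEL")
        )
        # WHEEL is an RFC 822-style file with one or more "Tag:" lines.
        metadata = Parser().parsestr(zf.read(wheel_file).decode())

    declared_tags = set()
    for line in metadata.get_all("Tag") or []:
        declared_tags |= parse_tag(line)

    # Every tag advertised by the filename must be declared in WHEEL.
    return set(filename_tags) <= declared_tags
```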

Having unique constraints on these values does have some minor downsides: we have to validate that the file isn't a duplicate both by filename and by digest, and we can't actually do that in 100% of cases unless we delay the duplicate check until after we've buffered the file to local disk and computed our own hashes for it.

We could move this check earlier, except that not all clients provide all digests; to protect against duplicate file errors we have to check every digest, so we have to compute them ourselves.
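
Computing those digests means a full pass over the file contents. A minimal sketch (the function name is mine, not Warehouse's):

```python
import hashlib


def compute_digests(fileobj, chunk_size=8192):
    # blake2b with a 32-byte digest is the "blake2_256" used elsewhere.
    hashers = {
        "md5": hashlib.md5(),
        "sha256": hashlib.sha256(),
        "blake2_256": hashlib.blake2b(digest_size=32),
    }
    # Stream the file in chunks so large uploads don't need to fit in memory.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        for h in hashers.values():
            h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```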

We're also paying the cost of maintaining indexes on those columns when we otherwise don't need to.
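
For concreteness, the change itself would be a migration along these lines. This is only a sketch: the constraint names are assumptions and would need to match the actual names in the database.

```python
from alembic import op


def upgrade():
    # In PostgreSQL, dropping a unique constraint also drops the unique
    # index that backs it, so this removes the maintenance cost too.
    op.drop_constraint("release_files_md5_digest_key", "release_files", type_="unique")
    op.drop_constraint("release_files_sha256_digest_key", "release_files", type_="unique")
    op.drop_constraint("release_files_blake2_256_digest_key", "release_files", type_="unique")
```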

The one place I can find where we treat the digest as an identifier is when we're syncing the file distribution metadata to BigQuery: we treat the md5 digest as a unique identifier and only sync files whose md5 digest doesn't match one that's already in BigQuery. I believe we can trivially update that query to treat (filename, digest) as the unique identifier instead of just the digest, which I think better matches reality anyway.
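
As a rough sketch of what that change could look like (the table name and the surrounding code are assumptions, not the actual sync task):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Fetch the (filename, md5_digest) pairs already synced. The dataset/table
# here is an assumption; the real sync task's query will differ.
rows = client.query(
    "SELECT filename, md5_digest FROM `the-psf.pypi.distribution_metadata`"
).result()
already_synced = {(row["filename"], row["md5_digest"]) for row in rows}

# Key on the composite (filename, digest) instead of the digest alone, so
# two files that happen to share a digest but differ in filename both sync.
to_sync = [
    f
    for f in unsynced_candidates  # hypothetical iterable of File rows
    if (f.filename, f.md5_digest) not in already_synced
]
```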

Opening this issue up to give folks a place to weigh in if they think this is a bad (or good) idea.

Metadata


Assignees

No one assigned

Labels

data quality, security

Projects

No projects

Milestone

No milestone
