Replace validation by download with content MD5 (#65)
mephenor authored Jan 10, 2025
1 parent ebc19f0 commit 3da0419
Showing 24 changed files with 2,082 additions and 1,795 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -45,7 +45,7 @@ repos:
- id: no-commit-to-branch
args: [--branch, dev, --branch, int, --branch, main]
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.7.4
rev: v0.8.2
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
2 changes: 1 addition & 1 deletion .pyproject_generation/pyproject_custom.toml
@@ -1,6 +1,6 @@
[project]
name = "ghga_datasteward_kit"
version = "4.4.1"
version = "4.5.0"
description = "GHGA Data Steward Kit - A utils package for GHGA data stewards."
dependencies = [
"crypt4gh >=1.6, <2",
2 changes: 2 additions & 0 deletions .pyproject_generation/pyproject_template.toml
@@ -40,6 +40,7 @@ fixable = [
"UP", # e.g. List -> list
"I", # sort imports
"D", # pydocstyle
"RUF022", # sort items in __all__
]
ignore = [
"E111", # indentation with invalid multiple (for formatter)
@@ -59,6 +60,7 @@ ignore = [
"D206", # indent-with-spaces (for formatter)
"D300", # triple-single-quotes (for formatter)
"UP040", # type statement (not yet supported by mypy)
"PLC0206", # Extracting value from dictionary without calling `.items()`
]
select = [
"C90", # McCabe Complexity
33 changes: 25 additions & 8 deletions README.md
@@ -67,7 +67,9 @@ This is achieved with the data steward kit, using the following steps:
where the command is prefixed with `legacy-`.
Please see [this section](#files-batch-upload) for further details. This will output
one summary JSON per uploaded file. The encryption secret is automatically
transferred to GHGA central.
transferred to GHGA central for the normal upload path.
For the legacy version of the commands, the encryption secret is instead included in the
summary JSON and is exchanged for a secret during ingest.

Once the upload of all files of a submission has completed, please notify the GHGA
Central Data Steward and provide the summary JSONs obtained in step 2.
@@ -120,11 +122,13 @@ The following paragraphs provide additional help for using the different command
This command facilitates encrypting files using Crypt4GH and uploading the encrypted
content to a (remote) S3-compatible object storage.
This process consists of multiple steps:
1. Generate a unique file id
2. Create unencrypted file checksum
3. Encrypt and upload file in chunks
4. Download encrypted file content, decrypt and verify checksum
5. Write file/upload information to output file
1. Generate a unique file/object id for S3
2. Create a SHA256 checksum for the unencrypted file
3. Encrypt and upload the file in chunks, generating checksums for the encrypted parts;
   verify successful transfer using the content MD5 on the part level
4. Compute the content MD5 of the assembled object locally and compare it with the value
   reported by S3 (see the sketch after this list)
5. Decrypt the local parts, generate a SHA256 checksum, and compare it with the checksum
   calculated in step 2
6. Write file/upload information to the output file

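As a rough illustration of the content MD5 comparison in steps 3 and 4, the sketch below recomputes a multipart object's content MD5 locally. It assumes AWS-style ETag semantics for unencrypted multipart uploads (the hex MD5 of the concatenated per-part MD5 digests, suffixed with the part count); the helper names and the part size are illustrative assumptions, not the kit's actual internals.

```python
import hashlib
from pathlib import Path

PART_SIZE = 16 * 1024**2  # assumed part size; the real value comes from config


def part_md5s(path: Path, part_size: int = PART_SIZE) -> list[bytes]:
    """Compute the raw MD5 digest of each part of a local file."""
    digests = []
    with path.open("rb") as source:
        while chunk := source.read(part_size):
            digests.append(hashlib.md5(chunk).digest())
    return digests


def multipart_content_md5(path: Path, part_size: int = PART_SIZE) -> str:
    """Combine the per-part digests the way AWS S3 builds a multipart ETag:
    the hex MD5 over the concatenated part digests, plus the number of parts.
    """
    digests = part_md5s(path, part_size)
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


# Hypothetical usage: compare against the ETag reported by the storage
# backend after the multipart upload was completed.
# local = multipart_content_md5(Path("file.c4gh"))
# assert local == remote_etag.strip('"'), "content MD5 mismatch"
```

Note that S3-compatible storages are not required to use this ETag scheme (server-side encryption changes it, for instance), which is one reason the per-part content MD5 check in step 3 is useful independently of the assembled-object comparison.
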
The user needs to provide a config yaml containing information as described
[here](./s3_upload_config.md).
@@ -146,6 +150,15 @@ It contains the following information:
Attention: Keep this output file in a safe, private location.
If this file is lost, the uploaded file content becomes inaccessible.

#### About concurrency

In addition to the existing batch upload command, which parallelizes processing and transfer on the file level, v4.3.0 added an asynchronous task handler that parallelizes uploads and downloads on the file part level.

Since validation no longer re-downloads the uploaded file but instead compares content MD5 checksums, there is currently no clear preference for either mode of parallelism, as the current changes have not yet been benchmarked.

By default, part-level parallelism is controlled by the `client_max_parallel_transfers` option, which defaults to 10. To disable it, the value has to be explicitly set to 1; a minimal sketch of the mechanism follows below.
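
For illustration, the sketch below shows the semaphore pattern that such a part-level limit typically maps to; the function names and part handling are hypothetical placeholders, not the kit's actual implementation.

```python
import asyncio

MAX_PARALLEL_TRANSFERS = 10  # mirrors the client_max_parallel_transfers default


async def transfer_part(part_number: int, data: bytes) -> None:
    """Placeholder for the actual part upload or download call."""
    await asyncio.sleep(0)  # real code would perform the HTTP transfer here


async def transfer_all_parts(parts: list[bytes]) -> None:
    """Run part transfers concurrently, never exceeding the configured limit.

    A limit of 1 serializes the transfers, i.e. it disables part-level
    parallelism as described above.
    """
    semaphore = asyncio.Semaphore(MAX_PARALLEL_TRANSFERS)

    async def bounded(number: int, data: bytes) -> None:
        async with semaphore:
            await transfer_part(number, data)

    await asyncio.gather(
        *(bounded(number, data) for number, data in enumerate(parts, start=1))
    )


# asyncio.run(transfer_all_parts([b"part one", b"part two"]))
```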

### files ingest-upload-metadata

*To be performed by Central Data Stewards only.*
@@ -156,8 +169,12 @@ running system and make the corresponding files available for download.

This command requires a configuration file as described [here](./ingest_config.md).

#### ingest version compatibility
Currently v4.4.0 of this tool and v4.0.0 of the `File Ingest Service` are compatible.
#### Ingest version compatibility

| Data Steward Kit Version | File Ingest Service Version |
| :---------------------: | :-------------------------: |
| >=4.5.0 | >=5.0.0 |
| >=4.4.0, <4.5.0 | >=4.0.0, <5 |

### metadata

1,113 changes: 560 additions & 553 deletions lock/requirements-dev.txt

Large diffs are not rendered by default.

853 changes: 431 additions & 422 deletions lock/requirements.txt

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion pyproject.toml
@@ -21,7 +21,7 @@ classifiers = [
"Intended Audience :: Developers",
]
name = "ghga_datasteward_kit"
version = "4.4.1"
version = "4.5.0"
description = "GHGA Data Steward Kit - A utils package for GHGA data stewards."
dependencies = [
"crypt4gh >=1.6, <2",
@@ -67,6 +67,7 @@ fixable = [
"UP",
"I",
"D",
"RUF022",
]
ignore = [
"E111",
@@ -86,6 +87,7 @@ ignore = [
"D206",
"D300",
"UP040",
"PLC0206",
]
select = [
"C90",
2 changes: 2 additions & 0 deletions s3_upload_config.md
@@ -19,6 +19,8 @@

- **`selected_storage_alias`** *(string)*: Alias of the selected storage node/location. Has to match the backend configuration and must also be present in the local storage configuration. During the later ingest phase, the alias will be validated by the File Ingest Service.

- **`client_max_parallel_transfers`** *(integer)*: Maximum number of parallel transfer tasks for file parts. Exclusive minimum: 0. Default: 10.

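As a minimal sketch of how such a constrained option can be declared, the snippet below uses a pydantic field with an exclusive lower bound; the class name is a hypothetical stand-in, not the kit's actual config model.

```python
from pydantic import BaseModel, Field


class TransferConfig(BaseModel):
    """Illustrative subset of the upload config (hypothetical class name)."""

    client_max_parallel_transfers: int = Field(
        default=10,
        gt=0,  # exclusive minimum of 0, matching the documented constraint
        description="Maximum number of parallel transfer tasks for file parts.",
    )


# TransferConfig().client_max_parallel_transfers  # -> 10
# TransferConfig(client_max_parallel_transfers=0)  # raises a ValidationError
```
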
## Definitions

- <a id="$defs/S3Config"></a>**`S3Config`** *(object)*: S3 server specific config params.
14 changes: 7 additions & 7 deletions src/ghga_datasteward_kit/cli/file.py
@@ -24,10 +24,10 @@

log = logging.getLogger(__name__)

cli = typer.Typer()
cli = typer.Typer(no_args_is_help=True)


@cli.command()
@cli.command(no_args_is_help=True)
def legacy_upload(
input_path: Path = typer.Option(..., help="Local path of the input file"),
alias: str = typer.Option(..., help="A human readable file alias"),
@@ -37,7 +37,7 @@ def legacy_upload(
s3_upload.legacy_main(input_path=input_path, alias=alias, config_path=config_path)


@cli.command()
@cli.command(no_args_is_help=True)
def upload(
input_path: Path = typer.Option(..., help="Local path of the input file"),
alias: str = typer.Option(..., help="A human readable file alias"),
@@ -47,7 +47,7 @@ def upload(
s3_upload.main(input_path=input_path, alias=alias, config_path=config_path)


@cli.command()
@cli.command(no_args_is_help=True)
def legacy_batch_upload(
tsv: Path = typer.Option(
...,
@@ -73,7 +73,7 @@ def legacy_batch_upload(
)


@cli.command()
@cli.command(no_args_is_help=True)
def batch_upload(
tsv: Path = typer.Option(
...,
@@ -99,7 +99,7 @@ def batch_upload(
)


@cli.command()
@cli.command(no_args_is_help=True)
def ingest_upload_metadata(
config_path: Path = typer.Option(..., help="Path to a config YAML."),
):
@@ -117,7 +117,7 @@ def ingest_upload_metadata(
print("Successfully sent all file upload metadata for ingest.")


@cli.command()
@cli.command(no_args_is_help=True)
def delete_file(
file_id: str = typer.Option(
...,
6 changes: 3 additions & 3 deletions src/ghga_datasteward_kit/cli/main.py
@@ -29,12 +29,12 @@
TokenNotExistError,
)

cli = typer.Typer()
cli = typer.Typer(no_args_is_help=True)
cli.add_typer(file_cli, name="files", help="File related operations.")
cli.add_typer(metadata_cli, name="metadata", help="Metadata related operations.")


@cli.command()
@cli.command(no_args_is_help=True)
def generate_catalog_accessions(
*,
store_path: Path = typer.Option(
@@ -65,7 +65,7 @@ def generate_catalog_accessions(
typer.echo(accession)


@cli.command()
@cli.command(no_args_is_help=True)
def load(
*,
config_path: Path = typer.Option(
10 changes: 5 additions & 5 deletions src/ghga_datasteward_kit/cli/metadata.py
@@ -23,12 +23,12 @@

from ghga_datasteward_kit import metadata

cli = typer.Typer()
cli = typer.Typer(no_args_is_help=True)

cli.command()(transpile)
cli.command(no_args_is_help=True)(transpile)


@cli.command()
@cli.command(no_args_is_help=True)
def submit(
submission_title: str = typer.Option(..., help="The title of the submission."),
submission_description: str = typer.Option(
@@ -61,7 +61,7 @@
)


@cli.command()
@cli.command(no_args_is_help=True)
def generate_artifact_models(
config_path: Path = typer.Option(
...,
@@ -77,7 +77,7 @@
metadata.generate_artifact_models_from_path(config_path=config_path)


@cli.command()
@cli.command(no_args_is_help=True)
def transform(
config_path: Path = typer.Option(
...,
3 changes: 0 additions & 3 deletions src/ghga_datasteward_kit/s3_upload/config.py
@@ -119,9 +119,6 @@ class LegacyConfig(S3ObjectStoragesConfig):
default=5,
description="Number of times a request should be retried on non critical errors.",
)
debug: bool = Field(
default=False, description="Enable debug functionality for upload."
)

@field_validator("output_dir")
def expand_env_vars_output_dir(cls, output_dir: Path): # noqa: N805