
Store result files to s3 #77

Merged

katxiao merged 6 commits into master from store-results-to-s3 on May 19, 2021
Conversation

katxiao (Contributor) commented on May 15, 2021:

If the user passes cache_dir as an s3://<bucket-name>/path/to/dir path, the scores, synthetic data, and error files are stored in the corresponding path of the given bucket. If aws_key and aws_secret are provided, they are used to authenticate the S3 requests.

✅ Ran with AWS credentials and a private S3 bucket, and verified that the result files were uploaded.

Resolve #81
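
For illustration only, the prefix-based routing described here could be sketched roughly as follows; the helper names (is_s3_path, parse_s3_path) are hypothetical and not the PR's actual code:

    # Hypothetical sketch, not the PR's code: route on the path prefix.
    S3_PREFIX = 's3://'

    def is_s3_path(path):
        """Return True when the given cache_dir should be treated as S3."""
        return str(path).startswith(S3_PREFIX)

    def parse_s3_path(path):
        """Split 's3://<bucket-name>/path/to/dir' into (bucket, key prefix)."""
        bucket, _, key_prefix = str(path)[len(S3_PREFIX):].partition('/')
        return bucket, key_prefix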

katxiao requested a review from csala on May 15, 2021 00:14
katxiao force-pushed the store-results-to-s3 branch 4 times, most recently from a55dd1f to ad700a9, on May 17, 2021 23:54
@@ -149,11 +150,71 @@ def _score_with_timeout(timeout, synthesizer, metadata, metrics, iteration):
return output


def _write_to_cache_dir(cache_dir, name, dataset_name, iteration, run_id, scores,
csala (Contributor) commented:

I'm thinking about changing this approach: instead of writing a method that handles all the different situations, just write a write_file(contents, path, aws_key, aws_secret) function that knows whether to write to S3 or to a local file depending on the path prefix.
Additionally, a write_csv could be added that just dumps the DataFrame as a CSV string and then calls write_file.

With that, we can keep the code almost as it was before, doing:

base_path = str(cache_dir / f'{name}_{dataset_name}_{iteration}_{run_id}')
if scores is not None:
    write_csv(scores, base_path + '_scores.csv')
if 'synthetic_data' in output:
    synthetic_data = compress_pickle.dumps(output['synthetic_data'])
    write_file(synthetic_data, base_path + '.data.gz')
if 'exception' in output:
    write_file(output['exception'], base_path + '_error.txt')
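
A minimal sketch of the two proposed helpers, assuming boto3 and pandas; the exact argument handling in the final PR may differ:

    import boto3

    S3_PREFIX = 's3://'

    def write_file(contents, path, aws_key=None, aws_secret=None):
        # Route on the path prefix: upload to S3 or write a local file.
        if path.startswith(S3_PREFIX):
            bucket, _, key = path[len(S3_PREFIX):].partition('/')
            client = boto3.client(
                's3',
                aws_access_key_id=aws_key,
                aws_secret_access_key=aws_secret,
            )
            body = contents if isinstance(contents, bytes) else contents.encode('utf-8')
            client.put_object(Bucket=bucket, Key=key, Body=body)
        else:
            # Text (scores, errors) uses 'w'; gzip payloads need binary mode.
            write_mode = 'wb' if isinstance(contents, bytes) else 'w'
            with open(path, write_mode) as f:
                f.write(contents)

    def write_csv(data, path, aws_key=None, aws_secret=None):
        # Dump the DataFrame as a CSV string, then delegate to write_file.
        write_file(data.to_csv(index=False), path, aws_key, aws_secret)

With those two in place, the calling code above stays close to the original local-only version.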

@@ -0,0 +1,48 @@
import boto3
csala (Contributor) commented:

I would rename this module to s3.py; since S3 is not handled anywhere else, there is no ambiguity.

katxiao force-pushed the store-results-to-s3 branch 3 times, most recently from 5c29ab0 to 9c5536c, on May 18, 2021 21:01
katxiao force-pushed the store-results-to-s3 branch from 9c5536c to ea87801 on May 19, 2021 17:54
    )
else:
    with open(path, write_mode) as f:
        if write_mode == 'w':
katxiao (Contributor, Author) commented:
@csala I think you're right about writing text files with write mode w instead of wb. I updated it to special-case the gzip files to write mode wb.
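
As a sketch, the gzip special-casing mentioned here might boil down to choosing the mode from the file extension (illustrative, not the exact diff):

    # Illustrative: gzip payloads from compress_pickle are bytes, so only
    # '.gz' files get binary mode; scores and error text stay in text mode.
    write_mode = 'wb' if path.endswith('.gz') else 'w'
    with open(path, write_mode) as f:
        f.write(contents)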

csala (Contributor) left a comment:
All looks good! Let's wait for the rest of the PRs to be merged to this, and then this can go in

katxiao force-pushed the store-results-to-s3 branch from 388014c to ea87801 on May 19, 2021 18:40
katxiao changed the title from "Allow cache dir to be a s3 bucket" to "Store result files to s3" on May 19, 2021
katxiao merged commit cd9c411 into master on May 19, 2021
katxiao deleted the store-results-to-s3 branch on May 19, 2021 22:01
Successfully merging this pull request may close: Store cache contents into an S3 bucket (#81)