A tool to launch Python projects on AWS.
It uploads your data and code to AWS and runs them on the infrastructure you need.
Almost no AWS setup is required, and there is no risk of leaving costly AWS resources behind after use.
- You have a piece of Python code on your machine that:
  - Reads data
  - Processes data
  - Generates new data
- To do so, it needs two things: code (environment and dependencies) and hardware
- Sometimes both need to change quickly: trying more/fewer CPU cores, GPUs, or memory; adding or editing dependencies; etc.
Cloud vendors offer multiple tools to use their environments, but:
- Overkill solutions
- Steep learning curve
- Changing APIs, docs...
- Costs can be hard to forecast
- They keep generating costs even when not in use
A tool to automate data management and code execution in AWS (for now):
- Create AWS infrastructure
- Send everything needed to S3: input data, code, auxiliary files
- Build & save your Docker image in the cloud
- Run the image as a container in the cloud. The container will automatically (see the sketch after this list):
  a. Download data from S3 to the container file system
  b. Run the code
  c. Send results from the container file system to S3
- Send a link to the results folder in S3 to the user
- Destroy all the infrastructure used except the S3 bucket and the image repository
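To make the container steps a–c concrete, here is a minimal sketch of what such an entrypoint could look like with boto3. The bucket name, S3 prefix, local `/job/*` paths, and the `main.py` entry script are illustrative assumptions, not the project's actual values.

```python
# Minimal sketch of the container entrypoint (not the actual implementation).
# BUCKET, JOB_PREFIX and the local /job/* paths are assumed placeholder values.
import os
import subprocess

import boto3

BUCKET = "my-job-bucket"          # hypothetical bucket created by the tool
JOB_PREFIX = "jobs/example-job"   # hypothetical prefix for this execution

s3 = boto3.client("s3")

def download_prefix(prefix, dest):
    """a. Download every object under an S3 prefix into a local folder."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):          # skip "folder" marker objects
                continue
            local = os.path.join(dest, os.path.relpath(key, prefix))
            os.makedirs(os.path.dirname(local), exist_ok=True)
            s3.download_file(BUCKET, key, local)

def upload_folder(src, prefix):
    """c. Send every file produced by the code back to S3."""
    for root, _, files in os.walk(src):
        for name in files:
            local = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(local, src)}"
            s3.upload_file(local, BUCKET, key)

download_prefix(f"{JOB_PREFIX}/input", "/job/input")             # step a
os.makedirs("/job/output", exist_ok=True)
subprocess.run(                                                   # step b
    ["python", "/job/code/main.py", "-i", "/job/input", "-o", "/job/output"],
    check=True,
)
upload_folder("/job/output", f"{JOB_PREFIX}/output")              # step c
```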
Tech stack:
- Python
- Docker
- Shell scripting
- AWS IAM: https://aws.amazon.com/es/iam/
- AWS EC2: https://aws.amazon.com/es/ec2/
- AWS ECS: under consideration
- AWS SSM: https://aws.amazon.com/es/systems-manager/
- AWS ECR: https://aws.amazon.com/es/ecr/
Challenges:
- Changing AWS APIs and docs
- Non-deterministic behaviours: network variability, AWS internal propagation time for changes to roles, permissions, etc. (see the sketch after this list)
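For the propagation-delay issue in particular, one mitigation is to retry the dependent call until AWS stops rejecting the just-created resource. The sketch below is one possible approach, not the project's actual code; the AMI, instance type, timings, and the error-code check are assumptions.

```python
# Hedged sketch: retry an EC2 launch until a freshly created IAM instance
# profile has propagated. Names, AMI, and timings are illustrative only.
import time

import boto3
import botocore.exceptions

ec2 = boto3.client("ec2")

def run_instance_when_profile_ready(profile_name, max_attempts=10, delay=5):
    for attempt in range(max_attempts):
        try:
            return ec2.run_instances(
                ImageId="ami-0123456789abcdef0",     # placeholder AMI
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
                IamInstanceProfile={"Name": profile_name},
            )
        except botocore.exceptions.ClientError as err:
            # A just-created profile is often rejected until IAM changes
            # propagate; the exact error code may vary.
            if err.response["Error"]["Code"] != "InvalidParameterValue":
                raise
            time.sleep(delay)
    raise TimeoutError(f"Instance profile {profile_name!r} never became usable")
```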
The project to be run must meet the following requirements:
- The script and all the imported modules must be packaged within a parent folder
- The parent folder must have a requirements.txt file
- The script must accept two mandatory arguments: "-i" (input folder full path) and "-o" (output folder full path); see the example below
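For illustration, a conforming project could look like this: a parent folder holding `main.py`, any imported modules, and a `requirements.txt`. The folder name, file names, and the trivial processing step below are invented for the example.

```python
# example_project/main.py -- minimal script meeting the interface above.
# "example_project", "input.txt" and "output.txt" are invented names.
import argparse
import os

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", required=True, help="input folder full path")
    parser.add_argument("-o", required=True, help="output folder full path")
    args = parser.parse_args()

    os.makedirs(args.o, exist_ok=True)
    # Trivial "processing": read a file from the input folder and write an
    # upper-cased copy to the output folder.
    with open(os.path.join(args.i, "input.txt")) as src, \
         open(os.path.join(args.o, "output.txt"), "w") as dst:
        for line in src:
            dst.write(line.upper())

if __name__ == "__main__":
    main()
```

Run locally as `python example_project/main.py -i /path/to/input -o /path/to/output` to check the interface before launching it on AWS.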
Future features:
- Allow more than one execution per AWS account (currently blocked by name and permission conflicts)
- Allow using GPU instances
- Fine-grained control over which infrastructure is kept or destroyed after running
- Allow checking job status
- Cost calculator (before execution: cost estimation; after execution: how much did it cost?)
- Send an email when the job is finished
- Allow using EC2 spot instances (cheaper)
Technical improvements:
- Check whether a file already exists before sending objects to S3, to minimize network traffic
- Implement exponential back-off in SSM calls (see the sketch below)
- Use a non-admin account internally to minimize security risks
- Use EC2 Image Builder?
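A rough sketch of the first two improvements (exponential back-off around SSM calls and skipping uploads of objects that already exist) could look like the following; the retry parameters and function names are illustrative, not existing code.

```python
# Hedged sketch of two proposed improvements; retry parameters are illustrative.
import random
import time

import boto3
import botocore.exceptions

ssm = boto3.client("ssm")
s3 = boto3.client("s3")

def send_command_with_backoff(instance_id, commands, max_attempts=6):
    """Exponential back-off with jitter around an SSM SendCommand call."""
    for attempt in range(max_attempts):
        try:
            return ssm.send_command(
                InstanceIds=[instance_id],
                DocumentName="AWS-RunShellScript",
                Parameters={"commands": commands},
            )
        except botocore.exceptions.ClientError:
            if attempt == max_attempts - 1:
                raise
            # Sleep 1s, 2s, 4s, ... (capped at 30s) plus random jitter.
            time.sleep(min(2 ** attempt, 30) * random.uniform(0.5, 1.5))

def object_already_uploaded(bucket, key):
    """Skip re-uploading objects that already exist in S3."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError:
        return False
```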