# Hosting and Populating the Database
Before deploying cardinal, you will need to host and populate a database for it. Below are instructions for hosting the database on AWS's RDS service.
Before creating and populating your database, you should make sure that the datasets you want to run workflows on are stored at the appropriate storage locations.
- Cardinals deployed on Google Cloud:
- Log into your Google Cloud console.
- In the side menu, find and click "Cloud Storage" under the "Storage" submenu.
- Click "Create Bucket" at the top if you don't already have a bucket you want to use. (The name of this bucket may be referred to as the "source_bucket" in cardinal workflows)
- Click into your bucket and click "Upload Files" to upload a single dataset to this bucket, or "Upload Folder" to upload an entire folder of datasets. (The file path of the dataset file within the bucket will be referred to as its "source_key" in cardinal workflows)
- Cardinals deployed on Amazon EKS:
- Log into your AWS console.
- In the "Services" menu, click "S3."
- Click "Create Bucket" at the top if you don't already have a bucket you want to use. (The name of this bucket may be referred to as the "source_bucket" in cardinal workflows)
- Click into your bucket and click "Upload." Then click either "Add files" or "Add folder" to add your dataset files to this bucket. (The file path of the dataset file within the bucket will be referred to as its "source_key" in cardinal workflows; see the sketch after these provider-specific steps)
- Cardinals deployed on Azure AKS:
- Log into your Azure console.
- Under "Azure resources" click "Storage Accounts" and then click "Create" if you don't already have a storage account.
- Once your storage account has been made, click into the account and then click "Containers" in the left-side menu under "Data Storage."
- Click "Container" if you don't already have a container you want to use. (The name of this container may be referred to as the "source_bucket" in cardinal workflows)
- Click into the container and click "Upload" to upload your dataset files to this container. (The file path of the dataset file within the container will be referred to as its "source_key" in cardinal workflows)
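If you prefer to script the upload instead of using the console, here is a minimal sketch for the Amazon S3 case using the boto3 library (the Google Cloud and Azure storage SDKs follow the same bucket-plus-key pattern). The bucket name and file paths are placeholders for illustration, not values cardinal requires.

```python
# Minimal sketch: upload a dataset file to S3 and keep track of the two values
# that cardinal workflows later refer to as "source_bucket" and "source_key".
# The bucket name and paths below are placeholders.
import boto3

source_bucket = "my-cardinal-datasets"      # bucket created in the console steps above
source_key = "datasets/party1/input.csv"    # file path of the dataset within the bucket

s3 = boto3.client("s3")
s3.upload_file("local_data/input.csv", source_bucket, source_key)
```

Whatever bucket name and file path you use here are the values you will later supply as "source_bucket" and "source_key" when registering the dataset.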
- Log into your AWS account and navigate to the RDS section. (make sure your account has full RDS permissions)
- Click "Databases" on the lefthand menu.
- In the top right, click "Create database".
- Click the "Easy create" option on top, and select "MySQL" for the engine type.
- These instructions use a free-tier instance size, but you should change this to fit your needs.
- Name your database and set your username and password.
- In the "Connectivity" section, click "Yes" for Public Access.
- For VPC security groups, select a security group that allows inbound traffic from whatever IPs need to access your database (i.e. your machine and the cardinal servers at the very least).
- Click "Create".
- If you had AWS generate a password for you, make sure to click "View credentials details" in the blue banner at the top of the screen and copy your password.
- You can now access this database via the terminal.
- Click the database you just created and look under the "Connectivity and security" tab to find the database endpoint and port.
- Run `mysql -h {endpoint} -P {port} -u {username} -p` and then enter your password when prompted to access your database. (A Python-based connectivity check is sketched after this list.)
- Now, you can populate the database.
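As an alternative to the mysql client, you can verify connectivity from Python. The sketch below assumes the pymysql package (any MySQL client library would work); replace the placeholders with the endpoint, port, and credentials from the RDS console.

```python
# Minimal connectivity check against the RDS instance. Replace the placeholders
# with the values shown under the "Connectivity and security" tab.
import pymysql

conn = pymysql.connect(
    host="{endpoint}",
    port=3306,              # use the port shown in the console if it differs
    user="{username}",
    password="{password}",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone())  # prints the MySQL server version
finally:
    conn.close()
```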
If you have several datasets that you want to add to the database, you may find it helpful to go through the following steps to pre-populate your database before running workflows. If you only have a few datasets to add, you may deploy cardinal first and then add the datasets using the Cardinal API.
- First, connect to your database and run the SQL script included in the repo, `cardinal-creation-script.sql`, to create the necessary schemas/tables.
- The following are the tables that will be created and their schemas:
`dataset` (available datasets, their locations, and their parties):

Field | Type | Description |
---|---|---|
id | int(11) | auto-incremented ID of dataset internally |
dataset_id | varchar(150) | user-provided identifier of dataset (some unique string) |
source_bucket | varchar(150) | bucket name of where the dataset is stored (must be under the same account as your cardinal) |
source_key | varchar(150) | file path within the bucket for your dataset |
pid | int(11) | which party's section of the data this row represents |
`jiff_server` (INTERNAL: all active JIFF servers, used to track running jobs):

Field | Type | Description |
---|---|---|
workflow_name | varchar(150) | name of the workflow using this JIFF server |
ip_addr | varchar(150) | IP address of server |
`pod` (INTERNAL: all running pods):

Field | Type | Description |
---|---|---|
id | int(11) | auto-incremented ID of pod internally |
workflow_name | varchar(150) | name of workflow using this pod |
pid | int(11) | party that is using this pod |
ip_addr | varchar(150) | IP address of pod's service |
`pod_event_timestamp` (INTERNAL: timestamps for profiling different parts of a workflow):

Field | Type | Description |
---|---|---|
id | int(11) | auto-incremented ID of row internally |
workflow_name | varchar(150) | name of workflow whose timestamps are in this row |
pid | int(11) | party to whom timestamps belong |
jiff_server_launched | time | when the JIFF server was launched |
service_ip_retrieved | time | when the congregation service IP was retrieved |
exchanged_ips | time | when the other cardinals' IPs were both received |
built_specs_configs | time | when all the K8s specs were built |
launched_config | time | when the config map was launched |
launched_pod | time | when the congregation pod was launched |
pod_succeeded | time | when the pod succeeded |
workflow_stopped | time | when the workflow was stopped |
`pod_resource_consumption` (INTERNAL: how much CPU and memory the pods use):

Field | Type | Description |
---|---|---|
id | int(11) | auto-incremented ID of row internally |
pid | int(11) | party whose stats these are |
workflow_name | varchar(1000) | name of workflow whose stats these are |
cpu_usage | int(11) | how much CPU was being used at a particular time |
memory_usage | int(11) | how much memory was being used at a particular time |
timestamp | time | time of this ping |
- Upload your datasets to a storage bucket in the same account as your cardinal deployment.
- For the `dataset` table only, prepare a separate .csv or .xlsx file whose headers match the table's column names and that includes all the datasets you would like to add to the database. The file locations should match the paths of the files you uploaded in the previous step. (An illustrative sketch of preparing such a file appears at the end of this page.)
- Open a terminal and make sure you are in the top-level directory of the cardinal repo. Make sure you have a .env file there, filled out as shown below, and then run the following command on each of your .csv/.xlsx files to populate the database:
.env file:

```
MYSQL_HOST="{endpoint}"
MYSQL_PORT={port}
MYSQL_USER="{username}"
MYSQL_PASSWORD="{password}"
MYSQL_DB="{database name}"
```

Populate command: `python csv_script.py {path_to_csv}`
- You are now ready to make cardinal deployments.
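For reference, here is one way to prepare the .csv for the `dataset` table, mentioned in the steps above, using Python's csv module. All values are placeholders, and it is assumed that the auto-incremented `id` column is filled in by the database and can therefore be omitted from the file; adapt the rows to the datasets you actually uploaded.

```python
# Illustrative sketch: write a dataset.csv whose headers match the columns of the
# `dataset` table. Each party's portion of a dataset gets its own row with its pid.
# All values are placeholders; the auto-incremented `id` column is assumed to be
# generated by the database and is omitted.
import csv

fieldnames = ["dataset_id", "source_bucket", "source_key", "pid"]
rows = [
    {"dataset_id": "demo-data", "source_bucket": "my-cardinal-datasets",
     "source_key": "datasets/party1/input.csv", "pid": 1},
    {"dataset_id": "demo-data", "source_bucket": "my-cardinal-datasets",
     "source_key": "datasets/party2/input.csv", "pid": 2},
]

with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

With the .env file in place, this file would then be loaded with `python csv_script.py dataset.csv`.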