Hosting and Populating the Database

Before deploying cardinal, you will need to host and populate a database for it. Below are the instructions for how to host the database on AWS's RDS service.

Upload your datasets

Before creating and populating your database, you should make sure that the datasets you want to run workflows on are stored at the appropriate storage locations. The console steps for each cloud provider are listed below; command-line equivalents are sketched after the list.

  1. Cardinals deployed on Google Cloud:
    1. Log into your Google Cloud console.
    2. In the side menu, find and click "Cloud Storage" under the "Storage" submenu.
    3. Click "Create Bucket" at the top if you don't already have a bucket you want to use. (The name of this bucket may be referred to as the "source_bucket" in cardinal workflows)
    4. Click into your bucket and click "Upload Files" to upload a single dataset to this bucket, or "Upload Folder" to upload an entire folder of datasets. (The file path of the dataset file within the bucket will be referred to as its "source_key" in cardinal workflows)
  2. Cardinals deployed on Amazon EKS:
    1. Log into your AWS console.
    2. In the "Services" menu, click "S3."
    3. Click "Create Bucket" at the top if you don't already have a bucket you want to use. (The name of this bucket may be referred to as the "source_bucket" in cardinal workflows)
    4. Click into your bucket and click "Upload." Then click either "Add files" or "Add folder" to add your dataset files to this bucket. (The file path of the dataset file within the bucket will be referred to as its "source_key" in cardinal workflows)
  3. Cardinals deployed on Azure AKS:
    1. Log into your Azure console.
    2. Under "Azure resources" click "Storage Accounts" and then click "Create" if you don't already have a storage account.
    3. Once your storage account has been made, click into the account and then click "Containers" in the left-side menu under "Data Storage."
    4. Click "Container" if you don't already have a container you want to use. (The name of this container may be referred to as the "source_bucket" in cardinal workflows)
    5. Click into the container and click "Upload" to upload your dataset files to this container. (The file path of dataset file within the bucket will be referred to as its "source_key" in cardinal workflows)
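
If you prefer the command line, the same uploads can be done with each provider's CLI. The following is a minimal sketch, assuming you have already authenticated with gcloud, the AWS CLI, or the Azure CLI; the bucket/container names, storage account name, and file paths are placeholders for your own values.

# Google Cloud Storage
gsutil cp ./datasets/party1_data.csv gs://my-source-bucket/party1_data.csv

# Amazon S3
aws s3 cp ./datasets/party1_data.csv s3://my-source-bucket/party1_data.csv

# Azure Blob Storage (requires an existing storage account and container)
az storage blob upload --account-name mystorageaccount --container-name my-source-container --file ./datasets/party1_data.csv --name party1_data.csv

In each case, the bucket or container name is your "source_bucket" and the object name (here party1_data.csv) is the "source_key" that you will later register in the database.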

Hosting the Database

  1. Log into your AWS account and navigate to the RDS section. (make sure your account has full RDS permissions)
  2. Click "Databases" on the lefthand menu.
  3. In the top right, click "Create database".
    1. Click the "Easy create" option on top, and select "MySQL" for the engine type.
    2. For now, we use a free tier instance size, but please change this to fit your needs.
    3. Name your database and set your username and password.
    4. In the "Connectivity" section, click "Yes" for Public Access.
    5. For VPC security groups, select a security group that allows inbound traffic from whatever IPs need to access your database (i.e. your machine and the cardinal servers at the very least).
    6. Click "Create".
  4. If you had AWS generate a password for you, please make sure to click "View credentials details" in the blue loading box at the top of the screen to copy your password.
  5. You can now access this database via the terminal.
    1. Click the database you just created and look under the "Connectivity and security" tab to find the database endpoint and port.
    2. Run mysql -h {endpoint} -P {port} -u {username} -p and then enter your password in the terminal to access your database. (A concrete example is shown after this list.)
  6. Now, you can populate the database.
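
As an optional alternative to the console steps above, a roughly equivalent MySQL instance can be created with the AWS CLI; this is only a sketch, and the identifier, instance class, storage size, password, and security group ID are placeholders to adjust to your needs. The last command is a concrete example of the connection command from step 5, with a hypothetical endpoint and username.

# sketch: create a small, publicly accessible MySQL instance (adjust values to fit your needs)
aws rds create-db-instance --db-instance-identifier cardinal-db --engine mysql --db-instance-class db.t3.micro --allocated-storage 20 --master-username admin --master-user-password {password} --publicly-accessible --vpc-security-group-ids sg-0123456789abcdef0

# connect once the instance is available (you will be prompted for your password)
mysql -h cardinal-db.abc123xyz456.us-east-1.rds.amazonaws.com -P 3306 -u admin -p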

Populating the database

If you have several datasets that you want to add to the database, you may find it helpful to go through the following steps to pre-populate your database before running workflows. If you only have a few datasets to add, you may deploy cardinal first and then add the datasets using the Cardinal API.

  1. First connect to your database and run the SQL script included in the repo called cardinal-creation-script.sql to create the necessary schemas/tables.
  2. The following are the tables that will be created and their schemas (a sample insert into the dataset table is sketched after the tables):

dataset (available datasets, their locations, and their parties):

| Field | Type | Description |
| --- | --- | --- |
| id | int(11) | auto-incremented ID of dataset internally |
| dataset_id | varchar(150) | user-provided identifier of dataset (some unique string) |
| source_bucket | varchar(150) | bucket name of where the dataset is stored (must be under the same account as your cardinal) |
| source_key | varchar(150) | file path within the bucket for your dataset |
| pid | int(11) | which party's section of the data this row represents |

jiff_server (INTERNAL: all active JIFF servers - used to track running jobs):

| Field | Type | Description |
| --- | --- | --- |
| workflow_name | varchar(150) | name of the workflow using this JIFF server |
| ip_addr | varchar(150) | IP address of server |

pod (INTERNAL: all running pods):

| Field | Type | Description |
| --- | --- | --- |
| id | int(11) | auto-incremented ID of pod internally |
| workflow_name | varchar(150) | name of workflow using this pod |
| pid | int(11) | party that is using this pod |
| ip_addr | varchar(150) | IP address of pod's service |

pod_event_timestamp (INTERNAL: timestamps for profiling different parts of a workflow):

| Field | Type | Description |
| --- | --- | --- |
| id | int(11) | auto-incremented ID of row internally |
| workflow_name | varchar(150) | name of workflow whose timestamps are in this row |
| pid | int(11) | party to whom timestamps belong |
| jiff_server_launched | time | when the JIFF server was launched |
| service_ip_retrieved | time | when the congregation service IP was retrieved |
| exchanged_ips | time | when the other cardinals' IPs were both received |
| built_specs_configs | time | when all the K8s specs were built |
| launched_config | time | when the config map was launched |
| launched_pod | time | when the congregation pod was launched |
| pod_succeeded | time | when the pod succeeded |
| workflow_stopped | time | when the workflow was stopped |

pod_resource_consumption (INTERNAL: how much CPU and memory the pods use):

| Field | Type | Description |
| --- | --- | --- |
| id | int(11) | auto-incremented ID of row internally |
| pid | int(11) | party whose stats these are |
| workflow_name | varchar(1000) | name of workflow whose stats these are |
| cpu_usage | int(11) | how much CPU was being used at a particular time |
| memory_usage | int(11) | how much memory was being used at a particular time |
| timestamp | time | time of this ping |
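
Only the dataset table is populated by you; the tables marked INTERNAL are filled in by cardinal while workflows run. As a rough illustration of what a dataset row contains, a manual insert might look like the following sketch, where the dataset ID, bucket, key, and party number are hypothetical (the id column is omitted because it auto-increments):

mysql -h {endpoint} -P {port} -u {username} -p -D {database name} -e "INSERT INTO dataset (dataset_id, source_bucket, source_key, pid) VALUES ('my-dataset', 'my-source-bucket', 'party1_data.csv', 1);"

In practice, the remaining steps use csv_script.py to insert rows like this in bulk from a file.
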
  3. Upload your datasets to a storage bucket in the same account as your cardinal deployment.
  4. For the dataset table only, prepare a .csv or .xlsx file whose headers match the table's column names and that lists all the datasets you would like to include in the database. The source_bucket and source_key values should match the locations of the files you uploaded in the previous step. (A sample file is sketched after the populate command below.)
  5. Open a terminal and make sure you are in the top-level directory of the cardinal repo. Make sure you have a .env file here, filled out as shown below, and then run the following command on each of your .csv/.xlsx files to populate the database:

.env file:

MYSQL_HOST="{endpoint}"
MYSQL_PORT={port}
MYSQL_USER="{username}"
MYSQL_PASSWORD="{password}"
MYSQL_DB="{database name}"

Populate command: python csv_script.py {path_to_csv}
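
For reference, a dataset .csv for a single hypothetical dataset held by two parties might look like the sketch below; the headers mirror the dataset table's columns and every value is a placeholder for your own data.

id,dataset_id,source_bucket,source_key,pid
1,my-dataset,my-source-bucket,party1_data.csv,1
2,my-dataset,my-source-bucket,party2_data.csv,2

If this were saved as, say, datasets.csv, you would load it with python csv_script.py datasets.csv.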

  6. You are now ready to make cardinal deployments.