
A Hands On Introduction to AWS: April 22, 2021

Saranya Canchi edited this page Apr 22, 2021 · 2 revisions

Thursday, April 22, from 10:00am - 12:00pm PST

Instructors: Abhijna Parigi and Jose Sanchez

Moderator: Saranya Canchi

Description

This 2-hour hands-on tutorial will introduce you to creating a computer "in the cloud" and logging into it via Amazon Web Services (AWS). We'll create a small general-purpose Linux computer, connect to it, and run a small job while discussing the concepts and technologies involved.

While we wait to get started --

  1. ✔️ Have you checked out the pre-workshop resources page?

  2. If you are on a Windows computer, make sure you have MobaXterm: https://mobaxterm.mobatek.net/

Hello!

I'm Abhijna Parigi and I'm joined today by Jose Sanchez. We are both part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.

Have you heard of the NIH Common Fund Data Ecosystem?

Put up a ✔️ for yes and a ❎ for no!

You can contact us at aaparigi@ucdavis.edu and ronsanchez@ucdavis.edu.

We have the following goals for this workshop:

  • Help you think about if and how to use cloud computers for your work!
  • Gather questions, feedback and refine the tutorial materials!

So, please ask lots of questions, and even the ones we can't answer yet we'll figure out for you!

Costs and payment

Today, everything you do will be paid for by us. In the future, if you create your own AWS account, you'll have to put your own credit card on it. We'd be happy to answer questions about how to pay for AWS.

😺 Your free login credentials will work for the next 8 hours

Workshop structure and plan

  • Brief introduction to AWS and the cloud
  • Set up an instance and connect to it
  • Install and run things in the cloud computer
  • Learn how to download output files to local machine
  • Take your questions

How to ask questions

If you have questions at any point,

  • Drop them in the chat, or
  • Direct messages to the moderator (Saranya Canchi) are welcome, or
  • Unmute yourself and ask during the workshop

We're going to use the "raise hand" reaction in Zoom to make sure people are on board during the hands-on activities.

Some background

What is cloud computing?

  • Renting and use of IT services over the internet.
  • No direct, active management by the user.
  • Avoid or minimize up-front IT infrastructure cost.
  • Amazon and Google, among others, rent compute resources over the internet for money.

Why might you want to use a cloud computer?

There are lots of reasons, but basically "you need a kind of compute or network access that you don't have."

  • More memory than you have available otherwise
  • An operating system you don't have access to (Windows? Mac?)
  • Installation privileges for software
  • May not want to install brand new software on your local computer

Amazon, terminology, and logging in!

  • Amazon Web Services (AWS) is one of the most broadly adopted cloud platforms
  • It is a hosting provider that gives you a lot of services including cloud storage and cloud compute.

Terminology:

  • Instance - a computer that is running ...somewhere in "the cloud". The important thing is that someone else is worrying about the hardware etc, so you're just renting what you need!
  • Cloud computer - same as an "instance".
  • Image - the basic computer install from which an instance is constructed. The configuration of your instance at launch is a copy of the Amazon Machine Image (AMI)
  • EC2 - elastic compute cloud.

Amazon's main compute rental service is called Elastic Compute Cloud (or EC2) and that's what we'll be showing you today.

EC2

  • Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud.
  • Basically, you rent virtual computers that are configured according to your needs and run applications and analyses on that computer.
  • Best suited for analyses that could crash your local computer. E.g. those that generate or use large output files or take too long

Advantages of using AWS

  • Sign-up process is relatively easy (you need a credit card and some patience to deal with delays in two-factor authentication)
  • Simple billing
  • Stable services with only 3-4 major outages that only lasted 2-3 hours and did not affect all customers (region-specific). A large team of employees who are on top of any problems that arise!
  • Lots of people use it, so there are a ton of resources
  • Spot instances (unused EC2 instances) - you can bid for a price. It is cheap, but your services might be terminated if someone outbids you.

Let's get started!

We will create a cloud computer - an "instance" - and then log in to it.

Log in at: https://cfde-training-workshop.signin.aws.amazon.com/console

Use your registration e-mail (see bottom of this page if you forgot!) and password CFDErocks!

Put up a ✋ on Zoom when you've successfully logged in with the workshop user credentials.

"Spinning up" instances

Checklist for hands-on walk-through

  • Select a region: geographic area where AWS has data centers
  • Pick the AMI (OS)
  • Pick an instance type (t2.micro, free tier!)
  • Edit security groups
  • Launch
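
The same checklist can also be written down as an AWS command line call. The sketch below only assembles and prints an `aws ec2 run-instances` command; nothing is launched unless you run the printed command yourself with the AWS CLI installed and configured. The AMI ID, key pair name, and security group ID are hypothetical placeholders, and the region is an assumption taken from the `us-west-1` hostname used later on this page.

```shell
#!/usr/bin/env bash
# Sketch: the console checklist expressed as an AWS CLI command (printed, not run).
# Every ID below is a hypothetical placeholder; replace it with your own value.

REGION="us-west-1"                      # assumption: region seen in the scp example
AMI_ID="ami-xxxxxxxxxxxxxxxxx"          # placeholder: the Ubuntu AMI you picked
INSTANCE_TYPE="t2.micro"                # free-tier instance type from the checklist
KEY_NAME="amazon"                       # placeholder: key pair name (amazon.pem)
SECURITY_GROUP="sg-xxxxxxxxxxxxxxxxx"   # placeholder: security group allowing SSH

build_launch_command() {
  # Print each word of the command followed by a space, then a newline
  printf '%s ' aws ec2 run-instances \
    --region "$REGION" \
    --image-id "$AMI_ID" \
    --instance-type "$INSTANCE_TYPE" \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SECURITY_GROUP" \
    --count 1
  printf '\n'
}

build_launch_command
```

Printing the command first makes it easy to compare each flag with the choice you made in the console before anything is billed.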

Link to tutorial

Connecting to instances

Other ways to connect to the instance:

We have tutorials on connecting to an instance for Windows Users using MobaXterm and for Mac Users using MacOS Terminal. Please visit our "Connect to an Instance" webpage and select your OS using the tabs on the top of the page.

Installing programs and running them in the cloud

  • Install a simple bioinformatics program (FastQC)
  • Download fastq (raw RNA Sequence) data
  • Run fastqc on downloaded data
  • Transfer output files from AWS computer to local computer.

What is FastQC?

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.

  • It provides a modular set of analyses to help you identify problems in the quality of your samples or sequence.
  • The aim of this tool is to spot issues that originate from the sequencer or in the starting library material.
  • Output of fastqc is an HTML based permanent report

FastQC Documentation

Commands to run

(explain commands)

👎 Copy+Paste does not work if you are using Safari (MacOS) to run the AWS terminal. Please use another web browser (e.g. Chrome or Firefox), or type in the commands.

  1. Update system packages:
sudo apt update
  2. Make a directory
mkdir fastq
  3. Change into the directory
cd fastq
  4. Download a fastq data file from osf.io
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz

Click the raised hand ✋ reaction if you were able to run the last command successfully and download ERR458494.fastq.gz

  5. Check if your file has been downloaded
ls -l
  6. Install FastQC
sudo apt install fastqc -y

To double check it was successful, type fastqc --version. If it returns FastQC v0.11.9, that means installation was successful.

  7. Run FastQC on the downloaded file
fastqc ERR458494.fastq.gz
  8. View files
ls

Learn more about the commands

apt-cache search [search term 1]

  • search available software for installation

sudo apt update

  • download package information from all configured sources on the internet
  • This will update the package lists from all repositories in one go. Remember to do this after every added repository!

sudo apt install <program1> -y

  • install package
  • Other programs can be installed the same way (e.g., "ncbi-blast+")

mkdir <directory name>

  • make a new directory
  • equivalent to making a new folder in Windows

cd <directory name>

  • change directory
  • equivalent to double clicking a folder

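curl -L <url> -o <filename>

  • curl stands for "Client URL"
    • transfers data to or from a network server
    • "-L" (location) follows redirects if the server reports that the file has moved
    • "-o" (output) writes the download to the given filename instead of printing it to the screen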
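
After a download it can be worth confirming that the file arrived intact. The sketch below is one possible check, not part of the workshop steps: it tests the gzip archive and counts the reads, relying on the fact that a FASTQ record occupies four lines. The function name `check_fastq_gz` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a downloaded .fastq.gz file.
# check_fastq_gz is a hypothetical helper name, not a standard tool.

check_fastq_gz() {
  local file="$1"
  # gzip -t tests the integrity of the compressed file without extracting it
  if ! gzip -t "$file"; then
    echo "corrupt or incomplete: $file" >&2
    return 1
  fi
  # A FASTQ record is 4 lines, so number of reads = number of lines / 4
  local lines
  lines=$(gzip -dc "$file" | wc -l)
  echo $(( lines / 4 ))
}

# Usage on the instance (prints the number of reads in the file):
# check_fastq_gz ERR458494.fastq.gz
```

A truncated download usually fails the `gzip -t` step, which is a quicker signal than waiting for FastQC to stop partway through.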

fastqc ERR458494.fastq.gz

  • Run FastQC on ERR458494.fastq.gz
  • ERR458494.fastq.gz - "Yeast" Sample
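
The workshop runs FastQC on a single file. If you later have several files, a loop along these lines is one way to process them all. This is a minimal sketch: the function name `run_fastqc_all` and the `reports` directory are assumptions for illustration, while `--outdir` is FastQC's option for choosing where reports are written.

```shell
#!/usr/bin/env bash
# Sketch: run FastQC on every .fastq.gz file in a directory.
# run_fastqc_all is a hypothetical helper name.

run_fastqc_all() {
  local in_dir="$1" out_dir="$2"
  # FastQC expects the output directory to exist already
  mkdir -p "$out_dir"
  local f
  for f in "$in_dir"/*.fastq.gz; do
    [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
    fastqc --outdir "$out_dir" "$f"
  done
}

# Usage on the instance:
# run_fastqc_all ~/fastq ~/fastq/reports
```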

More About FastQC

Analysis Modules Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

What a good data file looks like https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

What bad data looks like https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Video Walkthrough:

FastQC tool for read data quality evaluation

Using FastQC to check the quality of high throughput sequence


Downloading data from AWS instance onto local computer

Windows

MobaXterm installation

  1. Go to the MobaXterm website to download
  2. Click on "GET MOBAXTERM NOW!"
  3. The Home Edition is perfect for normal use and it is free! Click "Download now"
  4. Click on "MobaXterm Home Edition v20.6 (Portable edition)" and save as in your Downloads folder
  5. Go to Downloads folder, click on the zipped folder, click "Extract all", click "Extract"
  6. The MobaXterm application is now in the unzipped folder
  7. Click on the MobaXterm application to open it!

Connecting to instance

  1. Go back to your instance page, select it and click on "Connect". The Public DNS information you need to connect to your instance via ssh can be found in the "SSH client" tab:

  1. In MobaXterm, click on "Session"
  2. Click on "SSH"
  3. Enter the Public DNS as the "Remote host"
  4. Check box next to "Specify username" and enter "ubuntu" as the username
  5. Click the "Advanced SSH settings" tab
  6. Check box by "Use private key"
  7. Use the document icon to navigate to where you saved the private key (e.g., "amazon.pem") from AWS on your computer. It is likely on your Desktop or Downloads folder
  8. Click "OK"
  9. A terminal session should open up with a left-side panel showing the file system of your AWS instance! You can click on the FastQC html file and choose to open it in a browser. There are also options in the panel to download files.

MacOS/Linux

  • Start Terminal
  • Change the permissions on the .pem file for security purposes (removes read, write, and execute permissions for all users except the owner, i.e. you)
chmod og-rwx ~/Desktop/amazon.pem
  • Change directory to Desktop. Your .pem file is on your Desktop
cd ~/Desktop
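
ssh refuses to use a private key that other users on your computer can read, which is why the chmod step matters. The sketch below shows one way to confirm the change took effect by inspecting the permission string printed by `ls -l`. The function name `key_is_private` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: confirm that only the owner can access a private key file.
# key_is_private is a hypothetical helper name.

key_is_private() {
  local key="$1"
  local perms
  # Characters 5-10 of the ls -l mode string are the group and other bits
  perms=$(ls -l "$key" | cut -c5-10)
  [ "$perms" = "------" ]
}

# Usage on your local computer:
# key_is_private ~/Desktop/amazon.pem && echo "key permissions look fine"
```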

Go back to your instance page, select it and click on "Connect". The information you need to connect to your instance via ssh can be found in the "SSH client" tab:

  • Use the scp command on your local terminal to copy your .html file!
scp -i <your-.pem> ubuntu@???-??-??-???-??.us-west-1.compute.amazonaws.com:/home/ubuntu/fastq/ERR458494_fastqc.html ./

-i flag points to identity file. Don't forget to change the stuff after ubuntu@ to match your instance!
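
FastQC writes a .zip archive next to the .html report, and scp accepts more than one source, so both can be fetched in a single call. The sketch below only prints the command so you can check it before running it. The function name `build_scp_command` and the example hostname are placeholders, not values from the workshop.

```shell
#!/usr/bin/env bash
# Sketch: print an scp command that fetches both FastQC outputs.
# build_scp_command is a hypothetical helper; the host must be your instance's Public DNS.

build_scp_command() {
  local key="$1" host="$2" dest="${3:-.}"
  local remote="ubuntu@${host}:/home/ubuntu/fastq"
  echo "scp -i ${key} ${remote}/ERR458494_fastqc.html ${remote}/ERR458494_fastqc.zip ${dest}"
}

# Usage on your local computer (placeholder hostname):
# build_scp_command amazon.pem ec2-host.example.com
```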

Shutting down instances

When you shut down your instance, any data that is on a non-persistent disk goes away permanently. But you also stop being charged for any compute and data, too!

💡 Stopping vs hibernation vs termination

  • Stopping:

    • saves data to EBS root volume
    • only EBS data storage charges apply
    • No data transfer charges or instance usage charges
    • RAM contents not stored
  • Hibernation:

    • charged for storage of any EBS volumes
    • stores the RAM contents
    • it's like closing the lid of your laptop
  • Termination:

    • complete shutdown
    • EBS volume is detached
    • data stored in EBS root volume is lost forever
    • instance cannot be relaunched

To enable Hibernation, click the box in the Configure Instance step of the setup.
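
For reference, the three options map onto AWS CLI calls roughly as sketched below. The function prints the command instead of running it, and the instance ID in the usage line is a placeholder. Hibernation only works if it was enabled when the instance was launched, as noted above.

```shell
#!/usr/bin/env bash
# Sketch: print the AWS CLI command for stopping, hibernating, or terminating.
# instance_command is a hypothetical helper name.

instance_command() {
  local action="$1" instance_id="$2"
  case "$action" in
    stop)      echo "aws ec2 stop-instances --instance-ids $instance_id" ;;
    hibernate) echo "aws ec2 stop-instances --instance-ids $instance_id --hibernate" ;;
    terminate) echo "aws ec2 terminate-instances --instance-ids $instance_id" ;;
    *)         echo "unknown action: $action" >&2; return 1 ;;
  esac
}

# Usage (placeholder instance ID):
# instance_command stop i-xxxxxxxxxxxxxxxxx
```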

Exercise

Launch a t2.nano, Ubuntu 20.04 LTS - Focal instance in the US East (Ohio) region. Change the root storage volume to 16 GiB and add an additional EBS volume (8 GiB).

Bonus points: Your added volume will persist after you have terminated your instance. Where can you find it?

Hints:

  • Go to AWS Marketplace and search for "Ubuntu 20.04 LTS - Focal". It should be the first result.
  • Look in tab 4, called "Add Storage", to add additional storage volumes.

Bonus Module (time permitting)

Using screen

So far in this workshop, we have only encountered programs that install quickly. The analysis we ran was also pretty quick because we only ran it on one file!

In your own work, you may encounter programs that have lengthy installations, and/or you may need to analyze a large number of files.

While performing a long-running task on a remote machine, a sudden drop in your internet connection would terminate the SSH session and your work would be lost!

The screen utility provides a work-around to this problem. screen is a terminal multiplexer i.e. you can open many virtual terminals. Processes running in screen will continue to run even when the terminal is not visible, or if you get disconnected from the internet.

Commands to run

  1. Install screen:
sudo apt-get install screen
  2. Start screen
screen

Press space (twice) or enter to get the command prompt

  3. Run a command inside the screen session
top
  4. Detach from the screen session
Press Ctrl+A, then D
  5. List screen sessions
screen -ls
  6. Reattach to the screen session
screen -r <screen_ID>
  7. Repeat step 4 to detach
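
Once you are comfortable with the steps above, naming a session makes it easier to find again than a numeric ID. The sketch below starts a command in a detached, named session using screen's `-d -m` and `-S` options. The function name `start_in_screen` and the session name `qc` are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: start a long-running command in a detached, named screen session.
# start_in_screen is a hypothetical helper name.

start_in_screen() {
  local name="$1"
  shift
  # -d -m starts the session detached; -S gives it a name
  screen -dmS "$name" "$@"
}

# Usage on the instance:
# start_in_screen qc fastqc ERR458494.fastq.gz
# screen -ls      # the session is listed under the name "qc"
# screen -r qc    # reattach by name
```

Note that a session started this way closes on its own when the command finishes, so reattach while the job is still running if you want to watch it.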

Checklist of things you learned today!

  • A little bit about AWS and cloud computing
  • How to launch an instance
  • How to connect to the instance
  • How to install and run a software program on the instance
  • How to terminate your instance

Upcoming CFDE workshops

Check our Events page for information on upcoming workshops!

You can contact us at training@cfde.atlassian.net with requests for new topics or questions about the workshops.

Additional Resources

FAQs

A note on data transfer costs

Data transfer between AWS and the Internet: Data transfer costs from AWS to the internet are highly dependent on the region. For example, for S3 buckets located in the US West (Oregon) region, the first GB/month is free and the next 9.999 TB/month cost $0.09 per GB. However, if the S3 buckets are located in the South America (São Paulo) region, the first GB/month is still free, but the next 9.999 TB/month cost $0.25 per GB.

More info here: https://www.apptio.com/blog/aws-data-transfer-costs/
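
To make the arithmetic concrete, the sketch below estimates a monthly transfer bill using only the two per-GB rates quoted above. It assumes the total stays within the first pricing tier (under 10 TB), and the rates are those stated on this page at the time of the workshop, so check current AWS pricing before relying on them. The function name `transfer_cost` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: estimate data transfer cost out of AWS within the first pricing tier.
# transfer_cost is a hypothetical helper; rates come from the text above.

transfer_cost() {
  local gb="$1" rate="$2"
  awk -v gb="$gb" -v rate="$rate" 'BEGIN {
    billable = gb - 1              # the first GB each month is free
    if (billable < 0) billable = 0
    printf "%.2f\n", billable * rate
  }'
}

transfer_cost 101 0.09   # US West (Oregon): prints 9.00
transfer_cost 101 0.25   # South America (São Paulo): prints 25.00
```

Downloading the same 101 GB therefore costs almost three times as much from São Paulo as from Oregon.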

Data storage https://aws.amazon.com/ebs/#/

What are the advantages of using AWS over an academic HPC?

  • Most universities don't have an HPC
  • No queues!
  • Can set up as many instances as you want (as long as you are willing to pay for it)
  • Can install anything without needing admin permissions
  • Almost no scheduled or unscheduled outages
  • Easier to set up
  • Easier to learn and get help on the internet
  • Costs more over time, but someone is paying for the HPC too!

But if you have a good HPC, please use it!

Can you set up multiple instances at once?

  • Yes!
  • There is a limit per account but it is a very large number and doesn't apply to most people

Can you launch more than one instance with the same configurations?

  • Yes, there is an option to do this on the instance set up page.
  • Look in the second tab!

Can you copy an instance or share an instance with collaborators?

  • Yes, but this is not as straightforward as it seems.
  • The way to clone an instance is via snapshots

Check out our AWS discussion board for FAQs and discussion. We encourage you to post a question there!
