Architecting to Scale Module
- Concepts
- Architectural Patterns
- Loosely Coupled Architecture (LCA)
- Components can stand independently and require little or no knowledge of the inner workings of the other components.
- Benefits of LCA
- Layers of abstraction
- Permits more flexibility
- Interchangeable components
- More atomic or isolated functional units
- Can scale components independently
- Types of Scaling
- Horizontal Scaling (Scale Out)
- Add more instances as demand increases
- No downtime required to Scale Out or Scale In
- Automatic using Auto-Scaling groups
- Theoretically Unlimited on AWS
- Cost Effective
- Vertical Scaling (Scale Up)
- Add more CPU and/or RAM to existing instance as demand increases
- Requires restart to scale up or down
- Would require scripting to automate
- Limited by instance sizes
- Auto-Scaling Groups
- Automatically provides horizontal scaling (scale-out) as required by the demand
- Triggered by an event or scaling action to either launch or terminate instances
- Availability, Cost and System metrics can all factor into scaling.
- Four Scaling Options
- Maintain
- Keep a specific or minimum number of instances running
- What: Hands-off way to maintain X number of instances
- When: I need 3 instances always
- Manual
- Use maximum, minimum or specific number of instances
- What: Manually change desired capacity via console or CLI
- When: My needs change so rarely that I can just manually add and remove
- Schedule
- Increase or Decrease instances based on Schedule
- What: Adjust min/max instances based on specific times
- When: Every Monday morning, we get a rush on our website
- Dynamic
- Scale based on real-time metrics of the systems
- What: Scale in response to behavior of elements in the environment
- When: When CPU Utilisation gets to 70% on current instances, scale up
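As an illustration of the Schedule option above, here is a minimal boto3 sketch of a recurring scheduled action; the group name web-asg and the capacity values are assumptions, not part of the notes.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Every Monday at 07:00 UTC, raise capacity ahead of the weekly rush.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",            # assumed existing Auto Scaling group
    ScheduledActionName="monday-morning-rush",
    Recurrence="0 7 * * MON",                  # cron expression, evaluated in UTC
    MinSize=3,
    MaxSize=10,
    DesiredCapacity=6,
)
```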
- Launch Configurations
- Template used in an Auto Scaling group which defines
- type of EC2 instance
- Specify VPC and Subnets for scaled instances
- Attach to a ELB
- Define a Health Check Grace Period
- Define size of the group to stay at an initial size
- Or use a scaling policy which can be based on metrics (CPU, ALB Request Count Per Target, Network Traffic In/Out)
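A minimal sketch of a launch configuration plus the Auto Scaling group that references it; the AMI, subnet and target group identifiers are placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# The template: which AMI and instance type scaled-out instances will use.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-lc",
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",
)

# The group: which VPC subnets to launch into, initial size, ELB attachment and health checks.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,                                  # initial size the group maintains
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",    # placeholder subnets
    TargetGroupARNs=["arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/abc123"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```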
- Scaling Policies
- Target Tracking Policy
- What: Scale based on a predefined or custom metric in relation to a target value
- When: When CPU Utilisation gets to 70% on current instances, scale up
- Simple Scaling Policy
- What: Waits until the health check and cooldown period expire before evaluating new need
- When: Let's add new instances slow and steady
- Step Scaling Policy
- What: Responds to scaling needs with more sophistication and logic
- When: AGG! Add ALL instances
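The target tracking example above ("scale when CPU reaches 70%") maps to a policy like the following sketch; web-asg is an assumed group name.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU across the group near 70%; scaling out and in is handled automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,
    },
)
```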
- Scaling Cooldowns
- Configurable duration that gives scaling a chance to "come up to speed" and absorb load
- Default cooldown period is 300 seconds
- Automatically applies to dynamic scaling and optionally to manual scaling but not supported for scheduled scaling
- Can override the default cooldown via a scaling-specific cooldown
- Review examples
- Kinesis
- Collection of services for processing streams of various data
- Data is processed in "shards" - with each shard able to ingest 1000 records per second
- A default limit of 500 shards, but you can request an increase to unlimited shards
- Shards allow parallel processing of incoming data.
- Think of shards like lanes on a highway: more lanes, more traffic can pass through.
- The recommended method to read data from a shard is via the Kinesis Client Library (KCL)
- Record consists of Partition Key, Sequence Number and Data Blob (up to 1MB)
- Transient Data Store - Default Retention of 24 hours, but can be configured for up to 7 days
- Flavors for Kinesis
- Kinesis Video Streams
- Stream the data coming through from various sources, optimised to handle video data.
- Kinesis Data Streams
- Stream any data coming through from various sources
- Kinesis Firehose
- Receive data coming through from various sources and land it in destinations such as S3, Redshift or Elasticsearch, optionally transforming it with Lambda along the way
- Kinesis Data Analytics
- Apply analytics on incoming data immediately.
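A minimal producer sketch for Kinesis Data Streams, showing the record structure described above (partition key plus data blob); the stream name and key scheme are assumptions.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_record(
    StreamName="clickstream",                                            # assumed stream
    Data=json.dumps({"user": "u-42", "action": "page_view"}).encode(),   # data blob, up to 1MB
    PartitionKey="u-42",   # records with the same key hash to the same shard
)

# Kinesis reports which shard accepted the record and its sequence number.
print(response["ShardId"], response["SequenceNumber"])
```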
- DynamoDB Scaling
- Divided into two dimensions
- Throughput
- Read Capacity Units
- Write Capacity Units
- Size
- Max Item Size 400KB
- Terminology
- Partition: A physical space where DynamoDB data is stored
- Partition Key: A unique identifier for each record; sometimes called a Hash Key
- Sort Key: In combination with the Partition Key, a composite key can be created that defines storage order; sometimes called a Range Key
- Under the Hood
- DynamoDB scales by adding partitions.
- Partition Calculation
- By Capacity: (Total RCU / 3000) + (Total WCU / 1000)
- By Size: Total Size / 10 GB
- Total Partitions: Round Up for MAX (By Capacity, By Size)
- Example
- By Capacity: (2000 RCU / 3000) + (2000 WCU / 1000) = 2.66
- By Size: 10 GB / 10 GB = 1
- Total Partitions: MAX (2.66,1) = 2.66 Round Up = 3 Partitions
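The same calculation expressed as a small Python helper, reproducing the worked example above:

```python
import math

def dynamodb_partitions(rcu: int, wcu: int, size_gb: float) -> int:
    by_capacity = rcu / 3000 + wcu / 1000   # throughput one partition can serve
    by_size = size_gb / 10                  # each partition holds up to 10 GB
    return math.ceil(max(by_capacity, by_size))

# 2000 RCU, 2000 WCU, 10 GB -> max(2.66, 1) rounded up = 3 partitions
print(dynamodb_partitions(2000, 2000, 10))  # 3
```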
- Using Target Tracking method to try and stay close to Target Utilisation
- Currently does not scale down if the table's consumption drops to zero
- Workaround #1: Send requests to the table until it scales down
- Workaround #2: Manually reduce the max capacity to be the same as min capacity
- Also supports Global Secondary Indexes - think of them like a copy of the table just indexed or keyed differently
- CloudFront
- Can deliver content to users faster by caching static and dynamic content at edge locations
- Dynamic Content delivery is achieved using HTTP cookies forwarded from your origin
- Supports Adobe Flash Media Server's RTMP Protocol but have to choose RTMP delivery method
- Web distributions also support media streaming and live streaming but use HTTP or HTTPS
- Origins can be S3, EC2, ELB or another web server
- Multiple origins can be configured
- Use Behaviors to configure serving up origin content based on URL paths
- Example: A CloudFront CDN can be configured to serve static content from an S3 bucket and dynamic content via an ELB backed by an EC2 fleet.
- Invalidation Requests
- There are several ways to invalidate a CloudFront cache
- Simply delete the file from the origin and wait for the TTL to expire. TTL is configurable
- Use the AWS console to request invalidation for all content or a specific path such as /images/*
- Use the CloudFront API to submit an invalidation request
- Use Third-party tools to perform CloudFront invalidation (CloudBerry, YLastic, CDN Planet, CloudFront Purge Tool)
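A sketch of the API-based invalidation option using boto3; the distribution ID is a placeholder.

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

# Invalidate every cached object under /images/* on the distribution.
cloudfront.create_invalidation(
    DistributionId="E1EXAMPLE12345",   # placeholder distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/images/*"]},
        "CallerReference": str(time.time()),   # must be unique per request
    },
)
```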
- CloudFront Supports
- Zone Apex DNS entries (that is without www or any subdomains in front of the URL)
- Geo Restriction: Whitelist or blacklist countries that can access the CloudFront content
- Amazon Simple Notification Service (SNS)
- Enables a Publish/Subscribe design pattern
- Topics: A channel for publishing a notification, consider this as an outbox
- Subscription: Configuring an endpoint to receive messages published on the topic
- Endpoint protocols include HTTP(S), email, SMS, SQS, Amazon Device Messaging (push notifications) and Lambda
- Fan Out Architecture
- User Uploads an item ->
- S3 bucket ->
- SNS ->
- Image Upload Topic ->
- SES -> Thank you email
- SQS -> Image Resize Queue
- Lambda -> Amazon Rekognition
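A sketch of the fan-out pattern above with boto3: one topic, multiple subscription endpoints, one publish. The ARNs and email address are placeholders, and a real SQS subscription also needs a queue policy allowing SNS to deliver (omitted here).

```python
import boto3

sns = boto3.client("sns")

topic_arn = sns.create_topic(Name="image-upload")["TopicArn"]

# Every subscriber receives its own copy of each message published to the topic.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs",
              Endpoint="arn:aws:sqs:us-east-1:111122223333:image-resize-queue")
sns.subscribe(TopicArn=topic_arn, Protocol="lambda",
              Endpoint="arn:aws:lambda:us-east-1:111122223333:function:rekognition-tagger")
sns.subscribe(TopicArn=topic_arn, Protocol="email",
              Endpoint="uploads@example.com")

sns.publish(TopicArn=topic_arn, Message='{"bucket": "uploads", "key": "cat.jpg"}')
```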
- Amazon Simple Queue Service (SQS)
- Reliable, highly-scalable, hosted message queuing service.
- Available integration with KMS for encrypted messaging.
- Transient Storage - default 4 days, max 14 days
- Optionally supports First-in First-out queueing order.
- Maximum message size of 256KB, but by using the SQS Extended Client Library for Java, messages can be as large as 2GB
- Stores the message on S3 and creates an SQS message as a pointer to the message
- A significant benefit of SQS is that it helps create loosely coupled architectures.
- Standard Queue vs FIFO queue
- Standard Queue: Order is not guaranteed.
- FIFO Queue: Order is guaranteed, first in, first out
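A producer/consumer sketch showing the store-and-forward pattern; the queue name and message body are assumptions.

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="image-resize-queue")["QueueUrl"]

# Producer side: drop a message on the queue and move on.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"key": "cat.jpg"}')

# Consumer side: long-poll for work, process it, then delete the message.
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=10).get("Messages", [])
for msg in messages:
    print("processing", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```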
- Amazon MQ
- Managed implementation of Apache's ActiveMQ
- Fully Managed and highly available within a region
- Fully supports ActiveMQ API, JMS, NMS, MQTT, WebSocket etc
- Designed as a drop-in replacement for on-premises message brokers
- Use SQS if a new application is created from scratch.
- Use MQ if migrating existing message brokers to AWS
- Amazon Lambda
- Allows users to run code on demand without provisioning or managing infrastructure
- Supports Node.js, Python, Java, Go and C#
- Code is stateless and executed on an event basis (SNS, SQS, S3, DynamoDB Streams etc)
- No fundamental limits to scaling a function since AWS dynamically allocates capacity in relation to events
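A minimal sketch of a stateless handler triggered by an S3 event notification; the processing itself is just a placeholder print.

```python
def lambda_handler(event, context):
    """Invoked by AWS on each S3 event; no servers or capacity to manage."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")   # placeholder for real processing
    return {"status": "ok"}
```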
- Simple Workflow Service (SWF)
- Create distributed asynchronous systems as workflows
- Supports both sequential and parallel processing.
- Tracks the state of the workflow, which you interact with and update via the API
- Best suited for human-enabled workflows like order fulfilment or procedural requests
- AWS recommends that new applications look at Step Functions over SWF
- Components for SWF
- Activity Worker
- A program that interacts with the AWS SWF service to get tasks, process tasks and return results
- Decider
- A program that controls coordination of tasks such as their ordering, concurrency and scheduling
- AWS Step Functions
- Managed workflow and orchestration platform
- Scalable and highly available
- Define app as a state machine
- Create tasks, sequential steps, parallel steps, branching paths or timers.
- Amazon States Language: declarative JSON used to configure and document the steps of the state machine
- Apps can interact with and update the state machine via the Step Functions API
- Visual Interface that helps describe the flow and real time status
- Detailed logs captured for all the steps
- Finite State Machine
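A sketch of a two-task state machine expressed in Amazon States Language and registered via the Step Functions API; the Lambda ARNs and IAM role are placeholders.

```python
import json
import boto3

definition = {
    "StartAt": "ResizeImage",
    "States": {
        "ResizeImage": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:resize",
            "Next": "NotifyUser",
        },
        "NotifyUser": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:notify",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="image-pipeline",
    definition=json.dumps(definition),   # declarative JSON state machine
    roleArn="arn:aws:iam::111122223333:role/StepFunctionsExecutionRole",
)
```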
- AWS Batch
- Management tool for creating, managing and executing batch-oriented tasks using EC2 instances
- Create a Compute Environment: Managed or Unmanaged, Spot or On-Demand, vCPUs
- Create a Job Queue with priority assigned to Compute Environment
- Create a Job Definition: Script or JSON, environment variables, mount points, IAM role, container image etc.
- Schedule the Job
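Once the compute environment, job queue and job definition above exist, scheduling a run reduces to submitting a job; the names below are placeholders.

```python
import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="rotate-logs-2024-01-01",
    jobQueue="nightly-queue",          # queue attached to the compute environment
    jobDefinition="rotate-logs:1",     # job definition name:revision
    containerOverrides={
        "environment": [{"name": "LOG_DIR", "value": "/var/log/firewall"}],
    },
)
```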
- Comparisons
- Step Functions
- When: Out-of-the-box coordination of AWS service components
- Use Case: Order processing flow
- Simple Work Flow service
- When: Need to support external processes or specialised execution logic.
- Use Case: Loan Application Process with manual review steps
- Note: This is being phased out and Amazon is pushing towards adopting Step Functions.
- Simple Queue Service
- When: Message Queue; Store and Forward Patterns
- Use Case: Image resize process
- AWS Batch
- When: Scheduled or recurring tasks that don't require heavy logic
- Use Case: Rotate Logs daily on Firewall Appliance
- Elastic MapReduce
- This isn't one single product; rather, it is a collection of open-source products.
- EMR Core
- Hadoop HDFS: The filesystem that the data gets stored in; conducive to data analytics and data manipulation.
- Hadoop MapReduce: Actual Framework to do the processing of the data
- EMR Management
- ZooKeeper: Involved in resource coordination.
- Oozie: Workflow Framework
- Apache Pig: Scripting Framework
- Hive: SQL Interface to Hadoop Landscape
- Mahout: Machine Learning Component
- Apache HBase: Columnar Database for storing Hadoop data
- Flume: Very helpful for ingesting Application and System logs
- Sqoop: Facilitates the import of data into Hadoop from other databases or data sources
- Ambari: Management and Monitoring console
- Enterprise Support, Professional Support and Project Contributions
- Hortonworks
- Cloudera
- AWS Elastic MapReduce (EMR)
- Managed Hadoop Framework for processing huge amounts of data
- Also supports Apache Spark, HBase, Presto and Flink
- Most commonly used for log analysis, financial analysis or extract, transform and load (ETL) activities.
- A Step is a programmatic task for performing some process on the data (eg. count words)
- A cluster is a collection of EC2 instances provisioned by EMR to run the Steps
- Components of EMR
- Master Node - Manages the cluster and coordinates the distribution of data and Steps to the other nodes
- Core Nodes - Hold the data for processing
- Task Nodes - Worker nodes with ephemeral storage (no persistent storage of data); work on the Steps
- AWS EMR Process Example
- Step 1: Hive Script
- Step 2: Custom Jar
- Step 3: Pig Script
- Step 4: Output to S3 bucket
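A sketch of adding a Step to a running EMR cluster; a Spark job is used for illustration (EMR also supports Spark per the notes above), and the cluster ID and S3 path are placeholders.

```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLE1234567",          # placeholder cluster ID
    Steps=[{
        "Name": "word-count",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",   # runs an arbitrary command on the cluster
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/word_count.py"],
        },
    }],
)
```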
- Exam Tips
- Auto Scaling Groups
- Know the different scaling options and policies
- Understand the difference and limitations between horizontal and vertical scaling
- Know what the cooldown period (not the same as the health check grace period) is and how it impacts the responsiveness to demand
- Kinesis
- Exam is likely to be restricted to the data streaming use cases for Kinesis, such as Data Streams and Firehose
- Understand the shard concept and how partition keys and sequence numbers enable shards to manage data flow
- DynamoDB Autoscaling
- Know the new and old terminology and concept of a partition, partition key and sort key in the context of DynamoDB
- Understand how DynamoDB calculates total partitions and allocates RCU and WCU across available partitions
- Conceptually know how data is stored across partitions
- CloudFront
- Know that both static and dynamic content is supported
- Understand possible origins and how multiple origins can be used together with Behaviors
- Know invalidation methods, zone apex and geo-restriction options
- SNS
- Understand a loosely coupled architecture and benefits it brings
- Know the different types of subscription end points supported
- SQS
- Know the difference between
- Standard and FIFO Queues
- Pub/Sub (SNS) and Message Queueing (SQS) architecture
- Lambda
- Know what "serverless" is in concept and how Lambda can facilitate such an architecture
- Know the languages supported by Lambda
- SWF
- Understand the difference and functions of a Worker and Decider
- Best suited for human-enabled workflows like order fulfillment or procedural requests
- Elastic MapReduce
- Understand the parts of a Hadoop landscape at a high-level
- Know what a Cluster is and what steps are
- Understand the roles of a Master Node, Core Nodes and Task Nodes.
- Whitepapers
- Web Application Hosting in the AWS Cloud
- Introduction to Scalable Gaming Patterns on AWS
- Performance at Scale with Amazon ElastiCache
- Automating Elasticity
- Videos
- Scaling up to your First 10 Million Users
- Learning to build a cloud-scale WordPress site that can keep up with Unpredictable Changes and Capacity Demands
- Elastic Load Balancing Deep Dive and Best Practices
- Pro Tips
- Elasticity drives most of the benefit from the cloud, such as cost savings and agility
- Think Cloud-First designs if you're in a Green Field scenario even if you are deploying on-prem
- Most modern data centres are moving towards modular workloads to support things like Docker and Cloud Foundry
- These things allow us to scale horizontally just like in the cloud
- If in "Brown Field" situation, create roadmaps for cloud-first enablers like distributed applications, federated data and SOA
- Be careful to not let elasticity cover for poor development methods
- Microservice concepts help achieve scalability via decoupling, simplification and separation of concerns