Scale up Avni Cloud resources #56

Closed

1t5j0y opened this issue Jan 8, 2025 · 4 comments

1t5j0y commented Jan 8, 2025

Need

A new org with ~8000 users starts using Avni Cloud on Jan 21. Resources need to be scaled up to handle the anticipated load.
Training period - a 5-day period between Jan 21 and Feb 15
Regular usage - after Feb 15

Changes

AWS

Dry run

Jan 10 - Scale up prerelease to confirm there is no/minimal outage and no other issues during the scale-up, then scale back down to the current prerelease capacity.

Live

(The change is scheduled ahead of time so that CPU credits can accumulate before the load arrives and so the switch happens during a low-usage period. A rough sketch of the RDS change follows the list below.)
Jan 19 pm - Change Prod RDS read-write instance: db.t3.medium -> db.t3.2xlarge (8vCPU/32GB => 8x current)
Jan 19 pm - Change 'Prod machine' EC2 instance: t3.large -> t3.xlarge (4vCPU/16GB => 2x current)

Feb ? - Change Prod RDS read-write instance: db.t3.2xlarge -> db.t3.xlarge (4vCPU/16GB => 4x current)
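
For reference, a minimal boto3 sketch of the RDS instance-class change, assuming a hypothetical instance identifier and region (the actual change can equally be made from the AWS console or existing infra scripts):

```python
# Illustrative sketch only, not the actual Avni infra scripts.
# "avni-prod-db" and the region are hypothetical placeholders.
import boto3

rds = boto3.client("rds", region_name="ap-south-1")

# Jan 19: db.t3.medium -> db.t3.2xlarge. ApplyImmediately applies the change
# right away (RDS restarts the instance) instead of waiting for the next
# maintenance window; the same call with db.t3.xlarge covers the later downscale.
rds.modify_db_instance(
    DBInstanceIdentifier="avni-prod-db",
    DBInstanceClass="db.t3.2xlarge",
    ApplyImmediately=True,
)
```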

Application

avni-server

Tune memory params to leverage available resources.

  • Xmx5120m => Xmx12288m

Observe application performance.
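
As a rough sanity check of the new heap size against the upgraded t3.xlarge's 16 GiB of RAM (leaving a few GiB of headroom for the OS and off-heap memory is an assumption, not something stated in this issue):

```python
# Rough heap-sizing check for -Xmx12288m on a 16 GiB t3.xlarge (assumption:
# a few GiB should remain for the OS, metaspace and other off-heap memory).
instance_ram_mib = 16 * 1024   # t3.xlarge
heap_mib = 12288               # -Xmx12288m
headroom_mib = instance_ram_mib - heap_mib
print(f"heap = {heap_mib / instance_ram_mib:.0%} of RAM, {headroom_mib} MiB left over")
# -> heap = 75% of RAM, 4096 MiB left over
```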

github-project-automation bot moved this to New Issues in Avni Product Jan 8, 2025
1t5j0y moved this from New Issues to In Analysis Review in Avni Product Jan 8, 2025

himeshr commented Jan 8, 2025

One concern: going live with the 11.0 release during this time period.
And one suggestion from Vinay: use Gatling scripts to invoke the sync-details API for different org users and simulate load from a large number of users during dry-run testing.
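
For illustration, a small Python stand-in for those Gatling scripts (this is not the actual Gatling simulation; the host, endpoint path, request body and auth header below are all assumptions):

```python
# Minimal load-simulation sketch in Python, standing in for the Gatling scripts.
# BASE_URL, ENDPOINT, the request body and the auth header name are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import time

import requests

BASE_URL = "https://prerelease.example.org"        # assumed prerelease host
ENDPOINT = "/syncDetails"                          # assumed path of the sync-details API
TOKENS = ["org-user-1-token", "org-user-2-token"]  # one auth token per simulated org user


def one_sync(token: str) -> float:
    """Fire one sync-details request and return its latency in seconds."""
    start = time.monotonic()
    resp = requests.post(
        f"{BASE_URL}{ENDPOINT}",
        json=[],                            # assumed (empty) request body
        headers={"AUTH-TOKEN": token},      # assumed header name
        timeout=60,
    )
    resp.raise_for_status()
    return time.monotonic() - start


# Simulate 18 users syncing in parallel and report per-request latency.
with ThreadPoolExecutor(max_workers=18) as pool:
    latencies = list(pool.map(one_sync, TOKENS * 9))
print(f"avg {sum(latencies) / len(latencies):.2f}s, max {max(latencies):.2f}s")
```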


1t5j0y commented Jan 10, 2025

Dry Run

Scaled DB from t3.medium -> t3.large; EC2 from t3.medium -> t3.large and Xmx2048m

  • RDS restarts the instance to apply this change, and the database is unavailable for around 3 mins
  • The EC2 instance needs to be stopped to change its type - stop, apply change, restart, and deploy config change take roughly 10 mins

Expected downtime: 10-15 mins.
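
For reference, a boto3 sketch of the EC2 part of that sequence (the instance id and region are hypothetical placeholders):

```python
# Illustrative stop -> change type -> start sequence for the EC2 resize.
# The instance id and region are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")
instance_id = "i-0123456789abcdef0"

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# The instance type can only be changed while the instance is stopped.
ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "t3.large"})

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
# Followed by the application config change and deploy (~10 mins end to end).
```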

Performance

Ran a few performance simulations using Gatling with users from goonj and apfodisha. We get performance similar to current prod for 18 users syncing in parallel (peak: 18 req/sec (~1000/min); avg: 10 req/sec), which is roughly 3-4x our prod load. (Note: we will be using higher-spec machines for prod, so performance should scale accordingly.)
Doubling the above load increases the time taken by the slower APIs by 3-4x.

APIs taking longer: syncDetails, locationMapping (confirmed known bottlenecks).

Also, the prerelease DB was continuously under heavy load - diagnosed as database sync activity from the metabase green instance and ETL for rwb group orgs. Both have been disabled.

1t5j0y added a commit that referenced this issue Jan 18, 2025

1t5j0y commented Jan 18, 2025

Prod scaled up as planned. Downtime: 22:04 to 22:12.

1t5j0y self-assigned this Jan 18, 2025
1t5j0y moved this from In Analysis Review to In Progress in Avni Product Jan 18, 2025
1t5j0y added a commit that referenced this issue Feb 27, 2025

1t5j0y commented Feb 27, 2025

Prod RDS downscaled back to db.t3.medium
'Prod Machine' EC2 instance downscaled back to t3.large

1t5j0y closed this as completed Feb 27, 2025
github-project-automation bot moved this from In Progress to Done in Avni Product Feb 27, 2025