Scale up Avni Cloud resources #56

Closed

1t5j0y opened this issue Jan 8, 2025 · 4 comments

1t5j0y commented Jan 8, 2025

Need

A new org with ~8000 users starts using Avni Cloud on Jan 21. Resources need to be scaled up to handle the anticipated load.
Training period - a 5-day period between Jan 21 and Feb 15
Regular usage - after Feb 15

Changes

AWS

Dry run

Jan 10 - Scale up prerelease to confirm there is no/minimal outage and no other issues during the scale-up, then scale back down to the current prerelease capacity.

Live

(The change is scheduled ahead of time so that CPU credits can accumulate before the load arrives and so the switch happens during a low-usage period. A rough sketch of the RDS change follows the list below.)
Jan 19 pm - Change Prod RDS read-write instance: db.t3.medium -> db.t3.2xlarge (8vCPU/32GB => 8x current)
Jan 19 pm - Change 'Prod machine' EC2 instance: t3.large -> t3.xlarge (4vCPU/16GB => 2x current)

Feb ? - Change Prod RDS read-write instance: db.t3.2xlarge -> db.t3.xlarge (4vCPU/16GB => 4x current)
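
For reference, a minimal boto3 sketch of the RDS instance-class change, assuming a hypothetical instance identifier and region (the actual change can equally be made from the AWS console or existing infra scripts):

```python
# Illustrative sketch only, not the actual Avni infra scripts.
# "avni-prod-db" and the region are hypothetical placeholders.
import boto3

rds = boto3.client("rds", region_name="ap-south-1")

# Jan 19: db.t3.medium -> db.t3.2xlarge. ApplyImmediately applies the change
# right away (RDS restarts the instance) instead of waiting for the next
# maintenance window; the same call with db.t3.xlarge covers the later downscale.
rds.modify_db_instance(
    DBInstanceIdentifier="avni-prod-db",
    DBInstanceClass="db.t3.2xlarge",
    ApplyImmediately=True,
)
```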

Application

avni-server

Tune memory params to leverage available resources.

  • Xmx5120m => Xmx12288m

Observe application performance.
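
As a rough sanity check of the new heap size against the upgraded t3.xlarge's 16 GiB of RAM (leaving a few GiB of headroom for the OS and off-heap memory is an assumption, not something stated in this issue):

```python
# Rough heap-sizing check for -Xmx12288m on a 16 GiB t3.xlarge (assumption:
# a few GiB should remain for the OS, metaspace and other off-heap memory).
instance_ram_mib = 16 * 1024   # t3.xlarge
heap_mib = 12288               # -Xmx12288m
headroom_mib = instance_ram_mib - heap_mib
print(f"heap = {heap_mib / instance_ram_mib:.0%} of RAM, {headroom_mib} MiB left over")
# -> heap = 75% of RAM, 4096 MiB left over
```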

github-project-automation bot moved this to New Issues in Avni Product Jan 8, 2025
1t5j0y moved this from New Issues to In Analysis Review in Avni Product Jan 8, 2025

himeshr commented Jan 8, 2025

One concern: going live with the 11.0 release during this time period.
And one suggestion from Vinay: use Gatling scripts to invoke the sync-details API for different org users and simulate load from a large number of users during dry-run testing.
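
For illustration, a small Python stand-in for those Gatling scripts (this is not the actual Gatling simulation; the host, endpoint path, request body and auth header below are all assumptions):

```python
# Minimal load-simulation sketch in Python, standing in for the Gatling scripts.
# BASE_URL, ENDPOINT, the request body and the auth header name are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import time

import requests

BASE_URL = "https://prerelease.example.org"        # assumed prerelease host
ENDPOINT = "/syncDetails"                          # assumed path of the sync-details API
TOKENS = ["org-user-1-token", "org-user-2-token"]  # one auth token per simulated org user


def one_sync(token: str) -> float:
    """Fire one sync-details request and return its latency in seconds."""
    start = time.monotonic()
    resp = requests.post(
        f"{BASE_URL}{ENDPOINT}",
        json=[],                            # assumed (empty) request body
        headers={"AUTH-TOKEN": token},      # assumed header name
        timeout=60,
    )
    resp.raise_for_status()
    return time.monotonic() - start


# Simulate 18 users syncing in parallel and report per-request latency.
with ThreadPoolExecutor(max_workers=18) as pool:
    latencies = list(pool.map(one_sync, TOKENS * 9))
print(f"avg {sum(latencies) / len(latencies):.2f}s, max {max(latencies):.2f}s")
```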


1t5j0y commented Jan 10, 2025

Dry Run

Scaled DB from t3.medium -> t3.large; EC2 from t3.medium -> t3.large and Xmx2048m

  • RDS restarts the instance to apply this change, and the database is unavailable for around 3 mins
  • The EC2 instance needs to be stopped to change its type - stop, apply change, restart, and deploy config change take roughly 10 mins

Expected downtime: 10-15 mins.
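
For reference, a boto3 sketch of the EC2 part of that sequence (the instance id and region are hypothetical placeholders):

```python
# Illustrative stop -> change type -> start sequence for the EC2 resize.
# The instance id and region are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")
instance_id = "i-0123456789abcdef0"

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# The instance type can only be changed while the instance is stopped.
ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "t3.large"})

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
# Followed by the application config change and deploy (~10 mins end to end).
```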

Performance

Ran a few performance simulations using Gatling with users from goonj and apfodisha. We get performance similar to current prod for 18 users syncing in parallel (peak: 18 req/sec (~1000/min); avg: 10 req/sec), which is roughly 3-4x our prod load. (Note: we will be using higher-spec machines for prod, so performance should scale accordingly.)
Doubling the above load increases the time taken by the slower APIs by 3-4x.

APIs taking longer: syncDetails, locationMapping (confirmed known bottlenecks).

Also, the prerelease DB was continuously under heavy load - diagnosed as database sync activity from the metabase green instance and ETL for rwb group orgs. Both have been disabled.

1t5j0y added a commit that referenced this issue Jan 18, 2025

1t5j0y commented Jan 18, 2025

Prod scaled up as planned. Downtime: 22:04 to 22:12.

1t5j0y self-assigned this Jan 18, 2025
1t5j0y moved this from In Analysis Review to In Progress in Avni Product Jan 18, 2025
1t5j0y added a commit that referenced this issue Feb 27, 2025

1t5j0y commented Feb 27, 2025

Prod RDS downscaled back to db.t3.medium
'Prod Machine' EC2 instance downscaled back to t3.large

1t5j0y closed this as completed Feb 27, 2025
github-project-automation bot moved this from In Progress to Done in Avni Product Feb 27, 2025