
Make maintenance-mode more bulletproof #186

Open · nathanielrindlaub opened this issue Apr 26, 2024 · 3 comments

@nathanielrindlaub (Member)

When we are deploying major changes to prod and need to shut down inputs temporarily, we currently set both the ingestion Lambda and the frontend into maintenance mode. For the ingestion Lambda, `MAINTENANCE_MODE: true` pauses the creation of new image records when images are uploaded to the ingestion bucket and instead routes those images to a "parking-lot" bucket. They live there until we've completed the updates and set maintenance mode back to `false`, at which point we can move them back to the ingestion bucket for processing.
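(For context, the maintenance-mode branch of the ingestion Lambda could look roughly like the sketch below — TypeScript with AWS SDK v3. The bucket name matches the prod parking-lot bucket mentioned later in this thread, but the copy-then-delete mechanics here are illustrative, not the actual animl-ingest code.)

```typescript
// Illustrative sketch only — not the actual animl-ingest handler.
import { S3Client, CopyObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3';
import type { S3Event } from 'aws-lambda';

const s3 = new S3Client({});
const PARKING_LOT_BUCKET = 'animl-images-parkinglot-prod';

export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const srcBucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    if (process.env.MAINTENANCE_MODE === 'true') {
      // Park the image instead of creating an image record; it gets moved
      // back to the ingestion bucket manually after the deploy completes.
      await s3.send(new CopyObjectCommand({
        Bucket: PARKING_LOT_BUCKET,
        CopySource: `${srcBucket}/${encodeURIComponent(key)}`,
        Key: key,
      }));
      await s3.send(new DeleteObjectCommand({ Bucket: srcBucket, Key: key }));
      continue;
    }

    // ... normal path: create image record and kick off inference ...
  }
};
```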

When the frontend is in maintenance mode, a splash-screen is displayed that prevents users from accessing the app.

This works ok, but it's not perfect, as we learned today. There are two main problems:

  1. If the frontend is already loaded in a browser tab on a user's computer and they haven't refreshed it, they will still be able to access and interact with the frontend (edit labels, initiate bulk uploads) until they refresh the page and their cached files are updated with `MAINTENANCE_MODE: true`. So we need to figure out some way to force the user to refresh the page, perhaps by using Cognito to log out all users at once. Another idea might be to set up a maintenance mode for the GraphQL API, so that even if a user has access to the frontend, any actions they take would get rejected by the API.
  2. Users may have initiated bulk uploads before we set the ingestion Lambda into maintenance mode. If the zip was received and the batch job was started before we turned on maintenance mode, the batch job would validate and unzip those images, then move them to the ingestion bucket one by one, at which point the ingestion Lambda would move them to the parking-lot bucket (because it's now in maintenance mode), and the images would sit there with S3 keys that look like `<batchId>/path/to/image.jpg`. That is fine until we manually move them from the parking-lot bucket back to the ingestion bucket: because there's a batchId in the key, Animl assumes each image is part of a batch, but depending on how much time has elapsed, that batch's corresponding SQS queues may have been torn down already, so inference would fail.

For now, I think the low-tech solution to that issue will be to add a step to our production deployment workflows: manually check the batch logs and the DB to make sure there aren't any fresh uploads that are in progress but haven't yet been fully unzipped. In the DB, those batches would have a `created: <date_time>` property but wouldn't yet have `uploadComplete`, `processingStart`, or `ingestionComplete` fields (see the sketch below). I'm not sure what a less manual approach might look like; I'd have to think some more on that.
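A quick script to surface those in-flight batches might look something like this — the `batches` collection and db name are assumptions for illustration, but the field names are the ones above:

```typescript
// Hypothetical helper to list in-flight batches; not actual animl-api code.
import { MongoClient } from 'mongodb';

async function findInFlightBatches(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    // Collection/db names are assumptions; adjust to the real schema.
    const batches = client.db('animl').collection('batches');
    // Created, but not yet fully uploaded, unzipped, or ingested.
    const inFlight = await batches.find({
      created: { $exists: true },
      uploadComplete: { $exists: false },
      processingStart: { $exists: false },
      ingestionComplete: { $exists: false },
    }).toArray();
    console.log(`${inFlight.length} in-flight batch(es):`, inFlight.map((b) => b._id));
  } finally {
    await client.close();
  }
}
```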

@nathanielrindlaub (Member, Author)

@jue-henry looked into how we could use Cognito to log users out, but I am not sure we even need to log them out... we really just need to force a page refresh when the frontend is in maintenance mode. I think a solution could be:

  1. Create a Maintenance Mode parameter in the SSM Parameter Store, which both animl-ingest and animl-api could retrieve at runtime instead of having it hard-coded and requiring a deploy to change.
  2. If animl-api is in maintenance mode, throw an error early for all /external calls that indicates it's in maintenance mode.
  3. On the frontend, check for that error on each call, and if it's detected, force a page reload (see the sketch after this list).
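A sketch of what steps 1 and 2 could look like in animl-api — the parameter name and error code are placeholders, not decided names:

```typescript
// Illustrative sketch only; parameter name and error code are hypothetical.
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';
import { GraphQLError } from 'graphql';

const ssm = new SSMClient({});
let cached: { value: boolean; fetchedAt: number } | null = null;

async function inMaintenanceMode(): Promise<boolean> {
  // Cache for 60s so we don't hit SSM on every request.
  if (cached && Date.now() - cached.fetchedAt < 60_000) return cached.value;
  const res = await ssm.send(
    new GetParameterCommand({ Name: '/animl/maintenance-mode' })
  );
  cached = { value: res.Parameter?.Value === 'true', fetchedAt: Date.now() };
  return cached.value;
}

// Called at the top of each /external resolver (or in context creation):
export async function assertNotInMaintenanceMode(): Promise<void> {
  if (await inMaintenanceMode()) {
    throw new GraphQLError('Service is in maintenance mode', {
      extensions: { code: 'MAINTENANCE_MODE' },
    });
  }
}
```

The frontend could then inspect GraphQL errors for `extensions.code === 'MAINTENANCE_MODE'` and call `window.location.reload()`, which would pick up the cached splash-screen build.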

So the workflow for setting the app into maintenance mode would be:

  1. Set the hard-coded maintenance-mode variable to `true` in the frontend config, deploy to prod, and clear the CloudFront cache.
  2. Check batch logs and the DB for any fresh uploads that are in progress but haven't yet been unzipped.
  3. Set the SSM maintenance-mode param to `true`.
  4. Wait for messages in ALL SQS queues to wind down to zero (i.e., if there's currently a bulk upload job being processed, wait for it to finish).
  5. Back up the prod DB by running `npm run export-db-prod` from the animl-api project root.
  6. Deploy animl-api to prod.
  7. Turn off `IN_MAINTENANCE_MODE` in SSM first, then in animl-frontend (deploy the frontend to prod and clear the CloudFront cache).
  8. Copy any images that happened to land in animl-images-parkinglot-prod while the stacks were being deployed over to animl-images-ingestion-prod, then delete them from the parking-lot bucket (example commands below).
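For steps 3, 7, and 8, the SSM toggle and the parking-lot move could be done with the AWS CLI along these lines — the parameter name is hypothetical, while the bucket names are the ones above:

```
# Step 3: turn maintenance mode on (parameter name is hypothetical)
aws ssm put-parameter --name /animl/maintenance-mode --value true --overwrite

# Step 7: turn it back off
aws ssm put-parameter --name /animl/maintenance-mode --value false --overwrite

# Step 8: move parked images back to the ingestion bucket (copies, then deletes)
aws s3 mv s3://animl-images-parkinglot-prod s3://animl-images-ingestion-prod --recursive
```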

