Async task state tracking #148

Open
6 of 10 tasks
nathanielrindlaub opened this issue Feb 12, 2024 · 4 comments · Fixed by #173

nathanielrindlaub commented Feb 12, 2024

We now have a growing number of potentially long-running tasks that Lambdas are not well suited to support, especially if the user is expecting a synchronous response. These include:

@ingalls recommends that we consider creating a consistent pattern for tracking these async tasks in the DB (creating a collection or collections that we update when the state of one of these processes changes and a consistent query pattern for accessing them). I think it's a great idea... right now we have 2 entirely different ways of checking the bulk upload state and the annotation state, and for most of the others we haven't yet implemented spinning the tasks off on separate infrastructure; instead, we only support them at pretty low thresholds.
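
A minimal sketch of what a shared task record and its state transitions might look like. Everything here is an assumption for illustration: the `Task` shape, the status names, and the in-memory `TaskStore` (standing in for a DB collection) are hypothetical and not the project's actual schema.

```typescript
// Hypothetical status values; the real collection may use different names.
type TaskStatus = "SUBMITTED" | "RUNNING" | "COMPLETE" | "FAIL";

// Hypothetical task document: one record per long-running operation.
interface Task {
  id: string;
  type: string; // e.g. "GetStats"
  status: TaskStatus;
  createdAt: Date;
  updatedAt: Date;
  output?: unknown; // result payload once the task completes
  error?: string; // failure message if the task fails
}

// Which status changes are allowed from each state.
const TRANSITIONS: Record<TaskStatus, TaskStatus[]> = {
  SUBMITTED: ["RUNNING", "FAIL"],
  RUNNING: ["COMPLETE", "FAIL"],
  COMPLETE: [],
  FAIL: [],
};

// In-memory stand-in for a tasks collection, so the sketch runs on its own.
class TaskStore {
  private tasks = new Map<string, Task>();
  private nextId = 1;

  // Called when the user kicks off an operation; returns immediately.
  create(type: string): Task {
    const now = new Date();
    const task: Task = {
      id: String(this.nextId++),
      type,
      status: "SUBMITTED",
      createdAt: now,
      updatedAt: now,
    };
    this.tasks.set(task.id, task);
    return task;
  }

  // Called by the worker whenever the state of the process changes.
  update(
    id: string,
    status: TaskStatus,
    patch: { output?: unknown; error?: string } = {}
  ): Task {
    const task = this.tasks.get(id);
    if (!task) throw new Error(`Unknown task ${id}`);
    if (TRANSITIONS[task.status].indexOf(status) === -1) {
      throw new Error(`Invalid transition ${task.status} -> ${status}`);
    }
    const updated: Task = { ...task, ...patch, status, updatedAt: new Date() };
    this.tasks.set(id, updated);
    return updated;
  }

  // The single query pattern clients poll, regardless of task type.
  get(id: string): Task | undefined {
    return this.tasks.get(id);
  }
}
```

The point of the sketch is that every task type shares one record shape and one lookup (`get`), so the client can poll the same way whether it is waiting on an export or a stats calculation.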

For the tasks we haven't yet broken out to run async, we have a couple options:

  1. Using AWS Batch (same process as batch upload) w/ Fargate. This would be more expensive and have a much longer cold-start, but it would mean that we wouldn’t be limited to Lambda’s 15 min limit
  2. Create SQS queues + Lambda workers that pull messages off the queues and process them in separate Lambdas. Nick's main concern here is how to make the UX make sense for tasks that run longer than 15 mins (and would require re-prompting/initiating by the user)
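
Option 2 could look roughly like the sketch below: a worker that reads task messages from an SQS-style event, dispatches on task type, and reports state changes. The `Records[].body` shape mirrors how SQS events are delivered to Lambda, but the local interfaces, the `TaskMessage` fields, `makeWorker`, and the `setStatus` callback are all hypothetical names used for illustration.

```typescript
// Minimal local stand-ins for the SQS event shape (only the fields used here).
interface SqsRecordLike {
  messageId: string;
  body: string;
}
interface SqsEventLike {
  Records: SqsRecordLike[];
}

// Hypothetical message payload placed on the queue when a task is submitted.
interface TaskMessage {
  taskId: string;
  type: string;
  config: unknown;
}

type TaskHandler = (config: unknown) => Promise<unknown>;
type WorkerStatus = "RUNNING" | "COMPLETE" | "FAIL";
// Hypothetical callback that persists a status change (e.g. to a tasks collection).
type StatusSink = (
  taskId: string,
  status: WorkerStatus,
  detail?: unknown
) => Promise<void>;

// Builds a Lambda-style handler that routes each message to its task handler.
function makeWorker(
  handlers: Record<string, TaskHandler>,
  setStatus: StatusSink
): (event: SqsEventLike) => Promise<void> {
  return async function handler(event: SqsEventLike): Promise<void> {
    for (const record of event.Records) {
      const msg = JSON.parse(record.body) as TaskMessage;
      const run: TaskHandler | undefined = handlers[msg.type];
      if (!run) {
        await setStatus(msg.taskId, "FAIL", `No handler for ${msg.type}`);
        continue;
      }
      await setStatus(msg.taskId, "RUNNING");
      try {
        const output = await run(msg.config);
        await setStatus(msg.taskId, "COMPLETE", output);
      } catch (err) {
        // Record the failure instead of throwing, so the client can see it.
        const reason = err instanceof Error ? err.message : String(err);
        await setStatus(msg.taskId, "FAIL", reason);
      }
    }
  };
}
```

This does nothing about the 15 minute ceiling itself; a task that cannot finish inside one invocation would still need to be split up or moved to something like Batch.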

postfalk commented Feb 12, 2024 via email

@nathanielrindlaub nathanielrindlaub mentioned this issue Mar 12, 2024
@nathanielrindlaub

@ingalls, I was just made aware that creating, updating, or deleting deployments on cameras with large numbers of images (like 20k) will time out, and the deployment update will fail. This is because ProjectModel.reMapImagesToDeps() has to iterate over every image in the camera to make sure it's assigned to the right deployment.

So once we get the task tracking in place I think our two priorities should be:

  1. making getStats async
  2. making reMapImagesToDeps (or all deployment CRUD ops) async

nathanielrindlaub commented Mar 27, 2024

I think this is just about done. We've migrated getStats, exportData, exportImageErrors, and create/update/deleteDeployments to the new async task lambda. We decided against using the task pattern for batches as they are much more involved and deeply integrated in the rest of the code base. It's worth, however, looking at whether we can now move some of the other time-consuming operations (deleting large numbers of images, deleting large numbers of labels, merging large numbers of labels) to async and increase or remove the allowable operation threshold.

Final punchlist:

  • review access patterns / interfaces between the new task handlers and the various DB Models they interface with. Right now there's some inconsistency and semi-confusing function names (see: Async UpdateDeployment #168 (comment))
  • review frontend task slice and see if there are any opportunities to streamline or abstract it. At the very least, generalize getTaskFailure action handler so it works for all task types.

@nathanielrindlaub

Capturing some ideas discussed in our software team meeting last week (11/26):

Approaches to handling long running tasks:

  • Look for opportunities to maximize the use of bulk MongoDB CRUD ops like bulkWrite() and updateMany(). The worst-case scenario is when we have to iterate over individual records before updating them, for example when we need to inspect certain document properties to determine how to update them (as we do when deleting labels from images). There are, however, often creative ways to squeeze some juice out of challenging situations like that. For example, @jue-henry was able to make image deletion 15x faster by performing bulk S3 object deletion requests and grouping the MongoDB deletion calls into batches of 300 images at a time.

  • Make operations more efficient by creating high-level meta-properties in documents that represent certain states, rather than having to inspect and infer complex states from deeply nested sub-documents. A good example of this is @jue-henry's implementation of the Image.reviewed property.

  • Set a flag, then perform scheduled clean-up in the background. For example, a much better approach to deleting images than forcing users to wait around for a long-running delete operation would be to simply set an Image.queuedForDeletion: true flag, hide those images from the frontend to prevent users from seeing/interacting with them, and then perform the actual deletion with a scheduled cleanup job. This would also allow us to support "soft-deletion" or "archiving" of images so that users can change their mind and have the chance to move the images out of the trash for some amount of time.

  • async tasks. I.e., the approach we have been taking with the async task Lambda.
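
To make the batching and set-flag ideas above concrete, here is a small in-memory sketch. It is not the project's implementation: `ImageDoc`, `queueForDeletion`, `visibleImages`, `sweep`, and the `deleteBatch` callback are hypothetical, and `deleteBatch` stands in for whatever bulk S3 and MongoDB calls the real cleanup job would make. The batch size of 300 is taken from the image deletion example above.

```typescript
// Hypothetical slice of an image document; only the fields the sketch needs.
interface ImageDoc {
  _id: string;
  queuedForDeletion: boolean;
}

// Split a list into fixed-size batches so each bulk call stays bounded.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Step 1: the user-facing "delete" only sets a flag, so it returns quickly.
function queueForDeletion(images: ImageDoc[], ids: Set<string>): void {
  for (const img of images) {
    if (ids.has(img._id)) img.queuedForDeletion = true;
  }
}

// What a frontend query would return: flagged images are hidden.
function visibleImages(images: ImageDoc[]): ImageDoc[] {
  return images.filter((img) => !img.queuedForDeletion);
}

// Step 2: a scheduled job deletes flagged images in batches.
// deleteBatch is a placeholder for the real bulk delete calls.
function sweep(
  images: ImageDoc[],
  deleteBatch: (ids: string[]) => void,
  batchSize = 300
): ImageDoc[] {
  const queued = images
    .filter((img) => img.queuedForDeletion)
    .map((img) => img._id);
  for (const batch of chunk(queued, batchSize)) {
    deleteBatch(batch);
  }
  const removed = new Set(queued);
  return images.filter((img) => !removed.has(img._id));
}
```

Because the flag is only acted on when `sweep` runs, clearing it before the next sweep is all it would take to restore an image, which is what makes the "trash" behavior possible.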
