[Feature]: possibility to pause/resume ongoing vectorizer jobs #336

Open
dberardo-com opened this issue Jan 2, 2025 · 4 comments

Comments

@dberardo-com

What problem does the new feature solve?

I have a very large dataset (wikimedia.en) that I would like to vectorize, and I would like the vectorizer to run only at night.

I am downloading the whole Wikimedia dataset in a single shot, but I want the vectorizer to process it only during the night.

What does the feature do?

  • load a huge dataset in one shot
  • start vectorization
  • pause vectorization on demand
  • resume vectorization on demand

Implementation challenges

No response

Are you going to work on this feature?

None

@dberardo-com
Author

AFAIK, to reproduce this behavior I would have to create a vectorizer with an empty queue and then load rows little by little at night.

The problem is that I would have to wait for the worker to finish each time before adding more rows to the queue. That makes little sense, because this automation should be a concern of the worker itself, not of a periodic job running outside the worker, in Timescale.

@alejandrodnm
Contributor

To pause vectorization, you just need to stop the workers. How you do that depends on the infrastructure you're running. You can set up a job that stops every vectorizer worker container you're running at a specific time of day.

If you're running everything locally, you can start the containers, then set a cronjob that runs docker stop.
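
As a rough sketch of the cron-driven approach, the script below could be run from cron every few minutes: it checks whether the current time falls inside a night window and builds a `docker start` or `docker stop` call accordingly. The container name `vectorizer-worker`, the 22:00 to 06:00 window, and the `--apply` switch are assumptions made for illustration, not values from this thread.

```python
import subprocess
import sys
from datetime import datetime, time

# Hypothetical values: adjust to your own setup.
CONTAINERS = ["vectorizer-worker"]
NIGHT_START = time(22, 0)
NIGHT_END = time(6, 0)


def in_night_window(now: time, start: time = NIGHT_START, end: time = NIGHT_END) -> bool:
    """Return True if `now` is inside [start, end), allowing the window to wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def docker_command(now: time, containers: list[str]) -> list[str]:
    """Build the docker CLI call: start workers at night, stop them during the day."""
    action = "start" if in_night_window(now) else "stop"
    return ["docker", action, *containers]


if __name__ == "__main__":
    cmd = docker_command(datetime.now().time(), CONTAINERS)
    if "--apply" in sys.argv:
        # Actually start or stop the containers.
        subprocess.run(cmd, check=True)
    else:
        # Dry run: only show what would be executed.
        print(" ".join(cmd))
```

Without `--apply` the script only prints the command, which makes it easy to check the window logic before wiring it into cron.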

On cloud we expose start and stop, but the approach there is different. We import the vectorizer as a library and use a push approach, with the DB generating HTTP events instead of the worker polling for vectorizers. The events are created by using ai.scheduling_timescaledb, which creates a Timescale background job, and those jobs can be stopped.
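
For the background-job route, TimescaleDB's `alter_job` function accepts a `scheduled` flag that pauses or resumes a job without deleting it. The sketch below only builds the SQL statements. The job id `1000` is a placeholder, and whether pausing this job is enough to halt vectorization in a self-hosted, polling setup is not confirmed in this thread.

```python
def alter_job_sql(job_id: int, scheduled: bool) -> str:
    """Build a TimescaleDB statement that pauses (False) or resumes (True) a background job."""
    if not isinstance(job_id, int) or isinstance(job_id, bool):
        raise TypeError("job_id must be an integer")
    flag = "true" if scheduled else "false"
    return f"SELECT alter_job({job_id}, scheduled => {flag});"


# Job ids can be looked up in the timescaledb_information.jobs view.
LIST_JOBS_SQL = "SELECT job_id, proc_name, scheduled FROM timescaledb_information.jobs;"

if __name__ == "__main__":
    print(LIST_JOBS_SQL)
    print(alter_job_sql(1000, scheduled=False))  # pause; 1000 is a placeholder id
    print(alter_job_sql(1000, scheduled=True))   # resume
```

The printed statements can be run from `psql` or any client; a nightly scheduler would issue the resume statement in the evening and the pause statement in the morning.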

There are multiple approaches you can take. I'd go with starting and stopping the containers; that seems to be the easiest solution.

@linear linear bot closed this as completed Jan 8, 2025
@dberardo-com
Author

> There are multiple approaches you can take. I'd go with starting and stopping the containers, that seems to be the easiest solution.

The problem with this approach is that I might need to start all workers with the "-i" flag, right? That does not scale well if I don't know the number of vectorizers in advance, which is the case I am aiming for.

Or am I misunderstanding the approach here?

@alejandrodnm alejandrodnm reopened this Jan 17, 2025
@alejandrodnm
Contributor

Hey @dberardo-com, if you don't want to process any vectorizer, then the solution is to stop the worker.

If you want the worker to keep processing other vectorizers but skip specific ones, then I'd say this is correct:

> i might need to start all workers with the "-i flag" isn't it

I agree this doesn't scale for that use case.
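
To illustrate why this is awkward, here is a small sketch that builds the `-i` arguments for every vectorizer except the ones to skip. It assumes `-i` can be repeated once per vectorizer id, and the ids are made up. The list has to be recomputed, and the worker restarted, whenever a vectorizer is added or paused.

```python
def include_args(all_ids: list[int], paused_ids: set[int]) -> list[str]:
    """Return repeated `-i <id>` arguments for every vectorizer that is not paused."""
    args: list[str] = []
    for vid in sorted(all_ids):
        if vid not in paused_ids:
            args += ["-i", str(vid)]
    return args


if __name__ == "__main__":
    # Hypothetical ids: vectorizers 1 to 4 exist, and 2 should be skipped.
    print(include_args([1, 2, 3, 4], {2}))
```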

I need to double-check something, but I think this is a feature we can add.
