Bump the indexing timeout value #68

Merged
merged 5 commits into from
Feb 12, 2024
@higs4281 higs4281 commented Feb 12, 2024

Indexing on EXT Jenkins has become troublesome, possibly because of the increased load on the server from more aggressive security scanning and heavy morning cron jobs.

Two types of errors have interrupted indexing:

  • Disconnects with S3
  • Timeouts during OpenSearch indexing

These appear to be specific to EXT Jenkins, because our DEV Jenkins instance (zusa) runs the same pipeline code against the same data, at the same time of the morning, and rarely fails to finish.

The S3 interruptions are not too disruptive, because the downloads that succeed won't have to be run again on a retry, and little time is lost.
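Why a retry loses little time can be illustrated with a minimal sketch. This is not the pipeline's actual code: `download_missing`, the `fetch` callable, and the `.part` suffix are all hypothetical names, and the sketch assumes completed downloads are kept on local disk between attempts.

```python
from pathlib import Path
from typing import Callable, Iterable


def download_missing(
    keys: Iterable[str],
    dest_dir: Path,
    fetch: Callable[[str, Path], None],
) -> list[str]:
    """Fetch only the keys not already on disk; return the keys fetched.

    `fetch` is a hypothetical stand-in for whatever call copies one S3
    object to a local path. It may raise if the connection drops.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for key in keys:
        target = dest_dir / Path(key).name
        if target.exists():
            # Finished on an earlier attempt, so a retry skips it.
            continue
        # Download to a temporary name so an interrupted transfer is
        # never mistaken for a complete file.
        tmp = target.with_name(target.name + ".part")
        fetch(key, tmp)
        tmp.replace(target)
        fetched.append(key)
    return fetched
```

If a disconnect interrupts the second of two downloads, calling the function again fetches only the second file.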

The timeout error, however, is painful when indexing fails late in the run. In the last week, a timeout stopped indexing after 4 million complaints had been indexed, which forced the job to start over from zero.

This PR doubles the OpenSearch timeout value, which should not affect most runs, but could save the occasional late timeout.
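As a rough sketch of the change, assuming the pipeline builds its client with keyword arguments in the style of opensearch-py, where `timeout` is given in seconds. The helper name and the 30-second base value are hypothetical; the actual old and new values are in the diff.

```python
# Hypothetical base value, for illustration only.
BASE_TIMEOUT_SECONDS = 30


def client_kwargs(base_timeout: int = BASE_TIMEOUT_SECONDS, factor: int = 2) -> dict:
    """Return client keyword arguments with the request timeout scaled up.

    The resulting dict is meant to be unpacked into the client
    constructor, e.g. OpenSearch(hosts=[...], **client_kwargs()).
    """
    if base_timeout <= 0 or factor < 1:
        raise ValueError("base_timeout must be positive and factor at least 1")
    return {"timeout": base_timeout * factor}
```

A longer timeout only matters for requests that would otherwise have exceeded the old limit, which is why most runs should be unaffected.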

I ran this morning's CCDB indexing using this branch, and it succeeded on the first try. That doesn't prove that the new value saved the run, but I think we should see if the new timeout reduces the churn.

Testing

In addition to test-running the new timeout value, I got the unit tests running again by upgrading Python to 3.11 and adjusting the tox configs.

@higs4281 higs4281 merged commit d5ae4e7 into main Feb 12, 2024
2 checks passed