Reduce the number of fetches harvesting one component #475
Performance testing in my local dev environment shows a ~10% improvement in processing the following maven components. POST call to: localhost:5000/requests
In addition to performance considerations, the repositories from which artifacts are retrieved tend to have rate limits as well. This needs to be taken into account: as the crawler service scales and more processors (e.g. reuse, or fossology in the future) are added to harvesting, this can become a concern if not addressed.
@disulliv @MichaelTsengLZ The git history seems to have changed; my previous pull request now contains reuse changes as well. This is the cleaned-up version of my previous pull request. It is ready for review :) Compared to my previous pull request:
(force-pushed from 2f917fd to 5bb1f51)
Test case 2 (payload below) also showed >10% improvement in processing time.
(force-pushed from 4a0f831 to b26f5f3)
- Cache in progress fetch promises, cached fetched results for maven
- Add a unit test for gitCloner
- Cache fetch results from gitCloner
- Add a unit test for pypiFetch
- Cache fetch results from pypiFetch
- Minor refactoring
- Cache fetch results from npmjsFetch
- Add unit tests for rubyGem
- Cache fetch results from rubyGemFetch
- Cache fetch results from packagistFetch
- Cache fetch results from crateioFetch
- Cache fetch results from debianFetch
- Cache fetch results from goFetch
- Deep clone cached result on copy
- Cache fetch results from nugetFetch
- Add unit tests for podFetch
- Cache results from podFetch
- Delay fetchResult construction until the end of the fetch, and move the cleanup of the download directory to the end of the fetch. This is to ensure that when an error occurs, the cleanup of the download directory will still be tracked in the request.
- Minor refactoring
- Minor refactoring
- Remove todo to avoid merge conflict
- Adapt tests after merge
ScopedQueueSets contains local and global scoped queue sets. The local scoped queue set holds tasks to be performed on the fetched result (package) that is currently being processed and cached locally on the crawler instance. This avoids refetching and increases the cache hit rate. The global scoped queue set holds the queues shared among crawler instances. The local queue set is popped before the global one; this ensures that the cache is utilized before it expires.
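A minimal sketch of that pop order, assuming hypothetical class and method names (`ScopedQueueSets`, `pop`) rather than the crawler's actual API:

```javascript
// Sketch only: prefer the local scoped queue set so tool tasks reuse the
// locally cached fetch result before falling back to the shared queues.
class ScopedQueueSets {
  constructor(localQueues, globalQueues) {
    this.local = localQueues // tasks for the package cached on this instance
    this.global = globalQueues // queues shared among all crawler instances
  }

  async pop() {
    // Local first: this maximizes cache hits before the cached result expires.
    const request = await this.local.pop()
    return request || this.global.pop()
  }
}
```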
- Fix and add tests
- Allow graceful shutdown
After the scopedQueueSets is introduced, the tool tasks on the same fetched result (in the local scoped queue set) are processed consecutively. Therefore, the cache TTL for the fetched result can now be reduced.
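For illustration only, a hypothetical shape of that setting; the actual key names and values in the crawler's configuration may differ:

```javascript
// Hypothetical config sketch: tool tasks for one fetched result now run back
// to back from the local queue set, so the result only needs to stay cached
// for a short window instead of for the whole harvest.
const config = {
  dispatcher: {
    fetched: {
      ttlSeconds: 60 // illustrative value, not the crawler's actual default
    }
  }
}
```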
(force-pushed from b26f5f3 to 0dcb92f)
I will merge and test this tomorrow morning on the dev environment.
In my previous changes:
- the nodejs application is run as PID 1 in the docker container, and
- the application can handle termination signals.
Therefore, the --init option is no longer necessary and has been removed from the docker run command.
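The signal handling this relies on can be sketched as below; `publishLocalQueuesToGlobal` is a hypothetical name standing in for the shutdown step, but the PID 1 point is standard Docker behavior: an exec-form `ENTRYPOINT ["node", "index.js"]` runs node as PID 1, so SIGTERM reaches the application directly without an init shim.

```javascript
// Sketch, assuming a hypothetical publishLocalQueuesToGlobal() shutdown step.
// With node as PID 1 (exec-form ENTRYPOINT), docker stop's SIGTERM arrives here.
process.on('SIGTERM', async () => {
  try {
    await publishLocalQueuesToGlobal() // drain local queues to the global queue
  } finally {
    process.exit(0)
  }
})
```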
(force-pushed from f0e4cfc to ea14613)
@MichaelTsengLZ Any more improvements to be made?
@MichaelTsengLZ Is there a way to check for dead letters in the crawler? Partial harvests were observed for the following components on production.
Partial harvests seem to exist prior to this commit. See:
https://clearlydefined.io/definitions/sourcearchive/mavencentral/org.zkoss.zk/zhtml/9.6.0.2
https://clearlydefined.io/definitions/sourcearchive/mavencentral/software.amazon.awssdk/protocol-core/2.17.206
https://clearlydefined.io/definitions/sourcearchive/mavencentral/org.wso2.carbon/org.wso2.carbon.ui/4.7.0-beta9
https://clearlydefined.io/definitions/sourcearchive/mavencentral/org.wso2.carbon/org.wso2.carbon.core/4.7.0-beta9
I haven't pushed your commits from dev to prod yet because I found the crawler dev was dead yesterday. I'm trying to figure out what's wrong on dev, because it runs OK locally.
In the dev environment, we have both the local and global queues in memory, but in the dev deployment, we would have memory for the local queue and a storage queue for the global queue. This is the case that needs some more work.
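To illustrate the two shapes, a sketch with hypothetical factory names (`memoryQueueFactory`, `storageQueueFactory`) and an assumed `queueProvider` option, not the crawler's actual provider API:

```javascript
// Illustrative wiring only; all names here are assumptions.
const localQueues = memoryQueueFactory.create(options) // local scope stays in memory
const globalQueues = options.queueProvider === 'storageQueue'
  ? storageQueueFactory.create(options) // deployment: shared Azure Storage queue
  : memoryQueueFactory.create(options) // local dev: memory for both scopes
const queues = new ScopedQueueSets(localQueues, globalQueues)
```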
PR #480 is to address the startup issue on the dev deployment.
Cool. Looking at this right now.
@qtomlinson PR #480 fixes the issue and the App Service dev environment works well. The only issue now is that graceful shutdown is not triggered when the crawler restarts.
@mpcen As a follow-up on Michael's observation (graceful shutdown not triggered during crawler restart), a similar question was raised in 2020: graceful shutdown on Azure App Services (Linux/Docker) via IHostApplicationLifeTime. According to the documentation, an App Service restart uses docker restart. During local testing, docker restart or docker stop triggers the graceful shutdown of the crawler. One possible explanation is that the default time to wait for the container to exit differs: for docker stop or docker restart the default is 10 seconds and can be modified via -t (e.g. `docker stop -t 30 <container>`), whereas in Azure App Service the default is 5 seconds, according to Stack Overflow. The end result of the missing graceful shutdown during a crawler restart is that harvests in progress before the shutdown will be partial. Currently, partial harvest cases are also present during normal production runs (see comment and additional comment). Users can re-trigger package harvesting to work around this. What do you think?
@qtomlinson Do you know approximately how much more time we'd need to wait for the process to complete? It looks like … I'd say give it a go in DEV by playing around with some numbers for the time.
@mpcen Setting … As an experiment, an nginx container image was deployed; the graceful shutdown of that container was also missing from its container logs. Without logging, an alternative is to confirm whether the side effect of the graceful shutdown has occurred. A storage queue was set up with a test crawler web app on App Service to mimic the queue used in the production environment. After shutdown, additional sub-tasks of package harvests were observed on the shared storage queue. This means that the publishing of local in-memory tasks (from the crawler instance) to the globally shared storage queue during graceful shutdown was indeed triggered. The missing shutdown sequence in the container log seems to be a logging issue in App Service.
To cope with the possibility that graceful shutdown is not triggered, StorageBackedQueue was introduced and can be used as the local queue.
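A conceptual sketch of that idea, with assumed names rather than the actual StorageBackedQueue implementation; the point is that each locally queued request is also persisted, so it survives a shutdown where the graceful publish step never runs:

```javascript
// Sketch only, not the actual StorageBackedQueue implementation.
class StorageBackedQueue {
  constructor(memoryQueue, storageQueue) {
    this.memory = memoryQueue // fast local queue, lost on a hard kill
    this.storage = storageQueue // persistent copy that survives the kill
  }

  async push(request) {
    await this.storage.push(request) // persist first so nothing can be lost
    return this.memory.push(request)
  }

  async done(request) {
    await this.memory.done(request)
    return this.storage.done(request) // delete the persisted copy once handled
  }
}
```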
Reduce the number of fetches harvesting one component (#475)
- Cache in progress fetch promises, cached fetched results
- Add ScopedQueueSets
- Publish requests on local queues to global upon crawler shutdown
- Minor refactor and add more tests
- Update docker file to relay the shutdown signal
- Add config for dispatcher.fetched cache
- Address review comments
- Removed --init option in docker run
Summary:
- introduced FetchResult and a cache of FetchResults in the Dispatcher to prevent multiple subsequent fetches for the same coordinates. The fetch result can be reused for the various code analyses: clearlydefined, licensee, scancode, reuse, and in the future fossology.
- also implemented a cache for inProgressFetches (promises) to avoid multiple concurrent fetches for the same coordinates (see the sketch below).
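A minimal sketch of that de-duplication pattern; the names (`inProgressFetches`, `fetchOnce`) are illustrative, not the Dispatcher's actual API:

```javascript
// Concurrent requests for the same coordinates share one in-flight promise,
// so only a single download is performed.
const inProgressFetches = new Map()

function fetchOnce(coordinates, fetcher) {
  const key = coordinates.toString()
  if (!inProgressFetches.has(key)) {
    const promise = fetcher(coordinates)
      .finally(() => inProgressFetches.delete(key)) // allow refetch after completion
    inProgressFetches.set(key, promise)
  }
  return inProgressFetches.get(key)
}
```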
Task: #464