
Reduce the number of fetches harvesting one component #475

Merged: 8 commits, Jul 22, 2022

Conversation

@qtomlinson (Collaborator) commented Jun 20, 2022

Summary:

  1. Implemented caching of fetch results and of in-progress fetches (a sketch of both ideas follows below).
    - Introduced FetchResult and a cache of FetchResults in the Dispatcher to prevent multiple subsequent fetches for the same coordinates. The fetch result can be reused for the various code analyses: clearlydefined, licensee, scancode, reuse, and in the future fossology.
    - Also implemented a cache of in-progress fetches (promises) to avoid multiple concurrent fetches for the same coordinates.
  2. Added ScopedQueueSets for local and global scoped queue sets.
  • The local scoped queueset holds tasks to be performed on the fetched result (package) that is currently being processed and cached locally on the crawler instance. This avoids refetching and increases cache hits.
  • The global scoped queueset holds the queues shared among crawler instances.
  • The local queueset is popped before the global one. This ensures that the cache is utilized before it expires.
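To make the two mechanisms concrete, here is a minimal sketch; it is not the code in this PR. Class shapes and names such as `CachingDispatcher`, `fetchResultCache`, `inProgressFetches`, `copyTo`, and `_fetchFromOrigin` are illustrative assumptions — only `FetchResult`, the Dispatcher, and `ScopedQueueSets` are names from the PR itself.

```js
// Sketch of the fetch deduplication idea (illustrative names, not the PR's exact code).
class CachingDispatcher {
  constructor(fetchResultCache) {
    this.fetchResultCache = fetchResultCache // TTL cache: coordinates -> FetchResult
    this.inProgressFetches = new Map()       // coordinates -> pending fetch promise
  }

  async fetch(request) {
    const key = request.toUniqueString()

    // Reuse a completed fetch: the same FetchResult can feed clearlydefined,
    // licensee, scancode, reuse (and fossology in the future).
    const cached = this.fetchResultCache.get(key)
    if (cached) return cached.copyTo(request)

    // Deduplicate concurrent fetches for the same coordinates by sharing the promise.
    if (!this.inProgressFetches.has(key)) {
      const pending = this._fetchFromOrigin(request)
        .then(result => {
          this.fetchResultCache.set(key, result)
          return result
        })
        .finally(() => this.inProgressFetches.delete(key))
      this.inProgressFetches.set(key, pending)
    }
    const result = await this.inProgressFetches.get(key)
    return result.copyTo(request)
  }

  async _fetchFromOrigin(request) {
    throw new Error('delegate to the provider-specific fetcher (maven, npm, git, pypi, ...)')
  }
}

// Sketch of the local-before-global pop order in ScopedQueueSets.
class ScopedQueueSetsSketch {
  constructor(globalQueues, localQueues) {
    this.localQueues = localQueues   // in-memory, tasks for the package just fetched on this instance
    this.globalQueues = globalQueues // shared among crawler instances
  }

  async pop() {
    // Drain local work first so the cached fetch result is used before it expires.
    return (await this.localQueues.pop()) || (await this.globalQueues.pop())
  }
}
```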

Task: #464

@qtomlinson (Collaborator, author) commented Jun 20, 2022

Performance testing in my local dev environment shows a ~10% improvement in processing the following Maven components.
Average before change: 237 sec
Average after change: 212 sec

POST call to: localhost:5000/requests
Payload:

```json
[
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3.1"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3.2"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3.3"
    }
]
```
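For reference, the payload above can be submitted with a small script like the following — a sketch assuming Node 18+ (built-in fetch) run as an ES module, with the payload saved to a hypothetical payload.json; add whatever authentication header your crawler configuration requires.

```js
// Sketch: POST the payload above to a locally running crawler.
// Assumes Node 18+ (built-in fetch), ES module context, and the payload saved as
// ./payload.json (hypothetical file name). Auth headers, if configured, are omitted.
import { readFile } from 'node:fs/promises'

const payload = JSON.parse(await readFile('./payload.json', 'utf8'))
const response = await fetch('http://localhost:5000/requests', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload)
})
console.log(response.status, await response.text())
```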

In addition to the performance consideration, the repositories from which artifacts are retrieved tend to have rate limits as well, and this needs to be taken into account. As the crawler service scales and more processors (e.g. reuse, or fossology in the future) are added to harvesting, this can become a concern if not addressed.

@qtomlinson qtomlinson marked this pull request as ready for review June 20, 2022 19:55
@qtomlinson (Collaborator, author) commented Jun 20, 2022

@disulliv @MichaelTsengLZ The git history seems to have changed. My previous pull request now contains reuse changes as well. This is the cleaned-up version of my previous pull request. It is ready for review :)

Compared to my previous pull request:
- No additional changes.
- All the cache-related commits are squashed into one: "Cache in progress fetch promises, cached fetched results".
- The three commits related to "publish to global queue on crawler shutdown" are squashed into "Publish requests on local queues to global upon crawler shutdown".

@qtomlinson (Collaborator, author) commented Jun 21, 2022

Test case 2 (payload below) also showed a >10% improvement in processing time.
Data correctness was also confirmed by validating the harvested data against that generated by the master branch. Aside from timestamps and local temporary directory names, there are no differences in the harvested data.

```json
[
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3"
    },
    {
        "type": "component",
        "url": "cd:/maven/gradleplugin/io.github.lognet/grpc-spring-boot-starter-gradle-plugin/4.6.0"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavengoogle/android.arch.lifecycle/common/1.0.1"
    },
    {
        "type": "component",
        "url": "cd:/crate/cratesio/-/bitflags/1.0.4"
    },
    {
        "type": "component",
        "url": "cd:/npm/npmjs/-/redis/0.1.0"
    },
    {
        "type": "component",
        "url": "cd:/git/github/bitflags/bitflags/518aaf91494e94f41651a40f1b38d6ab522b0235"
    },
    {
        "type": "component",
        "url": "cd:/pypi/pypi/-/backports.ssl_match_hostname/3.7.0.1"
    },
    {
        "type": "component",
        "url": "cd:/gem/rubygems/-/small/0.4"
    },
    {
        "type": "component",
        "url": "cd:/composer/packagist/symfony/polyfill-mbstring/1.11.0"
    },
    {
        "type": "component",
        "url": "cd:/go/golang/rsc.io/quote/v1.3.0"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/xunit.core/2.4.1"
    },
    {
        "type": "component",
        "url": "cd:/pod/cocoapods/-/SoftButton/0.1.0"
    },
    {
        "type": "deb",
        "url": "cd:/deb/debian/-/mini-httpd/1.30-0.2_arm64"
    }
]
```

Cache in progress fetch promises, cached fetched results for maven

Add a unit test for gitCloner

Cache fetch results from gitCloner

Add a unit test for pypiFetch

Cache fetch results from pypiFetch

Minor refactoring

Cache fetch results from npmjsFetch

Add unit tests for rubyGem

Cache fetch results from rubyGemFetch

Cache fetch results from packagistFetch

Cache fetch results from crateioFetch

Cache fetch results from debianFetch

Cache fetch results from goFetch

Deep clone cached result on copy

Cache fetch results from nugetFetch

Add unit tests for podFetch

Cache results from podFetch

Delay fetchResult construction until end of fetch.

Delay fetchResult construction and move the cleanup of the download directory to the end of the fetch.
This ensures that, when an error occurs, the cleanup of the download directory is still tracked in the request.

Minor refactoring

Minor refactoring

Remove todo to avoid merge conflict

Adapt tests after merge
ScopedQueueSets contains local and global scoped queue sets.
The local scoped queueset holds tasks to be performed on the fetched result (package) that is currently being processed and cached locally on the crawler instance. This avoids refetching and increases cache hits.
The global scoped queueset holds the queues shared among crawler instances.
The local queueset is popped before the global one. This ensures that the cache is utilized before it expires.
Fix and add tests

Allow graceful shutdown
After ScopedQueueSets is introduced, the tool tasks on the same fetched result (in the local scoped queueset) are processed consecutively.
Therefore, the cache TTL for the fetched result can now be reduced.
@MichaelTsengLZ (Contributor) commented Jul 12, 2022

I will merge and test this tomorrow morning on the dev environment.

@qtomlinson qtomlinson marked this pull request as draft July 14, 2022 18:18
In my previous changes:
- the Node.js application runs as PID 1 in the Docker container, and
- the application can handle termination signals.

Therefore, the --init option is no longer necessary and has been removed from the docker run command.
@qtomlinson qtomlinson marked this pull request as ready for review July 14, 2022 19:30
@qtomlinson (Collaborator, author) commented:

@MichaelTsengLZ Any more improvements to be made?

@MichaelTsengLZ MichaelTsengLZ merged commit 661d709 into clearlydefined:master Jul 22, 2022
@qtomlinson qtomlinson deleted the qt/reduce_fetch branch July 25, 2022 14:28
@qtomlinson (Collaborator, author) commented:

@MichaelTsengLZ Is there a way to check for deadletters in the crawler? Partial harvests were observed for the following components in production:
```json
[
    {
        "type": "component",
        "url": "cd:/pypi/pypi/-/numba/0.56.0"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/Microsoft.VisualStudio.DiagnosticsHub.CorProfiler/17.4.32726.1"
    },
    {
        "type": "component",
        "url": "cd:/sourcearchive/mavencentral/com.google.dagger/hilt-android/2.43"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/Microsoft.VSSDK.CompatibilityAnalyzer/17.2.2197"
    },
    {
        "type": "component",
        "url": "cd:/composer/packagist/cakephp/cakephp/4.4.3"
    },
    {
        "type": "component",
        "url": "cd:/sourcearchive/mavencentral/org.eclipse.leshan/leshan-client-cf/2.0.0-M8"
    },
    {
        "type": "component",
        "url": "cd:/sourcearchive/mavencentral/org.eclipse.leshan/leshan-server-cf/1.4.1"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/Microsoft.IdentityModel.Tokens/6.22.0"
    },
    {
        "type": "component",
        "url": "cd:/pypi/pypi/-/pytorch-ignite/0.5.0.dev20220727"
    }
]
```
These ran fine in my local dev environment. If some of the processors failed, the failures should be recorded in deadletters.

@MichaelTsengLZ (Contributor) commented:

I haven't pushed your commits from dev to prod yet because I found the crawler dev environment was down yesterday. I'm trying to figure out what's wrong on dev because it runs OK locally.

2022-07-27T22:30:29.200Z INFO  - Initiating warmup request to container cdcrawler-dev_0_359c8f30 for site cdcrawler-dev

2022-07-27T22:30:44.269Z INFO  - Waiting for response to warmup request for container cdcrawler-dev_0_359c8f30. Elapsed time = 15.0689425 sec

2022-07-27T22:30:30.636401135Z [I] appInitStart {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.637419541Z [I] creating refreshing options with crawlerName:crawler {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.638018644Z [I] creating refreshing options crawler with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.638622048Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.640544260Z [I] creating refreshing options filter with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.641123463Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.641669267Z [I] creating refreshing options fetch with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.642227270Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.642798474Z [I] creating refreshing options process with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.643492678Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.644082081Z [I] creating refreshing options queue with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.644620985Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645202988Z [I] creating refreshing options store with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645242788Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645777292Z [I] creating refreshing options deadletter with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645912692Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.646548296Z [I] creating refreshing options lock with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.646648497Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.657678264Z (node:1) Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency

2022-07-27T22:30:30.657723864Z (Use `node --trace-warnings ...` to show where the warning was created)

2022-07-27T22:30:30.659623976Z [I] got refreshingOption values for crawler {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.660282980Z [I] got refreshingOption values for filter {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.660863083Z [I] got refreshingOption values for fetch {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.661446187Z [I] got refreshingOption values for process {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.662254292Z [I] got refreshingOption values for queue {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.662857895Z [I] got refreshingOption values for store {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.663532399Z [I] got refreshingOption values for deadletter {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.664121603Z [I] got refreshingOption values for lock {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.668645130Z [I] filter options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.669199734Z [I] lock options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.669836538Z [I] crawler options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.670461741Z [I] fetch options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.670477742Z [I] process options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.670483342Z [I] queue options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.671104645Z [I] store options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.671121145Z [I] deadletter options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.671692049Z [I] created all refreshingOptions {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.672271152Z [I] creating crawler {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.672352353Z [I] creating queue:storageQueue {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.679792798Z [I] creating queue:memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.680403702Z Service initialization error: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.682179613Z TypeError: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.682198313Z     at new QueueSet (/opt/service/ghcrawler/providers/queuing/queueSet.js:9:26)

2022-07-27T22:30:30.682203913Z     at Function.createQueueSet (/opt/service/ghcrawler/crawlerFactory.js:218:12)

2022-07-27T22:30:30.682254813Z     at module.exports (/opt/service/ghcrawler/providers/queuing/memoryFactory.js:14:25)

2022-07-27T22:30:30.682262213Z     at Function._getProvider (/opt/service/ghcrawler/crawlerFactory.js:156:26)

2022-07-27T22:30:30.682266213Z     at Function.createQueues (/opt/service/ghcrawler/crawlerFactory.js:210:27)

2022-07-27T22:30:30.682270013Z     at Function.createScopedQueueSets (/opt/service/ghcrawler/crawlerFactory.js:223:40)

2022-07-27T22:30:30.682273813Z     at Function.createCrawler (/opt/service/ghcrawler/crawlerFactory.js:62:39)

2022-07-27T22:30:30.682277613Z     at /opt/service/ghcrawler/crawlerFactory.js:34:38

2022-07-27T22:30:30.682281413Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)

2022-07-27T22:30:30.682443714Z Error initializing the Express app: TypeError: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.688706852Z trackException:

2022-07-27T22:30:30.688726252Z Error: TypeError: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.688732152Z     at /opt/service/ghcrawler/bin/www.js:31:13

2022-07-27T22:30:30.688736252Z     at /opt/service/node_modules/express-init/lib/index.js:50:14

2022-07-27T22:30:30.688740152Z     at /opt/service/node_modules/async/dist/async.js:2959:19

2022-07-27T22:30:30.688743952Z     at wrapper (/opt/service/node_modules/async/dist/async.js:272:20)

2022-07-27T22:30:30.688748052Z     at iterateeCallback (/opt/service/node_modules/async/dist/async.js:417:21)

2022-07-27T22:30:30.688751952Z     at /opt/service/node_modules/async/dist/async.js:325:20

2022-07-27T22:30:30.688755852Z     at /opt/service/node_modules/async/dist/async.js:2957:17

2022-07-27T22:30:30.688759652Z     at /opt/service/node_modules/express-init/lib/index.js:36:20

2022-07-27T22:30:30.688763453Z     at /opt/service/ghcrawler/app.js:50:9

2022-07-27T22:30:30.688767153Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)



2022-07-27T22:30:46.292Z ERROR - Container cdcrawler-dev_0_359c8f30 for site cdcrawler-dev has exited, failing site start

2022-07-27T22:30:46.337Z ERROR - Container cdcrawler-dev_0_359c8f30 didn't respond to HTTP pings on port: 5000, failing site start. See container logs for debugging.

2022-07-27T22:30:46.344Z INFO  - Stopping site cdcrawler-dev because it failed during startup.

@qtomlinson (Collaborator, author) commented:

In the local dev env, both the local and global queues use the memory provider, but in the dev deployment we would have memory for the local queue and a storage queue for the global queue. This is the case that needs some more work.

@qtomlinson (Collaborator, author) commented:

PR #480 addresses the startup issue on the dev deployment.

@MichaelTsengLZ (Contributor) commented:

Cool. Looking at this right now.

@MichaelTsengLZ (Contributor) commented Jul 29, 2022

@qtomlinson PR #480 fixes the issue and the App Service dev environment works well. The only issue now is that when I restarted the App Service, there was no log saying "Server closed.". This means your graceful shutdown process doesn't run. I think Azure App Service doesn't use docker stop ${container} to stop the crawler.

@qtomlinson (Collaborator, author) commented:

@mpcen As a follow-up on Michael's observation (graceful shutdown not triggered during crawler restart), a similar question was raised in 2020: graceful shutdown on Azure AppServices (Linux/Docker) via IHostApplicationLifeTime.

According to documentation, an App Service restart uses docker restart. During local testing, docker restart or docker stop triggers the graceful shutdown of the crawler, and Received SIGTERM (start of shutdown) and Server closed (end of shutdown) appear in the logs. In the logs from a crawler container deployed on App Service, however, Received SIGTERM and Server closed are both missing during webapp restart and stop. Setting verbose-level logging, either via az webapp start --verbose or az webapp log config, does not yield more information. Any suggestions for further investigation?

One possible explanation is that the default time to wait for the container to exit differs. For docker stop or docker restart, the default is 10 seconds and can be modified via -t. In Azure App Service, the default is 5 seconds, according to stackoverflow. WEBSITES_CONTAINER_STOP_TIME_LIMIT can potentially be used to configure the wait. This setting is not yet documented at Environment variables and app settings in Azure App Service, and it may be worth trying once the setting is released.

The end result of the missing graceful shutdown during crawler restart is that harvests in progress before the shutdown will be partial. Currently, partial harvest cases are also present during normal production runs (see comment and additional comment). Users can re-trigger package harvesting to work around this. What do you think?
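For context, the graceful-shutdown wiring this discussion is about looks roughly like the following — a sketch with hypothetical names (`server`, `scopedQueueSets`, `registerGracefulShutdown`), where `scopedQueueSets.publish()` stands in for "move local in-memory requests to the shared global queues"; it is not the PR's exact code. The point is that none of it runs unless the SIGTERM from docker stop / docker restart actually reaches the Node process.

```js
// Sketch only; hypothetical names, not the PR's exact implementation.
function registerGracefulShutdown(server, scopedQueueSets, logger = console) {
  process.on('SIGTERM', async () => {
    logger.info('Received SIGTERM')
    try {
      // Requests still sitting on this instance's local queues are published to the
      // global queues so another crawler instance can pick them up.
      await scopedQueueSets.publish()
    } finally {
      server.close(() => {
        logger.info('Server closed.')
        process.exit(0)
      })
    }
  })
}
```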

@mpcen (Member) commented Aug 22, 2022

@qtomlinson do you know approximately how much more time we'd need to wait for the process to complete? It looks like WEBSITES_CONTAINER_STOP_TIME_LIMIT has been rolled out but just hasn't been documented. My only concern with this approach would be a potential application deadlock (if the waiting period is long) but maybe this won't be an issue since we have multiple crawler instances?

I'd say give it a go in DEV by playing around with some numbers for the time.

@qtomlinson (Collaborator, author) commented Sep 14, 2022

@mpcen Setting WEBSITES_CONTAINER_STOP_TIME_LIMIT to a longer value (e.g. 20s or higher) did not seem to help: shutdown logging is still missing from the container logs. The shutdown sequence was occasionally visible in the streaming logs (/appsvctmp/volatile/logs/runtime/6a12ea3ebdd5e50816da5d11fb57c328aaa6d8848f95df0a50a4a69a62c9b329.log), but not in the container logs accessible via the public API. The intriguing piece is that the beginning of the shutdown is missing from the log: if the shutdown time were simply too short, the beginning of the shutdown should still have been visible.

As an experiment, an nginx container image was deployed; the graceful shutdown of that container was also missing from its container logs.

Without logging, an alternative is to confirm whether the side effects of the graceful shutdown occurred. A storage queue was set up with a test crawler web app on App Service to mimic the queue used in the production environment. After shutdown, additional sub-tasks of the package harvests were observed on the shared storage queue. This means that the publishing of local in-memory tasks (from the crawler instance) to the globally shared storage queue during graceful shutdown was triggered.

The missing shutdown sequence in the container logs therefore appears to be a logging issue in App Service.

@qtomlinson (Collaborator, author) commented Nov 19, 2022

To cope with the possibility that the graceful shutdown may not be triggered, StorageBackedQueue was introduced and can be used as the local queue.

qtomlinson added a commit to qtomlinson/crawler that referenced this pull request Feb 6, 2024
…#475)

* Cache in progress fetch promises, cached fetched results
* Add ScopedQueueSets
* Publish requests on local queues to global upon crawler shutdown
* Minor refactor and add more tests
* Update docker file to relay of shutdown signal
* Add config for dispatcher.fetched cache
* Address review comments
* Removed --init option in docker run