
Reduce the number of fetches harvesting one component #475

Merged: 8 commits, Jul 22, 2022

Conversation

@qtomlinson (Collaborator) commented Jun 20, 2022

Summary:

  1. Implemented caching of fetch results and of in-progress fetches (a sketch of both ideas follows below).
    - Introduced FetchResult and a cache of FetchResults in the Dispatcher to prevent multiple subsequent fetches for the same coordinates. The fetch result can be reused for the various code analyses: clearlydefined, licensee, scancode, reuse, and in the future fossology.
    - Also implemented a cache of in-progress fetches (promises) to avoid multiple concurrent fetches for the same coordinates.
  2. Added ScopedQueueSets for local and global scoped queue sets.
  • The local scoped queueset holds tasks to be performed on the fetched result (package) that is currently being processed and cached locally on the crawler instance. This avoids refetching and increases cache hits.
  • The global scoped queueset holds the queues shared among crawler instances.
  • The local queueset is popped before the global one. This ensures that the cache is utilized before it expires.
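To make the two mechanisms concrete, here is a minimal sketch; it is not the code in this PR. Class shapes and names such as `CachingDispatcher`, `fetchResultCache`, `inProgressFetches`, `copyTo`, and `_fetchFromOrigin` are illustrative assumptions — only `FetchResult`, the Dispatcher, and `ScopedQueueSets` are names from the PR itself.

```js
// Sketch of the fetch deduplication idea (illustrative names, not the PR's exact code).
class CachingDispatcher {
  constructor(fetchResultCache) {
    this.fetchResultCache = fetchResultCache // TTL cache: coordinates -> FetchResult
    this.inProgressFetches = new Map()       // coordinates -> pending fetch promise
  }

  async fetch(request) {
    const key = request.toUniqueString()

    // Reuse a completed fetch: the same FetchResult can feed clearlydefined,
    // licensee, scancode, reuse (and fossology in the future).
    const cached = this.fetchResultCache.get(key)
    if (cached) return cached.copyTo(request)

    // Deduplicate concurrent fetches for the same coordinates by sharing the promise.
    if (!this.inProgressFetches.has(key)) {
      const pending = this._fetchFromOrigin(request)
        .then(result => {
          this.fetchResultCache.set(key, result)
          return result
        })
        .finally(() => this.inProgressFetches.delete(key))
      this.inProgressFetches.set(key, pending)
    }
    const result = await this.inProgressFetches.get(key)
    return result.copyTo(request)
  }

  async _fetchFromOrigin(request) {
    throw new Error('delegate to the provider-specific fetcher (maven, npm, git, pypi, ...)')
  }
}

// Sketch of the local-before-global pop order in ScopedQueueSets.
class ScopedQueueSetsSketch {
  constructor(globalQueues, localQueues) {
    this.localQueues = localQueues   // in-memory, tasks for the package just fetched on this instance
    this.globalQueues = globalQueues // shared among crawler instances
  }

  async pop() {
    // Drain local work first so the cached fetch result is used before it expires.
    return (await this.localQueues.pop()) || (await this.globalQueues.pop())
  }
}
```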

Task: #464

@qtomlinson (Collaborator, author) commented Jun 20, 2022

Performance testing in my local dev environment shows a ~10% improvement in processing the following Maven components.
Average before change: 237 sec
Average after change: 212 sec

POST call to: localhost:5000/requests
Payload:

```json
[
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3.1"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3.2"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3.3"
    }
]
```
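For reference, the payload above can be submitted with a small script like the following — a sketch assuming Node 18+ (built-in fetch) run as an ES module, with the payload saved to a hypothetical payload.json; add whatever authentication header your crawler configuration requires.

```js
// Sketch: POST the payload above to a locally running crawler.
// Assumes Node 18+ (built-in fetch), ES module context, and the payload saved as
// ./payload.json (hypothetical file name). Auth headers, if configured, are omitted.
import { readFile } from 'node:fs/promises'

const payload = JSON.parse(await readFile('./payload.json', 'utf8'))
const response = await fetch('http://localhost:5000/requests', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload)
})
console.log(response.status, await response.text())
```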

In addition to the performance consideration, the repositories from which artifacts are retrieved tend to have rate limits as well, and this needs to be taken into account. As the crawler service scales and more processors (e.g. reuse, or fossology in the future) are added to harvesting, this can become a concern if not addressed.

@qtomlinson qtomlinson marked this pull request as ready for review June 20, 2022 19:55
@qtomlinson (Collaborator, author) commented Jun 20, 2022

@disulliv @MichaelTsengLZ The git history seems to have changed. My previous pull request now contains reuse changes as well. This is the cleaned-up version of my previous pull request. It is ready for review :)

Compared to my previous pull request:
- No additional changes.
- All the cache-related commits are squashed into one: "Cache in progress fetch promises, cached fetched results".
- The three commits related to "publish to global queue on crawler shutdown" are squashed into "Publish requests on local queues to global upon crawler shutdown".

@qtomlinson (Collaborator, author) commented Jun 21, 2022

Test case 2 (payload below) also showed a >10% improvement in processing time.
Data correctness was also confirmed by validating the harvested data against that generated by the master branch. Aside from timestamps and local temporary directory names, there are no differences in the harvested data.

```json
[
    {
        "type": "component",
        "url": "cd:/maven/mavencentral/org.apache.httpcomponents/httpcore/4.3"
    },
    {
        "type": "component",
        "url": "cd:/maven/gradleplugin/io.github.lognet/grpc-spring-boot-starter-gradle-plugin/4.6.0"
    },
    {
        "type": "component",
        "url": "cd:/maven/mavengoogle/android.arch.lifecycle/common/1.0.1"
    },
    {
        "type": "component",
        "url": "cd:/crate/cratesio/-/bitflags/1.0.4"
    },
    {
        "type": "component",
        "url": "cd:/npm/npmjs/-/redis/0.1.0"
    },
    {
        "type": "component",
        "url": "cd:/git/github/bitflags/bitflags/518aaf91494e94f41651a40f1b38d6ab522b0235"
    },
    {
        "type": "component",
        "url": "cd:/pypi/pypi/-/backports.ssl_match_hostname/3.7.0.1"
    },
    {
        "type": "component",
        "url": "cd:/gem/rubygems/-/small/0.4"
    },
    {
        "type": "component",
        "url": "cd:/composer/packagist/symfony/polyfill-mbstring/1.11.0"
    },
    {
        "type": "component",
        "url": "cd:/go/golang/rsc.io/quote/v1.3.0"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/xunit.core/2.4.1"
    },
    {
        "type": "component",
        "url": "cd:/pod/cocoapods/-/SoftButton/0.1.0"
    },
    {
        "type": "deb",
        "url": "cd:/deb/debian/-/mini-httpd/1.30-0.2_arm64"
    }
]
```

Cache in progress fetch promises, cached fetched results for maven

Add a unit test for gitCloner

Cache fetch results from gitCloner

Add a unit test for pypiFetch

Cache fetch results from pypiFetch

Minor refactoring

Cache fetch results from npmjsFetch

Add unit tests for rubyGem

Cache fetch results from rubyGemFetch

Cache fetch results from packagistFetch

Cache fetch results from crateioFetch

Cache fetch results from debianFetch

Cache fetch results from goFetch

Deep clone cached result on copy

Cache fetch results from nugetFetch

Add unit tests for podFetch

Cache results from podFetch

Delay fetchResult construction until end of fetch.

Delay fetchResult construction and move the cleanup of the download directory to the end of the fetch.
This ensures that, when an error occurs, the cleanup of the download directory is still tracked in the request.

Minor refactoring

Minor refactoring

Remove todo to avoid merge conflict

Adapt tests after merge
ScopedQueueSets contains local and global scoped queue sets.
The local scoped queueset holds tasks to be performed on the fetched result (package) that is currently being processed and cached locally on the crawler instance. This avoids refetching and increases cache hits.
The global scoped queueset holds the queues shared among crawler instances.
The local queueset is popped before the global one. This ensures that the cache is utilized before it expires.
Fix and add tests

Allow graceful shutdown
After ScopedQueueSets is introduced, the tool tasks on the same fetched result (in the local scoped queueset) are processed consecutively.
Therefore, the cache TTL for the fetched result can now be reduced.
@MichaelTsengLZ (Contributor) commented Jul 12, 2022

I will merge and test this tomorrow morning on the dev environment.

@qtomlinson qtomlinson marked this pull request as draft July 14, 2022 18:18
In my previous changes:
- the Node.js application runs as PID 1 in the Docker container, and
- the application can handle termination signals.

Therefore, the --init option is no longer necessary and has been removed from the docker run command.
@qtomlinson qtomlinson marked this pull request as ready for review July 14, 2022 19:30
@qtomlinson (Collaborator, author) commented:

@MichaelTsengLZ Any more improvements to be made?

@MichaelTsengLZ MichaelTsengLZ merged commit 661d709 into clearlydefined:master Jul 22, 2022
@qtomlinson qtomlinson deleted the qt/reduce_fetch branch July 25, 2022 14:28
@qtomlinson (Collaborator, author) commented:

@MichaelTsengLZ Is there a way to check for deadletters in the crawler? Partial harvests were observed for the following components in production:
```json
[
    {
        "type": "component",
        "url": "cd:/pypi/pypi/-/numba/0.56.0"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/Microsoft.VisualStudio.DiagnosticsHub.CorProfiler/17.4.32726.1"
    },
    {
        "type": "component",
        "url": "cd:/sourcearchive/mavencentral/com.google.dagger/hilt-android/2.43"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/Microsoft.VSSDK.CompatibilityAnalyzer/17.2.2197"
    },
    {
        "type": "component",
        "url": "cd:/composer/packagist/cakephp/cakephp/4.4.3"
    },
    {
        "type": "component",
        "url": "cd:/sourcearchive/mavencentral/org.eclipse.leshan/leshan-client-cf/2.0.0-M8"
    },
    {
        "type": "component",
        "url": "cd:/sourcearchive/mavencentral/org.eclipse.leshan/leshan-server-cf/1.4.1"
    },
    {
        "type": "component",
        "url": "cd:/nuget/nuget/-/Microsoft.IdentityModel.Tokens/6.22.0"
    },
    {
        "type": "component",
        "url": "cd:/pypi/pypi/-/pytorch-ignite/0.5.0.dev20220727"
    }
]
```
These ran fine in my local dev environment. If some of the processors failed, the failures should be recorded in deadletters.

@MichaelTsengLZ (Contributor) commented:

I haven't pushed your commits from dev to prod yet because I found the crawler dev environment was down yesterday. I'm trying to figure out what's wrong on dev because it runs OK locally.

2022-07-27T22:30:29.200Z INFO  - Initiating warmup request to container cdcrawler-dev_0_359c8f30 for site cdcrawler-dev

2022-07-27T22:30:44.269Z INFO  - Waiting for response to warmup request for container cdcrawler-dev_0_359c8f30. Elapsed time = 15.0689425 sec

2022-07-27T22:30:30.636401135Z [I] appInitStart {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.637419541Z [I] creating refreshing options with crawlerName:crawler {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.638018644Z [I] creating refreshing options crawler with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.638622048Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.640544260Z [I] creating refreshing options filter with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.641123463Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.641669267Z [I] creating refreshing options fetch with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.642227270Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.642798474Z [I] creating refreshing options process with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.643492678Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.644082081Z [I] creating refreshing options queue with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.644620985Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645202988Z [I] creating refreshing options store with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645242788Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645777292Z [I] creating refreshing options deadletter with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.645912692Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.646548296Z [I] creating refreshing options lock with provider memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.646648497Z [I] creating in memory refreshing config {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.657678264Z (node:1) Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency

2022-07-27T22:30:30.657723864Z (Use `node --trace-warnings ...` to show where the warning was created)

2022-07-27T22:30:30.659623976Z [I] got refreshingOption values for crawler {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.660282980Z [I] got refreshingOption values for filter {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.660863083Z [I] got refreshingOption values for fetch {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.661446187Z [I] got refreshingOption values for process {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.662254292Z [I] got refreshingOption values for queue {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.662857895Z [I] got refreshingOption values for store {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.663532399Z [I] got refreshingOption values for deadletter {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.664121603Z [I] got refreshingOption values for lock {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.668645130Z [I] filter options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.669199734Z [I] lock options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.669836538Z [I] crawler options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.670461741Z [I] fetch options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.670477742Z [I] process options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.670483342Z [I] queue options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.671104645Z [I] store options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.671121145Z [I] deadletter options initialized {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.671692049Z [I] created all refreshingOptions {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.672271152Z [I] creating crawler {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.672352353Z [I] creating queue:storageQueue {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.679792798Z [I] creating queue:memory {"crawlerId":"1ae341d8-60df-4c64-8ce9-cd64bf5a5b63","crawlerHost":"dev","buildNumber":"20220722.1"}

2022-07-27T22:30:30.680403702Z Service initialization error: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.682179613Z TypeError: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.682198313Z     at new QueueSet (/opt/service/ghcrawler/providers/queuing/queueSet.js:9:26)

2022-07-27T22:30:30.682203913Z     at Function.createQueueSet (/opt/service/ghcrawler/crawlerFactory.js:218:12)

2022-07-27T22:30:30.682254813Z     at module.exports (/opt/service/ghcrawler/providers/queuing/memoryFactory.js:14:25)

2022-07-27T22:30:30.682262213Z     at Function._getProvider (/opt/service/ghcrawler/crawlerFactory.js:156:26)

2022-07-27T22:30:30.682266213Z     at Function.createQueues (/opt/service/ghcrawler/crawlerFactory.js:210:27)

2022-07-27T22:30:30.682270013Z     at Function.createScopedQueueSets (/opt/service/ghcrawler/crawlerFactory.js:223:40)

2022-07-27T22:30:30.682273813Z     at Function.createCrawler (/opt/service/ghcrawler/crawlerFactory.js:62:39)

2022-07-27T22:30:30.682277613Z     at /opt/service/ghcrawler/crawlerFactory.js:34:38

2022-07-27T22:30:30.682281413Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)

2022-07-27T22:30:30.682443714Z Error initializing the Express app: TypeError: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.688706852Z trackException:

2022-07-27T22:30:30.688726252Z Error: TypeError: Cannot read properties of undefined (reading 'on')

2022-07-27T22:30:30.688732152Z     at /opt/service/ghcrawler/bin/www.js:31:13

2022-07-27T22:30:30.688736252Z     at /opt/service/node_modules/express-init/lib/index.js:50:14

2022-07-27T22:30:30.688740152Z     at /opt/service/node_modules/async/dist/async.js:2959:19

2022-07-27T22:30:30.688743952Z     at wrapper (/opt/service/node_modules/async/dist/async.js:272:20)

2022-07-27T22:30:30.688748052Z     at iterateeCallback (/opt/service/node_modules/async/dist/async.js:417:21)

2022-07-27T22:30:30.688751952Z     at /opt/service/node_modules/async/dist/async.js:325:20

2022-07-27T22:30:30.688755852Z     at /opt/service/node_modules/async/dist/async.js:2957:17

2022-07-27T22:30:30.688759652Z     at /opt/service/node_modules/express-init/lib/index.js:36:20

2022-07-27T22:30:30.688763453Z     at /opt/service/ghcrawler/app.js:50:9

2022-07-27T22:30:30.688767153Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)



2022-07-27T22:30:46.292Z ERROR - Container cdcrawler-dev_0_359c8f30 for site cdcrawler-dev has exited, failing site start

2022-07-27T22:30:46.337Z ERROR - Container cdcrawler-dev_0_359c8f30 didn't respond to HTTP pings on port: 5000, failing site start. See container logs for debugging.

2022-07-27T22:30:46.344Z INFO  - Stopping site cdcrawler-dev because it failed during startup.

@qtomlinson (Collaborator, author) commented:

In the local dev env, both the local and global queues use the memory provider, but in the dev deployment we would have memory for the local queue and a storage queue for the global queue. This is the case that needs some more work.

@qtomlinson (Collaborator, author) commented:

PR #480 addresses the startup issue on the dev deployment.

@MichaelTsengLZ (Contributor) commented:

Cool. Looking at this right now.

@MichaelTsengLZ (Contributor) commented Jul 29, 2022

@qtomlinson PR #480 fixes the issue and the App Service dev environment works well. The only issue now is that when I restarted the App Service, there was no log saying "Server closed.". This means your graceful shutdown process doesn't run. I think Azure App Service doesn't use docker stop ${container} to stop the crawler.

@qtomlinson (Collaborator, author) commented:

@mpcen As a follow-up on Michael's observation (graceful shutdown not triggered during crawler restart), a similar question was raised in 2020: graceful shutdown on Azure AppServices (Linux/Docker) via IHostApplicationLifeTime.

According to documentation, an App Service restart uses docker restart. During local testing, docker restart or docker stop triggers the graceful shutdown of the crawler, and Received SIGTERM (start of shutdown) and Server closed (end of shutdown) appear in the logs. In the logs from a crawler container deployed on App Service, however, Received SIGTERM and Server closed are both missing during webapp restart and stop. Setting verbose-level logging, either via az webapp start --verbose or az webapp log config, does not yield more information. Any suggestions for further investigation?

One possible explanation is that the default time to wait for the container to exit differs. For docker stop or docker restart, the default is 10 seconds and can be modified via -t. In Azure App Service, the default is 5 seconds, according to stackoverflow. WEBSITES_CONTAINER_STOP_TIME_LIMIT can potentially be used to configure the wait. This setting is not yet documented at Environment variables and app settings in Azure App Service, and it may be worth trying once the setting is released.

The end result of the missing graceful shutdown during crawler restart is that harvests in progress before the shutdown will be partial. Currently, partial harvest cases are also present during normal production runs (see comment and additional comment). Users can re-trigger package harvesting to work around this. What do you think?
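For context, the graceful-shutdown wiring this discussion is about looks roughly like the following — a sketch with hypothetical names (`server`, `scopedQueueSets`, `registerGracefulShutdown`), where `scopedQueueSets.publish()` stands in for "move local in-memory requests to the shared global queues"; it is not the PR's exact code. The point is that none of it runs unless the SIGTERM from docker stop / docker restart actually reaches the Node process.

```js
// Sketch only; hypothetical names, not the PR's exact implementation.
function registerGracefulShutdown(server, scopedQueueSets, logger = console) {
  process.on('SIGTERM', async () => {
    logger.info('Received SIGTERM')
    try {
      // Requests still sitting on this instance's local queues are published to the
      // global queues so another crawler instance can pick them up.
      await scopedQueueSets.publish()
    } finally {
      server.close(() => {
        logger.info('Server closed.')
        process.exit(0)
      })
    }
  })
}
```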

@mpcen (Member) commented Aug 22, 2022

@qtomlinson do you know approximately how much more time we'd need to wait for the process to complete? It looks like WEBSITES_CONTAINER_STOP_TIME_LIMIT has been rolled out but just hasn't been documented. My only concern with this approach would be a potential application deadlock (if the waiting period is long) but maybe this won't be an issue since we have multiple crawler instances?

I'd say give it a go in DEV by playing around with some numbers for the time.

@qtomlinson (Collaborator, author) commented Sep 14, 2022

@mpcen Setting WEBSITES_CONTAINER_STOP_TIME_LIMIT to a longer value (e.g. 20s or higher) did not seem to help: shutdown logging is still missing from the container logs. The shutdown sequence was occasionally visible in the streaming logs (/appsvctmp/volatile/logs/runtime/6a12ea3ebdd5e50816da5d11fb57c328aaa6d8848f95df0a50a4a69a62c9b329.log), but not in the container logs accessible via the public API. The intriguing piece is that the beginning of the shutdown is missing from the log: if the shutdown time were simply too short, the beginning of the shutdown should still have been visible.

As an experiment, an nginx container image was deployed; the graceful shutdown of that container was also missing from its container logs.

Without logging, an alternative is to confirm whether the side effects of the graceful shutdown occurred. A storage queue was set up with a test crawler web app on App Service to mimic the queue used in the production environment. After shutdown, additional sub-tasks of the package harvests were observed on the shared storage queue. This means that the publishing of local in-memory tasks (from the crawler instance) to the globally shared storage queue during graceful shutdown was triggered.

The missing shutdown sequence in the container logs therefore appears to be a logging issue in App Service.

@qtomlinson (Collaborator, author) commented Nov 19, 2022

To cope with the possibility that the graceful shutdown may not be triggered, StorageBackedQueue was introduced and can be used as the local queue.

qtomlinson added a commit to qtomlinson/crawler that referenced this pull request Feb 6, 2024
…#475)

* Cache in progress fetch promises, cached fetched results
* Add ScopedQueueSets
* Publish requests on local queues to global upon crawler shutdown
* Minor refactor and add more tests
* Update docker file to relay of shutdown signal
* Add config for dispatcher.fetched cache
* Address review comments
* Removed --init option in docker run