Performance Suggestions? #185

Open
rovo79 opened this issue Dec 10, 2019 · 1 comment

Comments

@rovo79

rovo79 commented Dec 10, 2019

Hello,
I've been using brozzler-easy for testing, and brozzler looks to be working wonderfully. I have a very large website I am trying to archive, and there are a few things I can't figure out from job-conf.rst.

I'm running a local version of the website on my local machine, so the site is not being served from its public domain. Is there a way to get brozzler to replace my localhost domain with the actual public domain?

Another question: is there any way to boost performance, possibly by configuring it to use more threads? Currently, when I set up a brozzler job and monitor it in the Brozzler Dashboard, it shows two sites being actively crawled. Is that an example of brozzler running two threads to crawl the site?

Maybe there's a writeup somewhere explaining optimal ways to use brozzler on a local machine?

I'd greatly appreciate any insights. Sorry to post this here; I'm not sure how else to get in touch with people on this project.

Thank you.

@nlevitt
Contributor

nlevitt commented Dec 23, 2019

> I'm running a local version of the website on my local machine, so the site is not being served from its public domain. Is there a way to get brozzler to replace my localhost domain with the actual public domain?

Neither brozzler nor warcprox has that functionality built in, but it sounds doable with /etc/hosts.
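
A rough sketch of what the /etc/hosts approach could look like, assuming the public domain is `www.example.org` (a placeholder; the thread does not name the real one) and the local copy is served from the same machine that runs brozzler and warcprox:

```
# /etc/hosts
# Hypothetical entry: resolve the public hostname to the local machine
127.0.0.1   www.example.org
```

This only changes name resolution, so the local server would presumably also need to listen on the port the public URLs use and answer to that hostname. The entry should be removed once the crawl is finished.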

> Another question: is there any way to boost performance, possibly by configuring it to use more threads?

You can configure the number of browsers running simultaneously with the -n,--max-browsers option. But only one browser at a time will work on a single site. You might need to reorganize your crawl if you want more parallelization (depending on what you're doing).
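
One way to read "reorganize your crawl" is to split a large site into several seeds so that more than one browser has something to work on. The sketch below is only an illustration under that assumption: the helper `build_job_conf`, the job id, the domain, and the section paths are all hypothetical, and it relies on the `id` and `seeds`/`url` keys described in job-conf.rst, with each seed handled as a separate site.

```python
def build_job_conf(job_id, base_url, section_paths):
    """Return job-conf YAML text with one seed per site section.

    Hypothetical helper: brozzler does not ship this function.
    """
    base = base_url.rstrip("/")
    lines = [f"id: {job_id}", "seeds:"]
    for path in section_paths:
        # One seed per section, so each section can be crawled
        # by a different browser at the same time.
        lines.append(f"  - url: {base}/{path.strip('/')}/")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder domain and sections, not taken from the thread.
    print(build_job_conf("my-archive", "http://www.example.org",
                         ["news", "blog", "docs"]))
```

With three seeds like these, setting `-n,--max-browsers` to 3 or more would in principle let all three sections be crawled at once. Whether the sections cover the whole site depends on how its URLs are laid out, so the split needs checking against the real site.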

> Maybe there's a writeup somewhere explaining optimal ways to use brozzler on a local machine?

I'm afraid not, as far as I'm aware :-\
