How to connect db entries from the table "sites" to a belonging warc-file? #156

Open
mxnx1 opened this issue Jun 17, 2019 · 2 comments


mxnx1 commented Jun 17, 2019

Hi brozzler-team,

I want to export the database entries that belong to a specific WARC file from the tables jobs, sites and pages.
I know how to connect those tables to each other, but I couldn't find a connection to the table captures or directly to the corresponding WARC file.

Is the link made via the "WARC-Date" in the warcinfo record of the WARC file and "last_claimed" in the table sites?

A hint would be great. Thx.

Contributor

nlevitt commented Jun 17, 2019

You can set the warc prefix using warcprox-meta as shown here: https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#using-warcprox-meta

If you don't, captures from all your jobs and sites will be mixed together in the same warcs.

Author

mxnx1 commented Jun 18, 2019

Thank you for your reply.
I use the warc-prefix, but I have several WARC files with the same warc-prefix. They differ only in a timestamp and an ID, both of which are generated automatically.

Example WARC file names:
Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
Chile_Google_Search_Countries-20190526122808538-7aek1ud9-00000.warc.gz


On the brozzler dashboard, navigation to the captured content goes
via jobs - sites - pages - wayback,
so the table entries are explicitly connected to the corresponding WARC files.

I understand how the tables jobs, sites and pages are connected: via job_id and site_id.
But I am wondering how brozzler connects the WARC files to its table entries (jobs, sites, pages).

  • I can imagine an improvised connection, but it is not very explicit:
    a combination of the warc-prefix and the "last_claimed" date from the table sites, used to find the matching WARC file via its filename or its WARC-Date.
    But the WARC-Date (warcinfo record) and last_claimed (table sites) are not exactly the same; they differ by one second.

  • I am missing an explicit corresponding field.
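
The improvised match described in the first bullet could be sketched in Python roughly as follows. This is only a heuristic under stated assumptions, not how brozzler itself links table entries to WARC files: the filename pattern is inferred from the two example names above, the 17-digit timestamp is assumed to be UTC with millisecond precision, and `parse_warc_name`, `candidate_warcs` and the 5-second tolerance are hypothetical. The site is assumed to be available as a plain dict whose `last_claimed` is a timezone-aware datetime.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern inferred from the example filenames in this thread (an assumption):
# <warc-prefix>-<17-digit timestamp>-<random id>-<5-digit serial>.warc.gz
WARC_NAME = re.compile(
    r"^(?P<prefix>.+)-(?P<ts>\d{17})-(?P<rand>[a-z0-9]+)-(?P<serial>\d{5})\.warc(\.gz)?$"
)


def parse_warc_name(name):
    """Split a WARC filename into (prefix, timestamp), or return None."""
    m = WARC_NAME.match(name)
    if m is None:
        return None
    ts = m.group("ts")
    # First 14 digits: YYYYmmddHHMMSS, last 3 digits: milliseconds.
    when = datetime.strptime(ts[:14], "%Y%m%d%H%M%S")
    when += timedelta(milliseconds=int(ts[14:]))
    # Assumed to be UTC, which agrees with the WARC-Date in the example record.
    return m.group("prefix"), when.replace(tzinfo=timezone.utc)


def candidate_warcs(site, warc_names, tolerance=timedelta(seconds=5)):
    """Return WARC names whose prefix matches the site and whose filename
    timestamp lies within `tolerance` of the site's last_claimed time."""
    prefix = site["warcprox_meta"]["warc-prefix"]
    claimed = site["last_claimed"]
    hits = []
    for name in warc_names:
        parsed = parse_warc_name(name)
        if parsed is None:
            continue
        name_prefix, created = parsed
        if name_prefix == prefix and abs(created - claimed) <= tolerance:
            hits.append(name)
    return hits


# Values copied from the sites entry and filenames quoted in this comment.
site = {
    "warcprox_meta": {"warc-prefix": "Chile_Google_Search_Countries"},
    "last_claimed": datetime(2019, 6, 9, 20, 26, 49, tzinfo=timezone.utc),
}
names = [
    "Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz",
    "Chile_Google_Search_Countries-20190526122808538-7aek1ud9-00000.warc.gz",
]
matches = candidate_warcs(site, names)
```

With the values from this thread, `matches` contains only the 20190609 file. The tolerance absorbs the one-second difference mentioned above, but the result is best treated as a list of candidates rather than an explicit link, since last_claimed only records the most recent claim.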

I need this connection to export from the database the information (in jobs, sites, pages) that belongs to each WARC file.

Can you tell me how brozzler connects the WARC files to its table entries in jobs, sites and pages?


part of sites entry:

"active_brozzling_time": 31.814205646514893 ,
"claimed": false ,
"cookie_db": <binary, 20.0KB, "53 51 4c 69 74 65..."> ,
"id": "7133eeeb-9e57-4ccf-837d-08e427c1a4fa" ,
"ignore_robots": true ,
"job_id": "google_search_countries_09062019" ,
"last_claimed": Sun Jun 09 2019 20:26:49 GMT+00:00 , 				
"last_claimed_by": "xxxxxxx" ,
"last_disclaimed": Sun Jun 09 2019 20:27:21 GMT+00:00 , 
....

"warcprox_meta": {
"warc-prefix": "Chile_Google_Search_Countries"       				
}

example of warcinfo record:

WARC/1.0
WARC-Record-ID: urn:uuid:4cd61096-662e-442f-ad1a-fe5eb870cae4
WARC-Type: warcinfo
WARC-Filename: Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
WARC-Date: 2019-06-09T20:26:50Z
Content-Type: application/warc-fields
Content-Length: 99

software: warcprox 2.4b6
hostname: xxxxxxx
ip: xxxxxxxx
format: WARC File Format 1.0
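
To compare WARC-Date with last_claimed without relying on the filename, the header fields of the first record in a file can be read with the Python standard library alone. The following is a minimal sketch, assuming the file is a gzip-compressed WARC whose first record is the warcinfo record, as in the example above; `read_first_record_headers` is a hypothetical helper, not part of brozzler or warcprox.

```python
import gzip


def read_first_record_headers(path):
    """Return the WARC header fields of the first record in a .warc.gz file."""
    headers = {}
    with gzip.open(path, "rb") as fh:
        # The first line of a record is the version line, e.g. b"WARC/1.0".
        version = fh.readline().strip()
        if not version.startswith(b"WARC/"):
            raise ValueError("file does not start with a WARC record")
        for raw in fh:
            line = raw.strip()
            if not line:
                # A blank line ends the header block.
                break
            # Split at the first colon only; values such as
            # "urn:uuid:..." contain further colons.
            key, _, value = line.partition(b":")
            headers[key.decode("utf-8").strip()] = value.decode("utf-8").strip()
    return headers
```

For the example record, the returned dict would hold the WARC-Filename and WARC-Date values shown above, which can then be compared with the site's last_claimed using some tolerance.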
