Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add init command and migrations #13

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

edsu
Copy link
Contributor

@edsu edsu commented Oct 18, 2023

This commit adds a new init command which will initialize a SQLite database with the canonical warcdb schema. The inital schema was derived from importing tests/google.warc and using its schema as a starting place. The init step was added to the unit test, and the tests/apod.warc.gz file was added to the list of files that are tested so we can see that it works.

For command line users Initializing the database with init is required prior to running import. The database schema is managed with sqlite-migrate which is kind of new, but seems to work well.

So the process for working with warcdb is to:

$ warcdb init warc.db
$ warcdb import warc.db google.warc.gz

Then if you you want to update warcdb and apply migrations you can:

$ pip install --upgrade warcdb
$ warcdb migrate warc.db

Closes #12
Closes #6

@edsu edsu force-pushed the init-schema branch 2 times, most recently from caa91ce to 4f092da Compare October 18, 2023 21:21
@edsu edsu marked this pull request as ready for review October 18, 2023 21:21
@edsu edsu force-pushed the init-schema branch 2 times, most recently from fa5448a to 4e6924c Compare October 19, 2023 13:46
@edsu
Copy link
Contributor Author

edsu commented Oct 19, 2023

@Florents-Tselai one thing I was wondering is if maybe we should normalize the column names, so that they look a bit more like what we would expect to see in a database?

So instead of WARC-Record-ID it could be warc_record_id?

You may notice that currently we have some inconsistency, payload, http_headers. along with the capitalized ones.

This commit adds a new `init` command which will initialize a SQLite
database with the schema. Initializing the database is required prior to
running `import`. The database schema is managed with sqlite-migrate
which is kind of new, but seems to work well.

So the process for working with warcdb is to:

```bash
$ warcdb init warc.db
$ warcdb import warc.db google.warc.gz
```

Then if you you want to update warcdb and apply migrations you can:

```
$ pip install --upgrade warcdb
$ warcdb migrate warc.db
```

Closes Florents-Tselai#12
def test_import(warc_path):
runner = CliRunner()

with runner.isolated_filesystem() as fs:
DB_FILE = "test_warc.db"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to remove this runner.isolated_filesystem context manager because using the runner twice (once to init and then again to import) worked, but printed this annoying warning at the end of the run:

(warcdb-py3.11) ➜  WarcDB git:(init-schema) ✗ /Users/edsummers/.pyenv/versions/3.11.2/lib/python3.11/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching.  Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/Users/edsummers/.pyenv/versions/3.11.2/lib/python3.11/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-6s_7o47z'

So instead I just removed the test db file at the end of each run.

from warcio import ArchiveIterator, StatusAndHeaders
from warcio.recordloader import ArcWarcRecord

from warcdb.migrations import migration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed unused imports and sorted them using isort.

Initialize a new warcdb database
"""
db = WarcDB(db_path)
migration.apply(db.db)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new command to initialize a database using the migrations.

@Florents-Tselai
Copy link
Owner

@Florents-Tselai one thing I was wondering is if maybe we should normalize the column names, so that they look a bit more like what we would expect to see in a database?

So instead of WARC-Record-ID it could be warc_record_id?

You may notice that currently we have some inconsistency, payload, http_headers. along with the capitalized ones.

Yes, in the first Iteration, I just dragged and drop the fields as they appear in the WARC spec, but hyphes complicate things a lot. Let's open this in a separate issue to discuss it; probably use camelCase instead of hyphens.

@@ -51,8 +46,7 @@ def record_payload(self: ArcWarcRecord):
@cache
def record_as_dict(self: ArcWarcRecord):
"""Method to easily represent a record as a dict, to be fed into db_utils.Database.insert()"""

return dict(self.rec_headers.headers)
return {k.lower().replace('-', '_'): v for k, v in self.rec_headers.headers}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column names are normalized by lower casing and replacing '-' with '_'. So WARC-Record-Id will be warc_record_id.

@edsu
Copy link
Contributor Author

edsu commented Oct 20, 2023

Oops I didn't see your comment beforehand. Hopefully snake_case works for column names? If not I can back this commit out, and we can address separately.

This commit normalizes the column names so that they are lowercased and
have underscores instead of dashes. Hopefully it's not disruptive for
existing uses of warcdb!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the warc-info record from google.warc and saved as no-warc-info.warc to test whether the import works when warc-info isn't present. Just to cut down on the size of the repo.

@Florents-Tselai
Copy link
Owner

I agree with the workflow of init & migrate;
But for first-time users I'd prefer to have something as a default to the latest schema or something. For first-time users, we should have as few keystrokes as possible.

But I'm merging this to unblock you and maybe we can circle back again.

@Florents-Tselai Florents-Tselai merged commit 7da8f4d into Florents-Tselai:main Oct 20, 2023
2 checks passed
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Importing fails if a WARC file misses some records. Remove unused imports
2 participants