Add init command and migrations #13

edsu · 2023-10-18T21:12:54Z

This commit adds a new init command which will initialize a SQLite database with the canonical warcdb schema. The initial schema was derived by importing tests/google.warc and using the resulting schema as a starting point. The init step was added to the unit test, and the tests/apod.warc.gz file was added to the list of tested files so we can see that it works.

For command-line users, initializing the database with init is required before running import. The database schema is managed with sqlite-migrate, which is fairly new but seems to work well.

So the process for working with warcdb is to:

$ warcdb init warc.db
$ warcdb import warc.db google.warc.gz

Then, if you want to update warcdb and apply migrations, you can:

$ pip install --upgrade warcdb
$ warcdb migrate warc.db
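
To illustrate why running init once and migrate after each upgrade is safe, here is a rough sketch of the general "apply each migration exactly once" idea. It is not sqlite-migrate's API and not the real warcdb schema: it uses only the standard-library sqlite3 module, and the table names, column names, and the `_applied_migrations` bookkeeping table are all hypothetical.

```python
import sqlite3

# Hypothetical, ordered migrations as (name, SQL) pairs.
# These are illustrative only and do not match the real warcdb schema.
MIGRATIONS = [
    ("m001_initial",
     "CREATE TABLE response (warc_record_id TEXT PRIMARY KEY, payload BLOB)"),
    ("m002_add_http_headers",
     "ALTER TABLE response ADD COLUMN http_headers TEXT"),
]


def apply_migrations(conn):
    """Apply each pending migration once and return the names that were applied."""
    # Bookkeeping table (hypothetical name) that records what has already run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS _applied_migrations (name TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM _applied_migrations")}
    applied = []
    for name, sql in MIGRATIONS:
        if name in done:
            continue  # already applied on an earlier run
        conn.execute(sql)
        conn.execute("INSERT INTO _applied_migrations (name) VALUES (?)", (name,))
        applied.append(name)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn)   # comparable to a first `warcdb init`
second_run = apply_migrations(conn)  # comparable to `warcdb migrate` with nothing new
columns = [row[1] for row in conn.execute("PRAGMA table_info(response)")]
```

In this sketch the second call is a no-op, which is the property that makes it reasonable to run a migrate step after every upgrade.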

Closes #12
Closes #6

edsu · 2023-10-19T14:00:23Z

@Florents-Tselai one thing I was wondering is if maybe we should normalize the column names, so that they look a bit more like what we would expect to see in a database?

So instead of WARC-Record-ID it could be warc_record_id?

You may notice that we currently have some inconsistency: payload and http_headers sit alongside the capitalized names.

This commit adds a new `init` command which will initialize a SQLite database with the schema. Initializing the database is required prior to running `import`. The database schema is managed with sqlite-migrate which is kind of new, but seems to work well.

So the process for working with warcdb is to:

```bash
$ warcdb init warc.db
$ warcdb import warc.db google.warc.gz
```

Then if you want to update warcdb and apply migrations you can:

```
$ pip install --upgrade warcdb
$ warcdb migrate warc.db
```

Closes Florents-Tselai#12

edsu · 2023-10-20T09:32:36Z

tests/test_warcdb.py

 def test_import(warc_path):
    runner = CliRunner()

-    with runner.isolated_filesystem() as fs:
-        DB_FILE = "test_warc.db"


I had to remove this runner.isolated_filesystem context manager because using the runner twice (once to init and then again to import) worked, but printed this annoying warning at the end of the run:

(warcdb-py3.11) ➜ WarcDB git:(init-schema) ✗ /Users/edsummers/.pyenv/versions/3.11.2/lib/python3.11/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
  File "/Users/edsummers/.pyenv/versions/3.11.2/lib/python3.11/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/mp-6s_7o47z'

So instead I just removed the test db file at the end of each run.
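
For reference, the "remove the test db file at the end of each run" approach can be sketched roughly as below. This is not the actual test code from the PR: `run_with_test_db`, `remove_if_present`, and `example_body` are hypothetical helpers, the database is placed in the system temp directory as an assumption, and plain sqlite3 stands in for the warcdb CLI calls.

```python
import os
import sqlite3
import tempfile

# Hypothetical location for the throwaway test database.
DB_FILE = os.path.join(tempfile.gettempdir(), "test_warc.db")


def remove_if_present(path):
    """Delete the file at path if it exists."""
    if os.path.exists(path):
        os.remove(path)


def run_with_test_db(test_body):
    """Run test_body(DB_FILE) and delete the database file afterwards."""
    remove_if_present(DB_FILE)  # start from a clean slate
    try:
        return test_body(DB_FILE)
    finally:
        remove_if_present(DB_FILE)  # clean up even when the body raises


def example_body(db_path):
    # Stand-in for "init then import": create a file-backed database.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()
    return os.path.exists(db_path)


existed_during_run = run_with_test_db(example_body)
```

The try/finally keeps a failed assertion from leaving a stale database behind for the next run.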

edsu · 2023-10-20T09:33:33Z

warcdb/__init__.py

+from warcio import ArchiveIterator, StatusAndHeaders
+from warcio.recordloader import ArcWarcRecord
+
+from warcdb.migrations import migration



I removed unused imports and sorted them using isort.

edsu · 2023-10-20T09:35:23Z

warcdb/__init__.py

+    Initialize a new warcdb database
+    """
+    db = WarcDB(db_path)
+    migration.apply(db.db)


This is the new command to initialize a database using the migrations.

Florents-Tselai · 2023-10-20T09:47:25Z

> @Florents-Tselai one thing I was wondering is if maybe we should normalize the column names, so that they look a bit more like what we would expect to see in a database?
>
> So instead of WARC-Record-ID it could be warc_record_id?
>
> You may notice that we currently have some inconsistency: payload and http_headers sit alongside the capitalized names.

Yes, in the first iteration I just dragged and dropped the fields as they appear in the WARC spec, but hyphens complicate things a lot. Let's open a separate issue to discuss this; we should probably use camelCase instead of hyphens.

edsu · 2023-10-20T10:13:16Z

warcdb/__init__.py

@@ -51,8 +46,7 @@ def record_payload(self: ArcWarcRecord):
 @cache
 def record_as_dict(self: ArcWarcRecord):
    """Method to easily represent a record as a dict, to be fed into db_utils.Database.insert()"""
-
-    return dict(self.rec_headers.headers)
+    return {k.lower().replace('-', '_'): v for k, v in self.rec_headers.headers}


Column names are normalized by lowercasing and replacing '-' with '_'. So WARC-Record-ID becomes warc_record_id.
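
A standalone sketch of the same normalization, for anyone who wants to check what a given header turns into. It assumes the headers arrive as (name, value) pairs, as the dict comprehension in the diff implies; `normalize_column_name`, `headers_as_row`, and the sample header values are hypothetical and not part of warcdb.

```python
def normalize_column_name(header_name):
    """Lowercase a header name and replace hyphens with underscores."""
    return header_name.lower().replace("-", "_")


def headers_as_row(headers):
    """Build a dict keyed by normalized column names from (name, value) pairs.

    If two header names normalize to the same key, the later value wins.
    """
    return {normalize_column_name(name): value for name, value in headers}


# Hypothetical sample headers, shaped like (name, value) pairs.
example_headers = [
    ("WARC-Record-ID", "<urn:uuid:1234>"),
    ("WARC-Type", "response"),
    ("Content-Length", "42"),
]
row = headers_as_row(example_headers)
```

Here `row` has the keys warc_record_id, warc_type, and content_length.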

edsu · 2023-10-20T10:16:40Z

Oops I didn't see your comment beforehand. Hopefully snake_case works for column names? If not I can back this commit out, and we can address separately.

This commit normalizes the column names so that they are lowercased and have underscores instead of dashes. Hopefully it's not disruptive for existing uses of warcdb!

edsu · 2023-10-20T11:56:01Z

tests/no-warc-info.warc

I removed the warc-info record from google.warc and saved the result as no-warc-info.warc to test whether the import works when warc-info isn't present. This was just to cut down on the size of the repo.

Florents-Tselai · 2023-10-20T13:57:05Z

I agree with the init & migrate workflow.
But for first-time users I'd prefer some default, such as falling back to the latest schema. They should need as few keystrokes as possible.

But I'm merging this to unblock you and maybe we can circle back again.

edsu force-pushed the init-schema branch 2 times, most recently from caa91ce to 4f092da on October 18, 2023 21:21

edsu marked this pull request as ready for review October 18, 2023 21:21

edsu force-pushed the init-schema branch 2 times, most recently from fa5448a to 4e6924c on October 19, 2023 13:46

edsu force-pushed the init-schema branch from 4e6924c to af8d544 on October 20, 2023 09:30

edsu force-pushed the init-schema branch from bfde7cd to f5df7a6 on October 20, 2023 10:25

Normalize column names

763582a

edsu force-pushed the init-schema branch from f5df7a6 to 763582a on October 20, 2023 11:53

edsu mentioned this pull request Oct 20, 2023

Support import from WACZ files #16

Merged

Florents-Tselai closed this Oct 20, 2023

Florents-Tselai reopened this Oct 20, 2023

Florents-Tselai merged commit 7da8f4d into Florents-Tselai:main Oct 20, 2023
2 checks passed
