[messages] Switch from JSON to ZIP / NDJSON format
This is a major rewrite of the message import / export code that
switches the format from a single (standard) JSON file with embedded
Base64-encoded MMS binary data to a ZIP file containing a
newline-delimited JSON (NDJSON) file ('messages.ndjson') holding
message metadata and text data, and a 'data' directory holding the
untouched binary files stored natively by Android. The new format has a
number of advantages, as well as some disadvantages:
Advantages:
-----------
Separating (encoded) binary data from text data and metadata results in
much cleaner text, which can be much more comfortably browsed by humans.
The ZIP file format is much more flexible than the monolithic JSON file
format. E.g., additional information about the exporting system and app,
and statistics about the export run, can be easily included in another
file within the ZIP archive without substantially modifying the existing
export flow (this is not yet implemented, but will likely be in the
future).
Using ZIP files automatically provides compression, although the
reduction in file size will depend on how much of the exported data is
compressible text (i.e., metadata and text data), as opposed to binary
data, which will generally already be compressed and cannot be
compressed much further.
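As a rough illustration of this point, the following Python sketch (not part of the app) compares how well DEFLATE, the compression method most commonly used in ZIP archives, handles repetitive JSON text versus random bytes. The record fields are made up, and the random bytes are only a stand-in for already-compressed media such as JPEG attachments.

```python
import json
import os
import zlib

# Hypothetical sample data: repetitive message metadata, and random bytes
# standing in for already-compressed media.
text = "\n".join(
    json.dumps({"address": "+15550100", "body": f"message number {i}", "type": "1"})
    for i in range(1000)
).encode("utf-8")
binary = os.urandom(len(text))

# Ratio of compressed size to original size (smaller is better).
text_ratio = len(zlib.compress(text)) / len(text)
binary_ratio = len(zlib.compress(binary)) / len(binary)

print(f"text:   {text_ratio:.2f} of original size")
print(f"binary: {binary_ratio:.2f} of original size")
```

The text should shrink to a small fraction of its original size, while the random bytes should stay at roughly their original size.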
Not including the binary data in the (ND)JSON eliminates the need to
read entire binary files into RAM at one time, resulting in much more
efficient RAM usage. This fixes #84, which was the initial impetus
for the format change.
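The idea can be sketched in Python (an illustration only, not the app's code): a binary part is copied into the archive's 'data' directory in fixed-size chunks, so only one chunk is held in memory at a time. The helper name add_binary_part, the chunk size, and the file name PART_0001.jpg are all assumptions made for this example.

```python
import io
import shutil
import zipfile

CHUNK_SIZE = 64 * 1024  # hypothetical buffer size


def add_binary_part(archive, name, source):
    """Copy a binary stream into the archive's 'data' directory in chunks."""
    with archive.open(f"data/{name}", mode="w") as entry:
        # copyfileobj reads and writes CHUNK_SIZE bytes at a time, so the
        # whole file never has to be in memory at once.
        shutil.copyfileobj(source, entry, CHUNK_SIZE)


payload = bytes(range(256)) * 1024  # stand-in for an MMS attachment
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    add_binary_part(archive, "PART_0001.jpg", io.BytesIO(payload))

# Read the entry back to confirm the bytes are untouched.
with zipfile.ZipFile(buffer) as archive:
    restored = archive.read("data/PART_0001.jpg")
```

By contrast, embedding the same data as Base64 in a JSON string requires the entire encoded file to exist as a single value.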
NDJSON allows the reading of message records one at a time, eliminating
the need to use JSON streaming (see #6), resulting in much simpler and
cleaner code.
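A minimal Python sketch of this record-at-a-time reading (illustrative only; the app itself is not written in Python): each line of 'messages.ndjson' is a complete JSON document, so an ordinary JSON parser can be applied line by line. The field names 'address' and 'body' and the helper name iter_messages are assumptions for the example.

```python
import io
import json
import zipfile


def iter_messages(archive):
    """Yield one decoded message record per NDJSON line."""
    with archive.open("messages.ndjson") as raw:
        for line in io.TextIOWrapper(raw, encoding="utf-8"):
            if line.strip():
                yield json.loads(line)


# Build a tiny archive in memory to read back.
records = [
    {"address": "+15550100", "body": "first"},
    {"address": "+15550101", "body": "second\nwith an escaped newline"},
]
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "messages.ndjson",
        # json.dumps escapes embedded newlines, so each record is one line.
        "".join(json.dumps(r) + "\n" for r in records),
    )

with zipfile.ZipFile(buffer) as archive:
    loaded = list(iter_messages(archive))
```

Only one record needs to be parsed and held in memory at a time, and no streaming JSON parser is involved.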
Disadvantages:
--------------
The ZIP file format adds code complexity.
NDJSON is less common than standard JSON.
NDJSON is less human-readable than the pretty-printed JSON
previously used (since NDJSON records cannot contain newlines), although
this can be easily mitigated by simply running 'jq < messages.ndjson' to
pretty-print the NDJSON.
Additional Changes:
-------------------
An additional change in this commit is the prefixing of a double
underscore to all (ND)JSON attributes added by the app (e.g.,
'__display_name', '__parts'), in order to clearly indicate that these
have been added by the app and are not the names of columns in the
Android message database tables.
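One consequence of the prefix is that a consumer of the export can separate database columns from app-added attributes without a hard-coded list of names. A small Python sketch, assuming a made-up record and a hypothetical helper split_record:

```python
APP_PREFIX = "__"


def split_record(record):
    """Split a record into database columns and app-added attributes."""
    columns = {k: v for k, v in record.items() if not k.startswith(APP_PREFIX)}
    extras = {k: v for k, v in record.items() if k.startswith(APP_PREFIX)}
    return columns, extras


# Hypothetical record; only '__display_name' is a name taken from this commit.
record = {"address": "+15550100", "body": "hello", "__display_name": "Alice"}
columns, extras = split_record(record)
```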
Bugs:
-----
The current implementation of the new format works, but import
performance is unacceptably poor for large message collections. This is
apparently a consequence of the use of the InputStream paradigm
(required by Android's Storage Access Framework) to access the ZIP file,
which allows only sequential access, not random access, so accessing
each binary data file requires a sequential read from the beginning of
the ZIP file. This should be fixed in a subsequent commit.
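The cost of this access pattern can be modeled with a short Python sketch. This is a simplified model rather than the app's code: it counts archive entries visited instead of bytes read, and the entry names are made up.

```python
def scans_with_restart(entry_names, wanted):
    """Count entries visited when every lookup restarts at the first entry."""
    visited = 0
    for name in wanted:
        for candidate in entry_names:
            visited += 1
            if candidate == name:
                break
    return visited


# Hypothetical archive with 1000 binary data files, each looked up once.
entry_names = [f"data/PART_{i}" for i in range(1000)]
visited = scans_with_restart(entry_names, entry_names)
print(visited)  # 500500, i.e. n * (n + 1) / 2 for n = 1000
```

Under this model the work grows quadratically with the number of binary data files, which is consistent with the slowdown being noticeable mainly for large message collections.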
Closes: #6, #84