Archive storage design


Josh Albrecht edited this page Dec 6, 2013 · 1 revision

Goals

Security: The system should run on devices that you physically control, and encrypting the data should be easy. See the Threat Model for more details.
Simplicity: The system should be easy to understand and interact with using standard tools.
Portability: The system should be capable of running on essentially any system.
Performance: Given our assumptions about data size and access frequency, the system should answer most queries in under 100 ms, sustain a write throughput high enough for normal personal data generation (> 100 KB/s), and have enough capacity for normal personal data (> 100 GB per year).
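As a rough sanity check on the performance targets, the arithmetic below compares the capacity goal with the write-throughput goal. It is only a sketch: it assumes decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes) and a 365-day year, and the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def average_rate_kb_per_s(bytes_per_year):
    """Average write rate, in KB/s, needed to produce bytes_per_year."""
    return bytes_per_year / SECONDS_PER_YEAR / 1000


def yearly_volume_tb(kb_per_s):
    """Data volume, in TB, produced by writing kb_per_s nonstop for a year."""
    return kb_per_s * 1000 * SECONDS_PER_YEAR / 10**12


# 100 GB per year works out to only a few KB/s on average.
capacity_rate = average_rate_kb_per_s(100 * 10**9)

# Writing at the 100 KB/s target nonstop would produce a few TB per year.
sustained_volume = yearly_volume_tb(100)

print(f"100 GB/year is about {capacity_rate:.2f} KB/s on average")
print(f"100 KB/s sustained is about {sustained_volume:.2f} TB/year")
```

Under these assumptions the throughput target sits well above the average rate implied by the capacity target, which leaves room for bursts of data generation.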

Current Design

The current Archive implementation is written in Python.
SQLite is used for storage.
All Events with the same Namespace are stored in a single database file.
All databases are stored in the same folder.
Files (binary blobs) are stored, encrypted, in a separate folder and referenced by UUID.
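The layout described above could be sketched roughly as follows, using only the standard sqlite3 module. Everything here is an assumption made for illustration rather than the actual Archive code: the `<namespace>.sqlite` file naming, the `events` table and its columns, and the function names are all hypothetical. Encryption is left out, and Namespaces are assumed to be simple names that are safe to use as file names.

```python
import os
import sqlite3


def namespace_db_path(archive_dir, namespace):
    """Hypothetical naming scheme: one database file per Namespace."""
    return os.path.join(archive_dir, namespace + ".sqlite")


def open_namespace(archive_dir, namespace):
    """Open (and create if needed) the database for one Namespace."""
    os.makedirs(archive_dir, exist_ok=True)
    conn = sqlite3.connect(namespace_db_path(archive_dir, namespace))
    # Hypothetical schema: blob_uuid points into the separate blob folder.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " timestamp REAL NOT NULL,"
        " body TEXT NOT NULL,"
        " blob_uuid TEXT)"
    )
    return conn


def add_event(conn, timestamp, body, blob_uuid=None):
    """Insert one Event and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO events (timestamp, body, blob_uuid) VALUES (?, ?, ?)",
            (timestamp, body, blob_uuid),
        )
    return cur.lastrowid
```

With this scheme, listing the archive folder shows one file per Namespace, so a backup can be a plain copy of that folder.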

Justification

Python meets our needs for portability, simplicity, and performance. Java was NOT chosen due to the annoyance of requiring a JVM everywhere. C/C++ were ruled out because of the difficulty of portability. JavaScript was also ruled out for portability reasons (Node servers on Windows and Android machines are currently janky).
SQLite was chosen over all other databases for its simplicity and portability, and because it is easy to encrypt. There is a nice package that can keep the entire database encrypted at all times, and we'll probably use that.
Given that we're using SQLite, the real choice is between storing everything in one database or splitting it across multiple databases. There doesn't seem to be any compelling reason to store all the data in the same database, and there are compelling reasons not to: SQLite locks at the file level, so separate files give better performance in many cases; smaller files are easier to back up; and corruption in one file is less likely to take everything else with it.
Given that we're using multiple databases, sticking them in the same folder seems simplest. Then we can easily back it up, see what's there, etc.
To prevent the databases from growing large, and to allow multiple databases to refer to the same file without duplicating data, it makes sense to put the binary data into its own "blob store". Another benefit is that binary files in the blob store can be virtual: you could store all Events on a single machine, even a cell phone, without using much space at all, as long as the binaries were shipped off somewhere else.
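A minimal sketch of the blob store idea, including "virtual" blobs, might look like the following. The `BlobStore` class and its method names are hypothetical, not the Archive's actual API. Encryption and the transfer of binaries to another machine are both left out: `evict` simply deletes the local copy and assumes the data already lives somewhere else.

```python
import os
import uuid


class BlobStore:
    """Hypothetical blob store: one file per blob, named by its UUID."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, blob_id):
        return os.path.join(self.root, blob_id)

    def put(self, data):
        """Store bytes and return the UUID that Events use to refer to them."""
        blob_id = str(uuid.uuid4())
        with open(self._path(blob_id), "wb") as f:
            f.write(data)
        return blob_id

    def is_local(self, blob_id):
        """False means the blob is virtual: referenced here, stored elsewhere."""
        return os.path.exists(self._path(blob_id))

    def get(self, blob_id):
        """Return the bytes, or None if the blob is not held locally."""
        if not self.is_local(blob_id):
            return None
        with open(self._path(blob_id), "rb") as f:
            return f.read()

    def evict(self, blob_id):
        """Drop the local copy; assumes the data was already shipped elsewhere."""
        if self.is_local(blob_id):
            os.remove(self._path(blob_id))
```

Because Events hold only the UUID, several databases can point at the same blob, and evicting the local copy leaves every Event row unchanged.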