Disk-Based Direct Address Table Toolkit

Overview

A Python-based tool for creating and querying a key-value mapping backed by disk-based direct address tables. This system uses disk seeking to manage key-value data efficiently, making it possible to query large datasets without loading everything into memory.

Features

Create keys.bin and values.bin files for efficient key-value storage.
Leverages disk seeking for fast data retrieval without the need for full in-memory loading.
Supports arbitrary-length values stored in values.bin.

How It Works

Data Assembly

Use the dat_writer.py script to process a tab-separated file (e.g. python dat_writer.py example_data.psv)

This creates 2 files that use direct-addressing to create a key-value mapping:

keys.bin: Maps integer keys to corresponding a byte position in values.bin.
values.bin: Stores associated values with length-prefixed encoding for variable-length data.

Querying

Use the key_lookup.py client to retrieve values by key. For example:

	$ python key_lookup.py 313
	Train like a warrior, think like a sage

For a given key k, the client will:

Seek to the k^th byte in keys.bin.
Retrieve the position of the corresponding value in values.bin.
Seek to the value position in values.bin and read the value.

Setup

Prerequisites

Python 3.x
Basic knowledge of Python file operations

Tradeoffs, Limitations, and Future Work

Tradeoffs

Disk I/O vs Memory Usage:
- This tool trades memory efficiency for disk I/O overhead. While it can handle datasets larger than available memory, disk seeks are inherently slower than in-memory operations. This makes it ideal for scenarios where memory is constrained but not for workloads requiring extremely high throughput.
Sparse Keys:
- If the key space is sparse (e.g., keys 1, 1000, 1000000), the size of keys.bin can grow significantly due to wasted space in unused positions. This tradeoff simplifies direct address computation but can lead to inefficient storage for certain datasets.
Lack of Indexing Flexibility:
- The current approach uses direct addressing. While this provides fast lookup, it lacks more advanced indexing features, such as range queries or ordered traversal, which are available in database systems or tree-based structures.

Limitations and Considerations for Future Work

Sequential Access:
- The tool is optimized for random access. If the workload primarily involves sequential access, other data structures (e.g., B-trees, hash maps) or databases (e.g., SQLite) may perform better.
Key Space Constraints:
- The tool assumes that the key space is bounded and uses direct mapping to byte positions. For unbounded or unpredictable key spaces, a hash-based or more dynamic approach may be necessary.
Performance on Slow Disks:
- Performance heavily depends on the underlying disk. Using HDDs instead of SSDs may significantly increase seek times, reducing overall efficiency.
Error Handling:
- Currently, the tool assumes well-formed input data and does not handle edge cases like corrupt files, missing keys, or invalid key lookups. These areas could be improved.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.gitignore		.gitignore
README.md		README.md
dat_writer.py		dat_writer.py
example_data.tsv		example_data.tsv
key_lookup.py		key_lookup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Disk-Based Direct Address Table Toolkit

Overview

Features

How It Works

Data Assembly

Querying

Setup

Prerequisites

Tradeoffs, Limitations, and Future Work

Tradeoffs

Limitations and Considerations for Future Work

About

Releases

Packages

Languages

MichaelPGriffin/DAT_tools

Folders and files

Latest commit

History

Repository files navigation

Disk-Based Direct Address Table Toolkit

Overview

Features

How It Works

Data Assembly

Querying

Setup

Prerequisites

Tradeoffs, Limitations, and Future Work

Tradeoffs

Limitations and Considerations for Future Work

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages