This repository hosts an ETL pipeline for a toy dataset that mimics a genomics dataset.
It transforms the dataset into a format that is easy to query for overlaps between sequences.
When cleaning this data, a number of assumptions are made:
- The `id` column has no significance beyond being a database id
- Duplicate entries have no significance
- Sequence-type pairs are unique
- When there are two `start` events, the earliest `location` is true
- Events of type `unclear_read` signal that this sequence-type exists at this location
- If the start event is less than the end event, treat them in reverse
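
As a rough illustration of how these assumptions could translate into code, here is a minimal sketch in plain Python. It is not the repository's implementation: the row layout (dicts with `id`, `sequence`, `type`, `event` and `location` keys), the `clean` function name, and the choice to keep the latest `end` are all assumptions made for the example. Storing each interval with its smaller bound first is one possible reading of the final rule, and widening the interval to cover `unclear_read` locations is one possible reading of the rule before it.

```python
from collections import defaultdict


def clean(records):
    """Collapse raw event rows into one (low, high) interval per (sequence, type).

    `records` is a hypothetical list of dicts with the keys
    id, sequence, type, event and location.
    """
    # Ignoring `id` and collecting rows into a set drops exact duplicates.
    unique_rows = {
        (r["sequence"], r["type"], r["event"], r["location"]) for r in records
    }

    # Group locations by (sequence, type) pair, then by event name.
    grouped = defaultdict(lambda: defaultdict(list))
    for sequence, type_, event, location in unique_rows:
        grouped[(sequence, type_)][event].append(location)

    intervals = {}
    for key, events in grouped.items():
        points = []
        if events["start"]:
            points.append(min(events["start"]))  # earliest start wins
        if events["end"]:
            points.append(max(events["end"]))  # assumption: keep the latest end
        # Assumed reading: an unclear_read location belongs to the interval.
        points.extend(events["unclear_read"])
        if points:
            # Store bounds in ascending order, even if start came after end.
            intervals[key] = (min(points), max(points))
    return intervals


if __name__ == "__main__":
    sample = [
        {"id": 1, "sequence": "seq_a", "type": "exon", "event": "start", "location": 10},
        {"id": 2, "sequence": "seq_a", "type": "exon", "event": "start", "location": 50},
        {"id": 3, "sequence": "seq_a", "type": "exon", "event": "end", "location": 40},
    ]
    print(clean(sample)[("seq_a", "exon")])  # (10, 40)
```
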

To set up the project:
- Install Miniconda
- Create an environment: `conda create -n genomics_etl python=3.9 -y`
- Activate your new environment: `conda activate genomics_etl`
- Install this package in editable mode: `pip install -e .`

To run the ETL pipeline, together with queries showing overlaps between sequences, execute `python genomics_etl/__init__.py` from a terminal.
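
For readers who want a feel for what an overlap query computes, the following is a small sketch in plain Python. The pipeline's own queries may work quite differently. The sketch assumes, purely for illustration, that the cleaned data can be represented as a mapping from hypothetical `(sequence, type)` keys to closed `(low, high)` location intervals; `find_overlaps` is an invented name.

```python
from itertools import combinations


def find_overlaps(intervals):
    """Return (key_a, key_b, shared_interval) for every overlapping pair.

    `intervals` maps hypothetical (sequence, type) keys to closed
    (low, high) location intervals.
    """
    overlaps = []
    # Sorting makes the output order deterministic.
    for (key_a, (lo_a, hi_a)), (key_b, (lo_b, hi_b)) in combinations(
        sorted(intervals.items()), 2
    ):
        # Two closed intervals overlap when each starts before the other ends.
        if lo_a <= hi_b and lo_b <= hi_a:
            shared = (max(lo_a, lo_b), min(hi_a, hi_b))
            overlaps.append((key_a, key_b, shared))
    return overlaps


if __name__ == "__main__":
    example = {
        ("seq_a", "exon"): (10, 40),
        ("seq_b", "exon"): (20, 35),
        ("seq_c", "exon"): (41, 60),
    }
    for key_a, key_b, shared in find_overlaps(example):
        print(key_a, key_b, shared)
```

The pairwise comparison is quadratic in the number of intervals, which is fine for a toy dataset; a larger dataset would normally call for a sorted sweep or an interval tree instead.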