Genomics ETL POC

This repository hosts an ETL pipeline designed for a toy dataset that mimics a genomics dataset.

It transforms the dataset into an analysable format that enables easy querying for the discovery of overlaps between sequences.
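As an illustration of the kind of overlap query this format is meant to support, here is a minimal sketch in plain Python. The Interval fields and the function names are hypothetical and are not taken from the repository. It also assumes that intervals are closed, so two intervals that only share an endpoint count as overlapping.

```python
from itertools import combinations
from typing import NamedTuple


class Interval(NamedTuple):
    # Hypothetical field names; the real schema may differ.
    sequence: str
    type: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """Closed intervals overlap when each one starts no later than the other ends."""
    return a.start <= b.end and b.start <= a.end


def overlapping_pairs(intervals):
    """Return every unordered pair of intervals that overlap."""
    return [(a, b) for a, b in combinations(intervals, 2) if overlaps(a, b)]
```

Comparing every pair is quadratic in the number of intervals, which is acceptable for a toy dataset; a larger dataset would likely call for a sort-based sweep or a database index instead.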

Assumptions

When cleaning this data, a number of assumptions are made:

The id column has no significance beyond being a database id
Duplicate entries have no significance
Sequence-type pairs are unique
When there are two start events, the earliest location is taken as the true start
Events of type unclear_read signal that this sequence-type exists at this location
If the start event is less than the end event, treat them in reverse
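The assumptions above can be sketched as a small cleaning function. This is not the repository's implementation: the row layout (id, sequence, type, event, location), the event names start and end, and the function name clean_events are all hypothetical. Two points are interpretations rather than facts from this README: the last rule is read here as "when the start and end locations are out of order, swap them", and multiple end events are resolved by taking the latest one, which the list above does not specify. Locations of unclear_read events are kept separately as observed positions.

```python
from collections import defaultdict


def clean_events(rows):
    """Clean raw event rows given as (id, sequence, type, event, location) tuples.

    Returns a dict keyed by (sequence, type) with start, end and observed locations.
    """
    # The id carries no meaning, so drop it; a set collapses exact duplicates.
    unique = {(seq, typ, event, loc) for _id, seq, typ, event, loc in rows}

    # Sequence-type pairs are unique, so group all events under that pair.
    grouped = defaultdict(lambda: {"start": [], "end": [], "unclear_read": []})
    for seq, typ, event, loc in unique:
        bucket = grouped[(seq, typ)]
        if event in bucket:  # ignore event names this sketch does not know
            bucket[event].append(loc)

    cleaned = {}
    for key, events in grouped.items():
        # With several start events, the earliest location wins.
        start = min(events["start"]) if events["start"] else None
        # Assumption (not in the README): with several end events, take the latest.
        end = max(events["end"]) if events["end"] else None
        # Interpretation of the last rule: swap locations that are out of order.
        if start is not None and end is not None and start > end:
            start, end = end, start
        cleaned[key] = {
            "start": start,
            "end": end,
            # unclear_read only tells us the pair exists at these locations.
            "observed": sorted(events["unclear_read"]),
        }
    return cleaned
```

For example, a pair with start events at 10 and 4 and an end event at 20 is cleaned to start 4 and end 20, and a pair recorded with start 50 and end 30 comes out as start 30 and end 50.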

To run the ETL pipeline, followed by queries showing overlaps between sequences, execute python genomics_etl/__init__.py from a terminal.
