This repository contains the code and metadata needed to build a Knowledge Graph (KG) for NASA GeneLab omics datasets hosted on the Open Science Data Repository (OSDR).
- Automated graph construction from datasets in the OSDR
- Incremental update for new datasets
- Statistical filtering of results for significance
- Species selection via a configurable whitelist
- Versioned metadata for reproducibility (v0.0.3)
- Federated query using Neo4j Fabric with the Scalable Precision Medicine Open Knowledge Engine (SPOKE) KG
Measurement | Technology | Property | Selection Criteria |
---|---|---|---|
Transcription profiling | RNA Sequencing (RNA‑Seq) | Log2 fold change | Adjusted p-value <= 0.05 |
Transcription profiling | DNA microarray | Log2 fold change | Adjusted p-value <= 0.05 |
DNA methylation profiling | Whole Genome Bisulfite Sequencing | Methylation difference % | q-value <= 0.05 |
DNA methylation profiling | Reduced‑Representation Bisulfite Sequencing (RRBS) | Methylation difference % | q-value <= 0.05 |
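The selection criteria in the table can be illustrated with a minimal Python sketch. The column names `adj_p_value` and `q_value` and the record layout are assumptions made for illustration; the repository's notebooks may name and store these statistics differently.

```python
# Cutoffs taken from the selection-criteria table above.
# The column names are hypothetical, chosen only for this sketch.
THRESHOLDS = {
    "transcription profiling": ("adj_p_value", 0.05),
    "dna methylation profiling": ("q_value", 0.05),
}


def is_significant(record, measurement):
    """Return True if the record passes the cutoff for its measurement type."""
    column, cutoff = THRESHOLDS[measurement.lower()]
    value = record.get(column)
    # Missing or non-numeric statistics are treated as not significant.
    if value is None:
        return False
    try:
        return float(value) <= cutoff
    except (TypeError, ValueError):
        return False


def filter_records(records, measurement):
    """Keep only the records that meet the significance cutoff."""
    return [r for r in records if is_significant(r, measurement)]
```

Note that the comparison is inclusive (`<=`), matching the table, so a value of exactly 0.05 is kept.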
- Fetch omics study records using the OSDR API
- Filter datasets by statistical thresholds and target species
- Map model organism genes to human genes
- Map cell and tissue types to the Cell Ontology (CL) and the Uber-anatomy Ontology (UBERON), respectively
- Export CSV files for graph database upload
- Import CSV files into a Neo4j graph database
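Two of the steps above, mapping model organism genes to human genes and exporting CSV files, can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the ortholog table, the `from`/`to` header names, and the identifiers are all placeholders.

```python
import csv
import io


def map_to_human(model_gene_ids, ortholog_table):
    """Pair each model organism gene with its human ortholog(s).

    Genes without an entry in the ortholog table are skipped.
    """
    pairs = []
    for gene_id in model_gene_ids:
        for human_id in ortholog_table.get(gene_id, []):
            pairs.append((gene_id, human_id))
    return pairs


def write_pairs_csv(pairs, handle):
    """Write (model gene, human gene) pairs as CSV with a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["from", "to"])  # hypothetical column names
    writer.writerows(pairs)


# Usage with placeholder identifiers (not real gene IDs).
orthologs = {
    "model_gene_1": ["human_gene_A"],
    "model_gene_2": ["human_gene_B", "human_gene_C"],
}
buffer = io.StringIO()
write_pairs_csv(
    map_to_human(["model_gene_1", "model_gene_2", "model_gene_3"], orthologs),
    buffer,
)
print(buffer.getvalue())
```

A one-to-many mapping produces one row per ortholog, which keeps the CSV in a simple edge-list shape.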
Figure: Schematic overview of the GeneLab knowledge graph structure, highlighting key node types (circles) and relationships (arrows).
The Assay–MEASURED–MGene relationship encodes log₂ fold changes derived from transcription profiling assays, while the Assay–MEASURED–MethylationRegion relationship captures methylation differences identified through DNA methylation assays. The MGene–METHYLATED_IN–MethylationRegion relationship links model organism genes (MGene) to 1,000 base pair genomic regions (MethylationRegion) exhibiting differential methylation.
Proxy nodes (shown in gray) represent standardized identifiers for human genes (ENTREZ ID), anatomical structures (UBERON ID), and cell types (CL ID), enabling integration with external Neo4j databases and supporting composite graph database construction.
Diagram generated using arrows.app.
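To make the schema concrete, the following sketch builds a Cypher query for the Assay–MEASURED–MGene pattern described above. The node labels and relationship type come from the schema description; the helper function itself is hypothetical, the pattern is left undirected because the relationship direction is not stated in the text, and `properties(r)` is returned rather than a named property because the property names are not given here.

```python
def measured_genes_query(limit=25):
    """Build a Cypher query and its parameters for Assay-MEASURED-MGene paths."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    query = (
        "MATCH (a:Assay)-[r:MEASURED]-(g:MGene) "
        "RETURN a, properties(r) AS measured, g "
        "LIMIT $limit"
    )
    return query, {"limit": limit}


query, params = measured_genes_query(limit=10)
print(query)
print(params)
```

In Neo4j Browser the same query text can be run after setting the parameter with `:param limit => 10`.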
The following node and relationship metadata files define the graph schema.
- Relationships: kg/v0.0.3/metadata/relationships/
The organization and conventions for defining the metadata and data are described in the kg-import Git repository.
Figure: Integration of the SPOKE and GeneLab knowledge graphs using proxy nodes.
The GeneLab graph (right), a knowledge graph representing spaceflight omics datasets, depicts key experimental entities: Assay, Study, Mission, MGene, and MethylationRegion, along with their relationships.
Proxy nodes (gray) represent external identifiers (ENTREZ, UBERON, CL) and enable linkage to the SPOKE graph (left), a rich biomedical knowledge graph comprising biological processes, molecular functions, diseases, compounds, and more. The dashed lines indicate mappings to enable the construction of a composite Neo4j graph database. The composite graph enables federated queries across multiple KGs.
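Conceptually, linking through proxy nodes amounts to a join on shared identifiers. The sketch below shows that idea in plain Python with placeholder data. It is not how Neo4j executes federated queries, and the node names and identifiers are invented for illustration.

```python
def link_via_proxy(genelab_edges, spoke_nodes):
    """Join GeneLab proxy identifiers to SPOKE nodes with the same identifier.

    genelab_edges: iterable of (genelab_node, proxy_id) pairs
    spoke_nodes:   dict mapping identifier -> SPOKE node name
    """
    linked = []
    for genelab_node, proxy_id in genelab_edges:
        # Only identifiers present in both graphs produce a link.
        if proxy_id in spoke_nodes:
            linked.append((genelab_node, proxy_id, spoke_nodes[proxy_id]))
    return linked


# Placeholder data: identifiers and names are not real.
genelab = [("mgene_1", "ENTREZ:0001"), ("mgene_2", "ENTREZ:0002")]
spoke = {"ENTREZ:0001": "spoke_gene_X"}
print(link_via_proxy(genelab, spoke))
```

Because the join key is a standardized identifier, neither graph needs to store the other's nodes; only the proxy identifiers have to agree.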
- Download the Neo4j Desktop application from the Neo4j Download Center and follow the installation instructions.
- When the installation is complete, Neo4j Desktop will launch. Click the New button to create a new project.
- Hover the cursor over the created project, click the edit button, and change the project name from Project to spoke-genelab.
- Click the ADD button and select Local DBMS. Select Neo4j version 5.23.0.
- Enter the password neo4jdemo and click Create.
- Select Terminal to open a terminal window.
- Type pwd in the terminal window to show the path to the NEO4J_INSTALL_PATH directory. This path is required in the .env file (see the next section).
Prerequisites: Miniconda3 (lightweight, preferred) or Anaconda3, and Mamba (faster than Conda)
- Install Miniconda3
- Update an existing Miniconda3 installation:
conda update conda
- Install Mamba:
conda install mamba -n base -c conda-forge
- Install Git (if not installed):
conda install git -n base -c anaconda
- Clone this repository
git clone https://github.com/BaranziniLab/spoke_genelab.git
cd spoke_genelab
- Create a Conda environment
The file environment.yml specifies the Python version and all required dependencies.
mamba env create -f environment.yml
- Create an account in BioPortal and copy the API key. BioPortal is used to map terms to ontologies.
- Copy the file env_template to .env
- Edit the file .env and set the following variables:
KG version number
KG_VERSION=v0.0.3
Path to the cloned git repository
KG_GIT=/Users/.../spoke_genelab/
Path to the Neo4j instance in Neo4j Desktop. Enclose the path in quotes because it may contain spaces.
NEO4J_INSTALL_PATH="/Users/.../Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-3d4b95d1-0219-480b-a3c4-ee5a409cc383"
BioPortal API Key
BIOPORTAL_API_KEY=<bioportal api key>
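Before running the notebooks, it can help to confirm that all four variables are set. The following is a minimal standard-library sketch for that check. The parser is a simplification written for this example (the notebooks may load the file differently), and the helper names are hypothetical.

```python
from pathlib import Path

# Variable names taken from the list above.
REQUIRED = ("KG_VERSION", "KG_GIT", "NEO4J_INSTALL_PATH", "BIOPORTAL_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments.

    A matching pair of surrounding quotes is stripped from the value.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_variables(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


def check_env_file(path=".env"):
    """Read the given file and report which required variables are missing."""
    return missing_variables(parse_env(Path(path).read_text()))
```

Calling `check_env_file()` from the repository root returns an empty list when every variable has a value.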
- Start the spoke-genelab Graph DBMS
- Activate the conda environment
conda activate spoke-genelab
- Launch Jupyter Lab
jupyter lab
- Navigate to the notebooks directory and run the following notebooks:
Notebook | Description |
---|---|
1_download_datasets.ipynb | Downloads datasets |
2_create_study_mission_nodes.ipynb | Creates Study and Mission nodes and their relationships |
3_create_gene_nodes.ipynb | Creates MGene (model organism) nodes and mapped Gene (human) nodes |
4_create_assay_nodes.ipynb | Creates Assay nodes and their relationships |
5_import_to_neo4j.ipynb | Imports the formatted data into a Neo4j KG |
6_query_examples.ipynb | Runs example queries (optional) |
- When the import is completed, click the Refresh button in Neo4j Desktop. The newly created database spoke-genelab-v0.0.3 will be listed.
- Click the Open button to launch the database.
- Click on the database icon on the left.
- Use the pull-down menu to select the spoke-genelab-v0.0.3 database. Wait 30 seconds or more until the database is loaded and the nodes are listed.
- Set the Graph Stylesheet
Drag the file kg/v0.0.3/style.grass onto the Neo4j Browser window to set the node colors, sizes, and labels.
- Now you are ready to run Cypher queries on the selected database.
- When you are finished, stop the database in Neo4j Desktop.
To deactivate the conda environment, type
conda deactivate
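The notebooks in the table above are described as being run from Jupyter Lab. As an alternative, the sketch below builds `jupyter nbconvert` commands that execute the same notebooks in order from the repository root. It assumes the notebooks run without interactive input, which the source does not confirm, so it defaults to a dry run that only prints the commands.

```python
import subprocess

# Notebook order taken from the table above; the optional query
# examples notebook (6_query_examples.ipynb) is left out.
NOTEBOOKS = [
    "1_download_datasets.ipynb",
    "2_create_study_mission_nodes.ipynb",
    "3_create_gene_nodes.ipynb",
    "4_create_assay_nodes.ipynb",
    "5_import_to_neo4j.ipynb",
]


def nbconvert_command(notebook, directory="notebooks"):
    """Build the command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        f"{directory}/{notebook}",
    ]


def run_all(dry_run=True):
    """Print the commands, or run them in order and stop at the first failure."""
    for notebook in NOTEBOOKS:
        command = nbconvert_command(notebook)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

Run `run_all(dry_run=False)` inside the activated spoke-genelab environment with the Graph DBMS started, since the import notebook needs the Neo4j instance.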
- Stop the database.
- Hover the cursor over the spoke-genelab-v0.0.3 database and select Dump from the menu.
- When the dump is complete, click the Reveal files in Finder button to open the directory that contains the spoke-genelab-v0.0.3.dump file.
This database dump will be used to create the SPOKE-GeneLab composite database.
PW Rose, CA Nelson, SG Gebre, K Soman, KA Grigorev, LM Sanders, SV Costes, SE Baranzini, NASA SPOKE-GeneLab Knowledge Graph. Available online: https://github.com/BaranziniLab/spoke_genelab (2025)
CA Nelson, PW Rose, K Soman, LM Sanders, SG Gebre, SV Costes, SE Baranzini, NASA GeneLab-Knowledge Graph Fabric Enables Deep Biomedical Analysis of Multi-Omics Datasets. Available online: https://ntrs.nasa.gov/citations/20250000723 (2025)
NSF Award number 2333819, Proto-OKN Theme 1: Connecting Biomedical information on Earth and in Space via the SPOKE knowledge graph.