As part of the EJP-RD, and as implemented in Solve-RD, we are developing a metadata database to track and find samples processed by CNAG and submitted to the EGA.
The core model for this database is designed to store sample, subject and file metadata using existing standards.
The entities in the model are:
- Study - Container for all activities. Contains datasets
- Organisation - Organisation involved in the study
- Subject - Human subjects, typically patients or family members
- SubjectInfo - Extra information about subject
- Sample - Samples used as input for the analysis
- File - Individual files on file systems, recorded so they can be located again and linked to the datasets describing them.
- Filetype - type of files (e.g. BAM, gVCF, phenopacket, BED, etc.)
- Person - Researcher or other person involved in the study
- Job - Jobs used to process sample data
- Run - Container of jobs
- Dataset - Collection of files, collected in the context of a study. Could also be called a 'fileset' if that name is preferred
- Publication - Publication linked to subject and/or variant
- LabInfo - Information about the lab process (barcodes, sequencer, etc.)
- SequencingTechniqueType - Sequencing technique types (in CNAG batchfile = ExpType)
- Variant - Identifier of an allele/genotype (HGVS)
- VariantTypes - Sequence variant types
- ClinicalClassification - Clinical Classification (1,2,3,4,5)
- GenomeBuild - Human reference sequence used in UCSC
- Library - Information for library used in experiment
- Library Source - Library source, e.g. Genomic/Transcriptomic
- European Reference Networks - European Reference Networks, source: https://ec.europa.eu/health/ern/networks_en
- Tissue Types - Tissue types, source: GTEx, https://www.gtexportal.org/home/tissueSummaryPage
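The containment relations named in the list (a Study contains Datasets, a Dataset collects Files, a Run contains Jobs) can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions: the field names (`name`, `path`, `filetype`) and the helper `find_files` are hypothetical and are not taken from the actual model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    path: str      # hypothetical field: location on a file system
    filetype: str  # e.g. "BAM", "gVCF", "BED"


@dataclass
class Dataset:
    name: str
    files: List[File] = field(default_factory=list)  # a Dataset is a collection of Files


@dataclass
class Study:
    name: str
    datasets: List[Dataset] = field(default_factory=list)  # a Study contains Datasets


@dataclass
class Job:
    name: str


@dataclass
class Run:
    name: str
    jobs: List[Job] = field(default_factory=list)  # a Run is a container of Jobs


def find_files(study: Study, filetype: str) -> List[File]:
    """Return every File of the given type across all Datasets of a Study."""
    return [f for d in study.datasets for f in d.files if f.filetype == filetype]
```

In the real database these relations would be expressed as references between entities rather than nested Python objects; the sketch only shows which entity holds which.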
Code lists (ontologies):
- anatomicalLocation - Code list for the anatomical location used for sampling, e.g. Blood
- dataUseConditions - Code list describing different types of conditions to access the data
- disease - ICD-10 codes; example_data covers C00 to C06.2
- materialType - Code list for materialType, e.g. DNA
- phenotype - Code list for phenotype, e.g. HPO term
- Sex - Code list for sex, e.g. 'M'
The default import format for MOLGENIS is 'EMX'. This is a flexible spreadsheet format (Excel, CSV) that allows you to annotate your data with a data model. This works because you can tell MOLGENIS the 'model' of your data via a special sheet named 'attributes'.
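To make the role of the 'attributes' sheet concrete, here is a minimal Python sketch that writes one as CSV for two of the entities above. The column names (`entity`, `name`, `dataType`, `idAttribute`, `refEntity`) follow the EMX attribute options as we understand them and should be checked against the MOLGENIS documentation; the attribute names `id` and `subject` are hypothetical and not the actual model.

```python
import csv
import io

# Columns of the EMX 'attributes' sheet used in this sketch.
FIELDNAMES = ["entity", "name", "dataType", "idAttribute", "refEntity"]

# Hypothetical attributes: each row describes one attribute of one entity.
ATTRIBUTE_ROWS = [
    {"entity": "Subject", "name": "id", "dataType": "string",
     "idAttribute": "TRUE", "refEntity": ""},
    {"entity": "Sample", "name": "id", "dataType": "string",
     "idAttribute": "TRUE", "refEntity": ""},
    # An xref attribute points at another entity, here Sample -> Subject.
    {"entity": "Sample", "name": "subject", "dataType": "xref",
     "idAttribute": "", "refEntity": "Subject"},
]


def attributes_csv(rows) -> str:
    """Serialise attribute rows to the CSV text of an 'attributes' sheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(attributes_csv(ATTRIBUTE_ROWS))
```

Saved as `attributes.csv` next to one data file per entity, a sheet like this is what tells MOLGENIS how to interpret the data columns.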
Entities not in use:
- Relation - Family entity relationship
- Disease inheritance - Description of known inheritance linked to disease and possibly mutation
- VariantInfo - Extra information about variant