-
Notifications
You must be signed in to change notification settings - Fork 0
Schema Extensions
GLUE's core schema is designed to organize virus nucleotide data, serving as the foundation for the standard, built-in analysis functionality across all GLUE projects. While the core schema is essential, many analyses require additional, project-specific data items that are not part of the core schema. These auxiliary data objects often have structured relationships with the core schema, enhancing the value of the nucleotide data.
To accommodate these auxiliary data, GLUE allows schema extensions. Schema extensions enable the addition of new fields to existing objects, the creation of custom tables (new object types), and the establishment of custom relational links between tables. This flexible mechanism allows users to extend the database schema on a per-project basis, using a streamlined set of concepts and commands.
Schema extensions in GLUE can be applied to model various types of data, including:
- Host species of the virus, grouped taxonomically.
- Patient-related data, such as sequences collected from the same patient at different times.
- Associations between sequences originating from the same sample, e.g., different segments of a segmented virus.
The Dengue-GLUE project illustrates the use of schema extensions. In this project, several custom fields are added to the standard GLUE Sequence table. Additionally, each Sequence can be linked to a country of origin, a custom object type defined within the schema extension. Countries are further classified into geographical regions, based on the United Nations M49 standard. This standard, also known as the "UN M49," assigns numeric codes to countries, areas, and regions for statistical analysis and comparison. These codes are widely used in international data exchange, statistical reporting, and analysis. The M49 classification includes:
- Country codes: Numeric codes assigned to individual countries.
- Area codes: Codes assigned to regions or areas that may encompass multiple countries.
- Regional groupings: Codes for groups of countries based on geographical, economic, or other criteria.
For example, the United States has an M49 code of 840, and the European Union has a code of 155. These codes facilitate the organization and comparison of statistical data across different countries and regions. Incorporating the M49 standard into GLUE projects allows for standardised analyses of virus sequence variation by global region of isolation.
The Entity-Relationship diagram below demonstrates the schema extension used in Dengue-GLUE.
The key elements involved in working with schema extensions are:
- Defining custom tables and fields
- Populating custom tables and fields
- Querying custom tables and fields
- Defining custom relational links
- Source and destination link names
- Populating link targets
- Querying link targets
- Bulk population of schema extensions
A project's schema is extended by executing commands in the Schema mode. In the Dengue-GLUE project, all schema extension commands are executed from the exampleSchemaExtensions.glue
file, with M.49-related extensions handled in the m49SchemaExtension.glue
file.
For example, to define custom tables for the regions, sub-regions, intermediate regions, and countries as per M.49, the following commands are used:
create custom-table m49_region
create custom-table m49_sub_region
create custom-table m49_intermediate_region
create custom-table m49_country
Each custom table can contain data objects that exist alongside standard GLUE objects like Sequences and Alignments.
Custom tables must have an ID field, which serves as the unique identifier for each row. Additional fields are created using the create field
command. For instance, the m49_country
table might be defined as follows:
table m49_country
create field m49_code INTEGER
create field display_name VARCHAR 100
create field full_name VARCHAR 100
create field is_ldc BOOLEAN
create field is_lldc BOOLEAN
create field is_sids BOOLEAN
create field development_status VARCHAR 20
exit
Custom fields can also be added to standard GLUE object types, using the same create field
command within the Table mode. In Dengue-GLUE, several custom fields are added to the Sequence table:
The Dengue-GLUE project contains a schema extension related which we will use for illustration. Firstly, various custom fields are added to the standard GLUE Sequence table. In addition Sequences may each be linked with a country of origin, which is defined as a custom object type. Countries are classified and grouped into geographical regions (also custom objects) based on the United Nations M.49 system. This could be used for example to analyse sequence variation in association with global region of origin. The schema extension is shown in the Entity-Relationship diagram below.
table sequence
create field gb_gi_number VARCHAR
create field gb_primary_accession VARCHAR
create field gb_accession_version VARCHAR
create field gb_locus VARCHAR
create field gb_organism VARCHAR
create field gb_create_date DATE
create field gb_update_date DATE
create field gb_taxonomy VARCHAR 200
create field gb_pubmed_id VARCHAR
create field length INTEGER
create field isolate VARCHAR
create field host_species VARCHAR 200
create field genotype VARCHAR 10
create field subtype VARCHAR 10
exit
Once defined, custom tables and fields can be populated with data using GLUE commands. For example, to add a new country "Ruritania" (ISO code RUR) to the m49_country
table:
GLUE> project example
GLUE> create custom-table-row m49_country RUR
GLUE> custom-table-row m49_country RUR
GLUE> set field full_name 'Kingdom of Ruritania'
GLUE> set field m49_code 998
GLUE> set field display_name Ruritania
GLUE> set field development_status developed
You can query custom objects and fields using commands such as show property
and list property
. For example, to query the properties of Ruritania:
GLUE> show property development_status
GLUE> list property
To list custom objects, like M.49 sub-regions, and configure result columns:
GLUE> list custom-table-row m49_sub_region id m49_code display_name
Relational links model relationships between objects, such as linking a Sequence to its country of origin. Links are created using the create link
command, specifying the source and destination tables:
create link sequence m49_country --multiplicity MANY_TO_ONE
Additional links can be defined, such as:
create link m49_country m49_region --multiplicity MANY_TO_ONE
create link m49_country m49_sub_region --multiplicity MANY_TO_ONE
These links align with the associations shown in the schema extension's entity-relationship diagram.
Relational links have source and destination link names, which are used for referencing the link in commands. For example, to filter sequences by country of origin:
GLUE> list sequence --whereClause "m49_country.m49_region.id = 'europe'" sequenceID m49_country.display_name
This command selects sequences where the country belongs to the Europe region, displaying their sequence IDs and country names.
Once a link is defined, objects in the project database can be associated via the link using specific GLUE commands in custom object mode.
The set link-target
command associates an object with the "ONE" end of a link. For example, to associate the country Ruritania with the Eastern Europe sub-region:
Mode path: /project/example
GLUE> custom-table-row m49_country RUR
OK
Mode path: /project/example/custom-table-row/m49_country/RUR
GLUE> set link-target m49_sub_region custom-table-row/m49_sub_region/eastern_europe
OK
(1 CustomTableObject_example_m49_country updated)
GLUE does not enforce any minimum multiplicity on custom link targets, so objects can have no associated link target. To undo an association, use the unset link-target
command:
Mode path: /project/example/custom-table-row/m49_country/RUR
GLUE> unset link-target m49_sub_region
OK
(1 CustomTableObject_example_m49_country updated)
The add link-target
command adds an object to the "MANY" end of a link. For instance, to associate the Eastern Europe sub-region with Ruritania:
Mode path: /project/example
GLUE> custom-table-row m49_sub_region eastern_europe
OK
Mode path: /project/example/custom-table-row/m49_sub_region/eastern_europe
GLUE> add link-target m49_country custom-table-row/m49_country/RUR
OK
(1 CustomTableObject_example_m49_sub_region updated)
To display the target on the "ONE" end of a link, use the show property
command:
Mode path: /project/example/custom-table-row/m49_country/RUR
GLUE> show property m49_sub_region
propertyValueResult
property: m49_sub_region
value: custom-table-row/m49_sub_region/eastern_europe
To list the targets on the "MANY" end of a link, use the list link-target
command:
Mode path: /project/example/custom-table-row/m49_sub_region/eastern_europe
GLUE> list link-target m49_country id display_name
+=====+==============+
| id | display_name |
+=====+==============+
| BGR | Bulgaria |
| BLR | Belarus |
| CZE | Czechia |
| HUN | Hungary |
| MDA | Moldova |
| POL | Poland |
| ROU | Romania |
| RUR | Ruritania |
| RUS | Russia |
| SVK | Slovakia |
| UKR | Ukraine |
+=====+==============+
CustomTableObject_example_m49_countrys found: 11
Alternatively, you can achieve a similar result using the list custom-table-row
command with a --whereClause
:
Mode path: /project/example
GLUE> list custom-table-row m49_country --whereClause "m49_sub_region.id = 'eastern_europe'" id display_name
While fine-grained commands are useful for testing and small-scale operations, bulk population of field values and link targets is often more practical for larger datasets.
The multi-set field
command allows setting a field value across a group of objects selected by a whereClause
.
Field values and link targets can be populated from GenBank XML using the genbankXmlPopulator
module or from tabular data using the textFilePopulator
module.
GLUE's scripting layer provides flexibility for bulk operations. For example, the following JavaScript program populates the m49_sub_region
table using a tabular data file:
var subRegions;
glue.inMode("module/exampleTabularUtilityCsv", function() {
subRegions = glue.tableToObjects(glue.command(["load-tabular", "tabular/m49_countries/m49_sub_regions.csv"]));
});
_.each(subRegions, function(subRegion) {
var subRegionName = subRegion["Sub-region Name"].trim();
var subRegionId = subRegionName.toLowerCase().replace(/[ -]/g, "_");
var m49Code = subRegion["Sub-region Code"].trim();
var regionName = subRegion["Region Name"].trim();
var regionId = regionName.toLowerCase();
glue.command(["create", "custom-table-row", "m49_sub_region", subRegionId]);
glue.inMode("custom-table-row/m49_sub_region/"+subRegionId, function() {
glue.command(["set", "field", "display_name", subRegionName]);
glue.command(["set", "field", "m49_code", m49Code]);
glue.command(["set", "link-target", "m49_region", "custom-table-row/m49_region/"+regionId]);
});
});
This approach allows you to automate and efficiently populate schema extensions, ensuring consistency and reducing the potential for manual errors.