Skip to content

Querying the GLUE Database

Robert J. Gifford edited this page Nov 25, 2024 · 10 revisions

GLUE allows powerful queries to be run on the stored data, using a few simple concepts and command patterns. These can be used simply to extract data subsets, to ask simple research questions, or as part of a more complex analysis. To understand this section it is useful to be familiar with the core schema and schema extension sections. We will also use the example project for illustration so it will be useful to have this installed.

  1. GLUE "where clauses"
  2. Property paths to specify table columns
  3. Property paths in FreeMarker templates


GLUE "where clauses"

Many GLUE commands have the same basic structure:

  1. Start from some particular set of candidate objects of the same type (i.e. in the same table)
  2. (Optionally) select a subset of candidate objects
  3. Perform some operation on the selected objects

GLUE "where" clauses are used in step 2 to select a subset of objects, using a logical filter. A command with this structure is the amino-acid frequency command. This command is executed in alignment mode, so a specific Alignment object is the data context.

Mode path: /
GLUE> project example
OK
Mode path: /project/example
GLUE> alignment AL_MASTER
OK
Mode path: /project/example/alignment/AL_MASTER
GLUE> amino-acid frequency -c -w "sequence.source.name = 'ncbi-hev-examples'" -r REF_MASTER_M73218 -f MT -l 56 66
+=======+===========+============+============+
| codon | aminoAcid | numMembers | pctMembers |
+=======+===========+============+============+
| 56    | V         | 10         | 100.00     |
| 57    | F         | 10         | 100.00     |
| 58    | R         | 10         | 100.00     |
| 59    | P         | 10         | 100.00     |
| 60    | E         | 10         | 100.00     |
| 61    | V         | 10         | 100.00     |
| 62    | L         | 7          | 70.00      |
| 62    | F         | 3          | 30.00      |
| 63    | W         | 10         | 100.00     |
| 64    | S         | 1          | 10.00      |
| 64    | N         | 9          | 90.00      |
| 65    | H         | 10         | 100.00     |
| 66    | P         | 10         | 100.00     |
+=======+===========+============+============+

In this case the Alignment AL_MASTER is the data context, and the command operates as follows:

  1. The candidate object set consists of all AlignmentMembers which belong to Alignment AL_MASTER, or (since -c was used) one of its descendent Alignments in the alignment tree
  2. Select a subset of these AlignmentMembers which pass the where clause filter specified by -w "sequence.source.name = 'ncbi-hev-examples'"
  3. Amongst the selected AlignmentMembers, compute the frequency of amino acid residues in the MT (Methyltransferase) genome region at ORF1 codon positions 56 to 66 inclusive (as specified by the -r, -f and -l options)

All the objects identified in step 1 will be tested in step 2 against the where clause; those which pass this filter will then contribute to the computation in step 3.

The where clause string sequence.source.name = 'ncbi-hev-examples' is supplied in double quotes in order to allow space and single quote characters to be used within the string.

The first part of the where clause is sequence.source.name. This is a property path. Starting from the type of the candidate object, in this case AlignmentMember, the property path traverses named relational links specified by the data schema in order to arrive at a data value. In this case the all links are specified in the core schema; the relevant schema fragment is shown. However any or all of them could also be specified by a schema extension.

  • Start from the AlignmentMember candidate object type
  • Traverse the sequence relational link to the target Sequence object
  • Traverse the source relational link to the target Source object
  • Access the value for the field name on the Source object

The where clause expression uses the binary operator '=' to compare the Source name with the literal string 'ncbi-hev-examples', so that only AlignmentMembers based on Sequences belongs to Source ncbi-hev-examples are selected. It is also possible to use parentheses and logical operators such as 'and', 'not' and 'or' to join sub clauses together. Here are some where clause patterns you could use in the example project:

Candidate object type Where clause Meaning Note Example command
Sequence length >= 7200 and length < 7300 The field value for length is at least 7200 and less than 7300 Numeric comparison operators may be used for INTEGER fields such as length, DOUBLE or DATE fields list sequence
Sequence length between 7200 and 7300 The field value for length is at least 7200 and at most 7300 The between ... and numeric comparison operator is for testing values within an inclusive range list sequence
m49_country is_ldc = true or is_lldc = true Country is classified as least developed country or landlocked developing country BOOLEAN fields such as is_ldc and is_lldc are tested for value true list custom-table-row
m49_country m49_region.id = 'europe' and m49_sub_region.id != 'northern_europe' Country region is Europe but sub-region is not Northern Europe The != comparison operator, which may be applied to field values of any type, is used for disequality, and the and logical operator conjoins two sub-clauses list custom-table-row
Sequence gb_create_date > #gluedate(01-Apr-2009) First submitted to GenBank after 1st April 2009 The #gluedate syntax specifies a concrete date, then numeric comparison operators may be used list sequence
Sequence m49_country.id in ('CHN', 'JPN') Country of origin China or Japan The in syntax allows comparison with any one of a fixed set of strings list sequence
Sequence sequenceID like 'FJ%' The sequenceID field value starts with FJ The like operator my be used for String fields such as sequenceID. If so, the '%' character acts as a wildcard, matching any number of characters. list sequence
ReferenceSequence alignmentsWhereRefSequence.name != null ReferenceSequence is the constraining reference sequence of at least one Alignment The alignmentsWhereRefSequence link is specified in one of the core schema diagrams. When traversing from ReferenceSequence to Alignment the link allows multiple targets, so the where clause is effectingly requiring that there exists at least one link target list reference
Sequence source.name = 'ncbi-hev-examples' and m49_country+.id = null Within Source ncbi-hev-examples but M.49 country is undefined The keyword null can be used to test for an undefined value in the database. In this case we must also use the outer join operator + on link m49_country since by default sub-clauses which traverse to a non-existent link target will return false. The operator allows such sub-clauses to evaluate to null list sequence

For those interested in implementation details, where clause strings are actually Apache Cayenne object path expressions.


Property paths to specify table columns

The property path concept introduced above has other uses in GLUE. One example use is to configure the result columns of various "list ..." commands such as list custom-table-row, list sequence or list member. Such commands output a table where each row is an instance of a specific given object type. Each such command has a default set of columns, for example list member will give alignment.name, sequence.source.name and sequence.sequenceID. Note that these column headings are themselves property paths starting at the AlignmentMember object type. When invoking this command, you can supply your own property paths, overriding the columns in the result table, for example:

GLUE> alignment AL_MASTER
OK
Mode path: /project/example/alignment/AL_MASTER
GLUE> list member --recursive --whereClause "sequence.source.name = 'ncbi-hev-examples'" sequence.sequenceID alignment.name sequence.host_species
+=====================+================+=======================+
| sequence.sequenceID | alignment.name | sequence.host_species |
+=====================+================+=======================+
| AB481226            | AL_3e          | -                     |
| AB591734            | AL_3a          | Herpestes javanicus   |
| AF444003            | AL_1b          | -                     |
| FJ705359            | AL_3c          | Sus scrofa            |
| FJ763142            | AL_4a          | Homo sapiens          |
| FJ998015            | AL_3e          | Sus scrofa            |
| JF443717            | AL_1c          | Homo sapiens          |
| JQ013791            | AL_3ra         | Oryctolagus cuniculus |
| JX855794            | AL_4b          | Sus scrofa            |
| KP294371            | AL_3i          | Sus scrofa            |
+=====================+================+=======================+
AlignmentMembers found: 10


Property paths in FreeMarker templates

Some GLUE modules use FreeMarker templates to assemble text strings. In some cases the model for the template may be a GLUE data object. Property paths may then be used to access the database from within the FreeMarker template.

A simple example is fastaAlignmentExporter. The ID string for each alignment row is formed using a FreeMarker template. The default behaviour is to use the following FreeMarker template to create an ID with three sections:

${alignment.name}.${sequence.source.name}.${sequence.sequenceID}

Each of the three sections is created by evaluating a property path from the AlignmentMember object that contributed the alignment row.


Clone this wiki locally