Gremlin, automatic use of indexes on where() steps #1168

marco-brandizi · 2023-07-12T11:44:04Z

Hi all,

I thought I'd ask the ArcadeDB people about this issue, after realising that the core of the problem may be about using indexes with Gremlin.

To summarise it, I have edges that are 'ternary relations', ie, in addition to the usual outgoing/incoming vertexes, the relation has an 'evidence' attribute, which contains the ID of a third 'EvidenceType' vertex (which could describe something like: manually curated, imported from <ref>, text mining).

The intuitive query for this (see my own answer) is traversing the edges of interest, adding another traversal over V() and using where() to match the edge's attribute to the vertex ID in the second traversal. This is pretty similar to multiple JOINs in SQL or multiple MATCHes in Cypher. However, ArcadeDB seems to resolve it with a full scan of the evidence types for each matched edge in the second part. Apparently, it uses neither index (the one on the vertex ID or the one on the edge property), nor does it cache already-found vertices (for 100 edges, there are just 4-5 linked evidences).
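
To make the suspected cost difference concrete, here is a minimal Python sketch of the two strategies. It is a toy in-memory model, not how ArcadeDB or TinkerPop actually execute the traversal; the data and the function names `join_by_scan` and `join_by_index` are made up for illustration.

```python
# Toy in-memory model of the join; NOT ArcadeDB's or TinkerPop's implementation.
# 100 edges that reference only 4 distinct evidence vertices (made-up data).
edges = [
    {"protein": f"P{i}", "complex": f"C{i % 7}", "evidence": f"ns:ev{i % 4}"}
    for i in range(100)
]
evidences = [{"iri": f"ns:ev{k}", "label": f"Evidence {k}"} for k in range(4)]


def join_by_scan(edges, evidences):
    """Nested-loop join: every edge triggers a full scan of the evidence vertices."""
    comparisons = 0
    rows = []
    for e in edges:
        for ev in evidences:
            comparisons += 1
            if ev["iri"] == e["evidence"]:
                rows.append((e["protein"], e["complex"], ev["label"]))
    return rows, comparisons


def join_by_index(edges, evidences):
    """Index join: build an iri -> vertex lookup once, then probe it once per edge."""
    index = {ev["iri"]: ev for ev in evidences}
    lookups = 0
    rows = []
    for e in edges:
        lookups += 1
        ev = index.get(e["evidence"])
        if ev is not None:
            rows.append((e["protein"], e["complex"], ev["label"]))
    return rows, lookups


scan_rows, scan_cost = join_by_scan(edges, evidences)
index_rows, index_cost = join_by_index(edges, evidences)
print(scan_cost, index_cost)
```

Both functions return the same rows; with 100 edges and 4 evidence vertices, the scan makes 400 comparisons while the indexed version makes 100 lookups. The gap grows with the number of evidence vertices, which is the behaviour I suspect I'm seeing.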

I understand that's usually the way Gremlin is implemented, but I'd like to investigate more: is there a way to tell the Gremlin engine to use the indexes? Is there a way to write that query in pure Gremlin, without using the map-based approach that I discovered later (again see the last part of my answer)? That approach isn't good when the in-memory intermediate result that you need to accumulate is too big.

Am I missing something in Gremlin (I'm quite new to it)?

Note that "model it differently" isn't possible in this case, and also I'm investigating the potential of Gremlin with various data models, including this one, which occurs quite often in my application domain.

lvca · 2023-07-12T15:18:21Z

Maintainer

Hi @marco-brandizi, I'd like to check what's going on under the hood and why the index is not used. Could you share a minimal database where I can run your query and see what happens? Even a database with 1 record per type would be enough.

marco-brandizi Jul 15, 2023
Author

It's surprising, because to me this is quite a common case, yet it seems hard to describe here or on SO, and I can't find a similar case in the Gremlin documentation.

Anyway, this is what the data look like, as I tried to describe on SO:

v1: Protein{prefName: 'QP1'} 
  -- r1: is_part_of{evidence: 'ns:testdb'} 
  --> v2: Protcmplx{prefName: 'P12 Complex'}
ev: EvidenceType{ iri = "ns:testdb", label = "Test Database" }

that is, each is_part_of edge links a Protein vertex to a Protcmplx vertex (ie, it's saying the protein is part of some aggregate of proteins), then, conceptually a third link goes from the same is_part_of edge to an EvidenceType vertex. The way the latter is modelled is that the edge's property 'evidence' is the same as the EvidenceType's property 'iri' (primary key for vertexes).

So, what I'm trying to do is listing tuples of Protein(prefName)/complex(prefName)/evidence(label).

So far, the only way I found to do so in Gremlin is:

- either collect unique 'evidence' values from the edges and then use this smaller list to query EvidenceType(s) in a lambda
- or, use two traversals, one for the part-of part and the other for EvidenceType, and then join the two by matching the corresponding is_part_of.evidence/EvidenceType.iri.

The former is very fast, but it builds a map in memory, so it's not very good when many evidences are expected in the results. The latter is still slow, presumably because it does a full scan of the evidences for each is_part_of edge that it meets.
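
The first (map-based) approach can be sketched in Python as follows. This only illustrates the idea and its memory trade-off, it is not the Gremlin lambda itself; `fetch_evidence` is a hypothetical stand-in for one query against the EvidenceType vertices, and the records other than the sample above are invented.

```python
# Toy data modelled on the sample above; 'ns:textmining' is an invented record.
evidence_store = {
    "ns:testdb": {"label": "Test Database"},
    "ns:textmining": {"label": "Text Mining"},
}
edges = [
    {"protein": "QP1", "complex": "P12 Complex", "evidence": "ns:testdb"},
    {"protein": "QP2", "complex": "P12 Complex", "evidence": "ns:testdb"},
    {"protein": "QP3", "complex": "P34 Complex", "evidence": "ns:textmining"},
]


def fetch_evidence(iri, store, stats):
    """Hypothetical stand-in for one query against the EvidenceType vertices."""
    stats["queries"] += 1
    return store.get(iri)


def resolve_with_map(edges, store):
    stats = {"queries": 0}
    # Pass 1: collect the distinct 'evidence' values found on the edges.
    distinct_iris = {e["evidence"] for e in edges}
    # Pass 2: query each distinct value once and keep the labels in memory.
    labels = {}
    for iri in distinct_iris:
        ev = fetch_evidence(iri, store, stats)
        if ev is not None:
            labels[iri] = ev["label"]
    # Pass 3: decorate every edge using the in-memory map.
    rows = [(e["protein"], e["complex"], labels.get(e["evidence"])) for e in edges]
    return rows, stats["queries"], len(labels)


rows, queries, map_size = resolve_with_map(edges, evidence_store)
```

The number of queries equals the number of distinct evidence values rather than the number of edges, but the `labels` map has to hold one entry per distinct value, which is the part that worries me when there are many of them.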

In Cypher, it works like this:

MATCH (cpx:Protcmplx) <- [ pr:is_part_of ] - (p:Protein), (ev:EvidenceType)
WHERE pr.evidence = ev.iri
RETURN pr, ev
LIMIT 10

which can be tried here (leave credentials blank), and it is fast enough, despite running against a much bigger dataset.

Coming to your proposed changes (thank you so much, very instructive!):

- inE('is_part_of') makes things a bit faster, but doesn't change much about the full scan/Cartesian product that seems to be happening in where()
- has() can't work as you say, because I need to match the attributes, so I've tried this instead:

g.V().hasLabel('Concept:Protcmplx:Resource').as('cpx')
  .inE('is_part_of').limit(10).as('pr')
  .outV().hasLabel('Concept:Protein:Resource').as('p')
  .V().hasLabel('EvidenceType:Resource').as('ev')
  .has('iri', select('pr').by('evidence'))
  .select('p', 'cpx', 'ev')
    .by('prefName')
    .by('prefName')
    .by('label')

But it's still around 1.2s for 10 results.

Thanks again!

lvca Jul 15, 2023
Maintainer

Sorry, I think I didn't read the whole SO thread then. Got it, you are building a ternary relationship by using the edges. With ArcadeDB you could put a link to the EvidenceType, so you don't have to do the join, but just traverse it from the edge. By default, an edge has @in and @out, but you could have multiple links to other vertices, edges, or documents.

Without such a change, the equivalent of that Cypher query in ArcadeDB SQL would use a subquery:

select $resource, $relationship, $evidence
from `Concept:Protcmplx:Resource`
let relationship = inE('is_part_of'), 
    resource = $relationship.outV()[@type = 'Concept:Protein:Resource'],
    evidence = ( select from `EvidenceType:Resource` where iri in $parent.relationship.evidence )
limit 10

This takes about 27ms on my Mac (please update to the latest main branch to have the latest fix on studio).

But if you're able to convert the edge's evidence property into an actual link to the EvidenceType vertex (I guess during the import phase), then the query would be much faster and would look like this:

select $resource, $relationship, $evidence
from `Concept:Protcmplx:Resource`
let relationship = inE('is_part_of'), 
    resource = $relationship.outV()[@type = 'Concept:Protein:Resource'],
    evidence = ( select from $parent.relationship.evidence )
limit 10
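
As a rough picture of why the link version avoids the join, here is a small Python sketch in which the edge stores a reference to the evidence record instead of its iri string. The class and field names are illustrative assumptions and do not reflect ArcadeDB's storage format.

```python
from dataclasses import dataclass


@dataclass
class EvidenceType:
    iri: str
    label: str


@dataclass
class PartOfEdge:
    out_vertex: str          # source vertex (the Protein), like @out
    in_vertex: str           # target vertex (the Protcmplx), like @in
    evidence: EvidenceType   # direct link to the record, not the iri string


testdb = EvidenceType("ns:testdb", "Test Database")
r1 = PartOfEdge("QP1", "P12 Complex", testdb)

# No lookup by iri is needed: the label is reached by following the link.
row = (r1.out_vertex, r1.in_vertex, r1.evidence.label)
```

Resolving the evidence then costs one dereference per edge, regardless of how many EvidenceType records exist.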

marco-brandizi Jul 15, 2023
Author

That's very good to know, though for the moment I'm trying to assess how easy to use and performant Gremlin is with data models that are common in the life sciences. Essentially, some collaborators and I are working on an extension of this work. By the way, if you are interested, I'd be happy to collaborate on it with the ArcadeDB people!

lvca Jul 17, 2023
Maintainer

Absolutely! Is the dataset the same as the one you shared? Do you have the Neo4j Cypher queries, so we can convert them and give them a spin on ArcadeDB?

lvca Jul 17, 2023
Maintainer

Also, looking at this diagram, I see that you use edge types (like part_of and consumed_by) to connect different vertex types.

This can be highly optimized in ArcadeDB by defining specific edge types:

create edge type part_of
create edge type transport_part_of extends part_of
create edge type reaction_part_of extends part_of

Then use transport_part_of for edges from the Transport vertex and reaction_part_of for edges from Reaction. With polymorphism, you can still use the generic "part_of", but now you can traverse directly by the edge subtype instead of checking the target vertex.

Like this part:

inE('is_part_of').outV()[@type = 'Concept:Protein:Resource']

Could be just written as:

inE('protein_is_part_of').outV()

This works if you use protein_is_part_of only to connect Protein to Protcmplx, and it avoids fetching the vertex and checking its type.
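
The idea can be sketched in Python with a small class hierarchy standing in for the edge types. This is only an analogy for edge-type polymorphism, not ArcadeDB code; the vertex ids and the sample edges are invented.

```python
class Edge:
    """Base edge: knows only the ids of its endpoints."""

    def __init__(self, out_id, in_id):
        self.out_id, self.in_id = out_id, in_id


class IsPartOf(Edge):
    """Generic edge type, comparable to is_part_of."""


class ProteinIsPartOf(IsPartOf):
    """Edge subtype used only from Protein vertices, like protein_is_part_of."""


# Invented sample: three edges coming into the complex 'c1'.
vertex_types = {"p1": "Protein", "p2": "Protein", "t1": "Transport"}
incoming = [ProteinIsPartOf("p1", "c1"), IsPartOf("t1", "c1"), ProteinIsPartOf("p2", "c1")]


def proteins_by_vertex_check(edges):
    """Generic edge type: inspect the source vertex's type for every edge."""
    type_checks, result = 0, []
    for e in edges:
        if isinstance(e, IsPartOf):
            type_checks += 1
            if vertex_types[e.out_id] == "Protein":
                result.append(e.out_id)
    return result, type_checks


def proteins_by_edge_type(edges):
    """Edge subtype: the edge's own type already says the source is a Protein."""
    return [e.out_id for e in edges if isinstance(e, ProteinIsPartOf)], 0


checked, n_checks = proteins_by_vertex_check(incoming)
typed, n_typed_checks = proteins_by_edge_type(incoming)
```

Both calls return the same proteins, but the subtype version never looks at the vertex type, and the generic `IsPartOf` still matches all three edges when a polymorphic traversal is wanted.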
