Why Some Gremlin Queries Run Faster than Others

Intro

With great power comes great responsibility.

--Spiderman's Uncle

The Gremlin language gives users great power in the form of traversal expressivity. With the dozens of steps available in the Gremlin language, developers can make their little Gremlin(s) jump around the graph every which way they like. However, not all paths that reach the same result are equal.

This article discusses Gremlin query optimization for DSE Graph. We will walk through examples of queries that just do the job, and improve upon them to get better performance while focusing on the reasoning and introspection tools required to identify bottlenecks and measure improvements. With this guidance, readers can reproduce comparable query tuning results in their own environments and achieve performant real-time transactional Gremlin queries.

Note: For the purposes of this article, a good query is one that gets the expected result. These "good" queries will be inefficient but we use them to highlight common mistakes and the techniques used to avoid them.

One Index at a time

DSE Graph queries use one index at a time. When designing a graph query, be conscious of the fact that DSE Graph’s current implementation allows it to take advantage of one index per query. Once that index is used up, the remaining execution will happen in memory at the DSE Graph server in the DSE JVM.

Grab the right index

Take the following "good" query:

g.V().has('state','name', 'california').
  in('lives_in').
  hasLabel('person').
  has('first_name', regex('Seb.+')).
  has('last_name', regex('Est.+'))

In the query above, we are performing regex matches on first_name and last_name (properties on the person vertex) and an exact match on the state name california. Since we are able to leverage one search index (the index on the state vertex or the index on the person vertex) we should consider which filter will return fewer values and perform that filter first.

For example, if we are expecting just a few individuals with names starting with Seb / Est in the dataset, would make sense to write a better query as follows:

g.V().hasLabel('person').  
  has('first_name', regex('Seb.+')).
  has('last_name', regex('Est.+')).as('people').
  out('lives_in').
  has('state','name', 'california').
  select('people')

Notice the use of the as / select steps to ensure that our result set still contains person vertices and not state vertices. Using as / select enables path tracking, which can be computationally expensive so an even better way (best) to write this query would be as follows:

g.V().hasLabel('person').  
  has('first_name', regex('Seb.+')).
  has('last_name', regex('Est.+')).
  where(
    out('lives_in').
    has('state','name', 'california')
)

The where step (subfilter) allows us to return the relevant person vertices without enabling path tracking.

Another way to tackle this 'order of traversal problem' is to leverage the Gremlin match step.

Splitting queries

Note: The following is one of the cases where as the DSE Graph optimizer improves; the difference between good, better, and best queries vanishes. The .or condition will combine the constraints into a single search query in the latest DSE so there is no need to manually optimize!

Sometimes which filter is more selective is not obvious and it makes sense to perform two queries instead of one.

In versions of DSE older than 5.1.3, this is true when an .or condition is present.

In the following "good" query, we have multiple filters (based off of different vertices) but DSE Graph can only execute the query in scan mode (There is a ticket to fix .or handling so this is only a near term problem).

g.V().  
  or(
    out('lives_in').has('name', 'california'),
    has('person','name', regex('Seb.+')),
    has('person', 'company', 'DataStax')
  ).
  values('first_name')

Another (better) way to obtain this result set would be to run two separate queries and glue them together app side:

g.V().has('state','name','california').
  in('lives_in').values('first_name')

g.V().has('person','first_name', regex('Seb.+')).
  has('company', 'DataStax').
  values('first_name')

Notice only two queries (not three) are necessary, because the first name and the works for filters can leverage the same search index on the person vertex.

We can also run the following Groovy statements to obtain the result set in a single call to the database (best).

def names = g.V().has('state','name','california').
  in('lives_in').values('first_name').toSet()


names += g.V().has('person','first_name', regex('Seb.+')).
  has('company', 'DataStax').
  values('first_name').toSet()

Paging in Gremlin

For queries that expect large result sets paging is required. We can accomplish paging by using the .range step. However, the .range step’s latency will degrade with the max value in the return parameters (not the number of values being returned). This is a "good" query in that it will eventually get you the results you need (as long as it doesn't time out!):

g.V().has('person', 'company', 'DataStax').
  where(out('lives_in').has('name','california')).
  has('first_name', regex('S.+')).
  range(0,1).values('first_name')

In the query above, we read 1010 values from DSE and throw 1000 of them away in-memory at the Graph server.
As an (best) alternative to .range for paging, we can use a search index on the person_id field for paging with DSE Search.

Note: We chose person_id because we needed a sortable and unique field we could use to page against. person_id is an integer and integers in DSE Search are indexed as Lucene TrieIntFields. Trie fields in Lucene leverage the trie data structure to allow range filtering and sorting.

In your app, hang on to the max value from the previous page:

def max = 1000

def nameSet = g.V().has('person', 'company', 'DataStax').
  where(out('lives_in').has('name', 'california')).
  has('first_name', regex('S.+')).values('person_id').toSet()


g.V().has('person','person_id', within(nameSet)).
  has('person_id', gt(max)).
  order().by('person_id').range(0,10)

Note: The Gremlin timeLimit steps allows you to set a limit on the time the Gremlin server will spend evaluating a request. When troubleshooting queries where the result set may be large, the timeLimit step is a great way to speed up your itterations.

Conclusion

With Gremlin, DSE Graph provides a powerful interaction mechanism capable of turning graph data into insights. Having come up with queries that work, developers can spend time optimizing them to ensure performance that falls within their SLAs.

Remember that as the optimizer in DSE Graph matures, some of these optimizations will start to happen on their own and the difference between a good, a better, and a best Gremlin query will shrink and may eventually disappear.