tinkerpop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marko A. Rodriguez (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TINKERPOP-1250) Support Subgraph-Centric GraphComputer
Date Wed, 06 Apr 2016 17:14:25 GMT
Marko A. Rodriguez created TINKERPOP-1250:

             Summary: Support Subgraph-Centric GraphComputer
                 Key: TINKERPOP-1250
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1250
             Project: TinkerPop
          Issue Type: New Feature
          Components: process
    Affects Versions: 3.2.0-incubating
            Reporter: Marko A. Rodriguez

Right now, {{GraphComputer}} and {{VertexPrograms}} are "vertex-centric." That is, the boundary
of an atomic unit of computation is a single vertex, its properties, and its incident edges.
We should support "subgraph-centric" computing where each worker's partition is loaded in
memory (RAM) as a connected subgraph. For those vertices that are not in its partition, a
special "reference vertex" is use to reference it. What this would mean is that when a {{Traverser}}
is processing, it can continue to evaluate as long as its within the subgraph partition. The
moment it references a vertex not in the partition (a "reference vertex"), it serializes itself
as a message.

This would greatly increase the speed of Gremlin OLAP at the cost of requiring a large amount
of memory to store all the worker partition subgraphs in RAM. There might even be a way to
have a hybrid-model where some of the partition is held in RAM and the other is (even though
still in the same partition) is stored as "star vertices."

How would this be added to {{GraphComputer}} in a backwards compatible way?

GraphComputer.supportsVertexCentricComputing)() // currently true for all implementations

A {{VertexProgram}} could then have the following method:

boolean VertexProgram.withinCentricity(final M message)

If the message is NOT within "centricity", then it is serialized and distributed, else it
continues to execute.

I haven't thought through all the API and implementation considerations. Though it would be
good to make this backwards compatible and, moving forward, able to support "edge-centric
computing" and thus, have a very memory limited OLAP system.

How to think of the different models:

1. Vertex-centric: medium speed, medium expressivity, medium memory costs.
2. Subgraph-centric: high speed, high expressivity, high memory costs.
3. Edge-centric: high speed, low expressivity, low memory costs.

What is "expressivity"? Well, subgraph-centric computations can have "local traversals" that
move beyond the "star graph." Edge-centric would not support any `by()`-modulators as the
computation is bound to a single edge.  Thus, low expressivity. It really depends on how you
represent the edge-centric edge list. Do you have the vertices on each side of an edge duplicated
with all their properties? This wold still be low memory costs, but you could get more expressivity.

Anywho, in general it would be nice if the underlying execution engine can handle the three
common distributed graph computing paradigms. Subgraph-centric seems the easiest to support
at this point in time.

This message was sent by Atlassian JIRA

View raw message