spark-dev mailing list archives

From Kushal Datta <kushal.da...@gmail.com>
Subject Re: Implementing TinkerPop on top of GraphX
Date Thu, 20 Nov 2014 18:59:02 GMT
I have also added a graphx-gremlin module to the Tinkerpop3 codebase. Right
now a GraphX graph can be instantiated from the Gremlin command line (in the
same way a Giraph graph is instantiated), and the g.V().count() function
calls the count() method on RDDs.
Please check out the code in:
https://github.com/kdatta/tinkerpop3/tree/graphx-gremlin
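
As a rough illustration of that delegation (this is not code from the branch
above), here is a minimal Python sketch. LocalRDD, VertexTraversal and
TraversalSource are hypothetical names invented for the example; LocalRDD is
an in-memory stand-in for an RDD and models only count().

```python
class LocalRDD:
    """Hypothetical in-memory stand-in for an RDD; only count() is modelled."""

    def __init__(self, items):
        self._items = list(items)

    def count(self):
        return len(self._items)


class VertexTraversal:
    """Hypothetical traversal over the vertex collection."""

    def __init__(self, vertices):
        self._vertices = vertices

    def count(self):
        # The traversal step does no counting itself; it delegates to the
        # count() of the collection it wraps.
        return self._vertices.count()


class TraversalSource:
    """Hypothetical 'g' object exposing V() as the start of a traversal."""

    def __init__(self, vertices, edges):
        self._vertices = vertices
        self._edges = edges

    def V(self):
        return VertexTraversal(self._vertices)


g = TraversalSource(
    vertices=LocalRDD([(1, "marko"), (2, "vadas"), (3, "lop")]),
    edges=LocalRDD([(1, 2, "knows"), (1, 3, "created")]),
)
print(g.V().count())
```

The point of the sketch is only that the query-language step stays a thin
wrapper, so the collection's own count() does the work.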

@Kyle, I'm off for a few days till Thanksgiving. After that I'll try the
EdgeIterator in this code.

Thanks,
-Kushal.

On Tue, Nov 18, 2014 at 2:23 PM, Kyle Ellrott <kellrott@soe.ucsc.edu> wrote:

> The new Tinkerpop3 API was different enough from V2 that it was worth
> starting a new implementation rather than trying to completely refactor my
> old code.
> I've started a new project: https://github.com/kellrott/spark-gremlin
> which compiles and runs the first set of unit tests (which it completely
> fails). Most of the classes are structured in the same way they are in the
> Giraph implementation. There isn't much actual GraphX code in the project
> yet, just a framework to start working in.
> Hopefully this will keep the conversation going.
>
> Kyle
>
> On Fri, Nov 7, 2014 at 11:17 AM, Kushal Datta <kushal.datta@gmail.com>
> wrote:
>
>> I think if we are going to use GraphX as the query engine in Tinkerpop3,
>> then the Tinkerpop3 community is the right platform to further the
>> discussion.
>>
>> The reason I asked the question on improving APIs in GraphX is that this
>> need not be limited to Gremlin: any graph DSL can exploit the GraphX APIs.
>> Cypher has some good subgraph matching query interfaces which I believe
>> can be distributed using GraphX APIs.
>>
>> An edge ID is an internal attribute of the edge generated automatically,
>> mostly hidden from the user. That's why adding it as an edge property might
>> not be a good idea. There are several little differences like this. E.g. in
>> the Tinkerpop3 Gremlin implementation for Giraph, only vertex programs are
>> executed in Giraph directly. The side-effect operators are mapped to
>> Map-Reduce functions. In the implementation we are talking about, all of
>> these operations can be done within GraphX. I would be interested in
>> co-developing the query engine.
>>
>> @Reynold, I agree. And as I said earlier, the APIs should be designed in
>> such a way that they can be used by any graph DSL.
>>
>> On Fri, Nov 7, 2014 at 10:59 AM, Kyle Ellrott <kellrott@soe.ucsc.edu>
>> wrote:
>>
>>> Who here would be interested in helping to work on an implementation of
>>> the Tinkerpop3 Gremlin API for Spark? Is this something that should continue
>>> in the Spark discussion group, or should it migrate to the Gremlin message
>>> group?
>>>
>>> Reynold is right that there will be inherent mismatches in the APIs, and
>>> there will need to be some discussions with the GraphX group about the best
>>> way to go. One example would be edge ids. GraphX has vertex ids, but no
>>> explicit edge ids, while Gremlin has both. Edge ids could be put into the
>>> attr field, but then that means the user would have to explicitly subclass
>>> their edge attribute to the edge attribute interface. Is that worth doing,
>>> versus adding an id to everyone's edges?
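
A minimal Python sketch of the first option, carrying the edge id inside the
attr field. The Edge record mirrors the (source id, destination id, attribute)
shape of a GraphX edge; IdentifiedAttr and with_edge_ids are hypothetical
names invented for this illustration, and sequential ids are an assumption,
not a proposal for how ids should be generated.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Edge:
    """Edge record with no id of its own: source, destination, attribute."""
    src_id: int
    dst_id: int
    attr: Any


@dataclass(frozen=True)
class IdentifiedAttr:
    """Hypothetical wrapper that carries an edge id next to the user's value."""
    edge_id: int
    value: Any


def with_edge_ids(edges: List[Edge], start: int = 0) -> List[Edge]:
    """Wrap each edge attribute so that every edge gets a distinct id."""
    return [
        Edge(e.src_id, e.dst_id, IdentifiedAttr(start + i, e.attr))
        for i, e in enumerate(edges)
    ]


# Two parallel edges are indistinguishable until an id is attached.
raw = [Edge(1, 2, "knows"), Edge(1, 2, "knows")]
tagged = with_edge_ids(raw)
print(raw[0] == raw[1], tagged[0] == tagged[1])
```

The cost raised above is visible here: every consumer now has to unwrap
attr.value to reach the original attribute.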
>>>
>>> Kyle
>>>
>>>
>>> On Thu, Nov 6, 2014 at 7:24 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>
>>>> Some form of graph querying support would be great to have. This can be
>>>> a great community project hosted outside of Spark initially, both due to
>>>> the maturity of the component itself as well as the maturity of query
>>>> language standards (there isn't really a dominant standard for graph ql).
>>>>
>>>> One thing is that the GraphX API will need to evolve and will probably need
>>>> to provide more primitives in order to support the new ql implementation.
>>>> There might also be inherent mismatches in the way the external API is
>>>> defined vs what GraphX can support. We should discuss those on a
>>>> case-by-case basis.
>>>>
>>>>
>>>> On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott <kellrott@soe.ucsc.edu>
>>>> wrote:
>>>>
>>>>> I think it's best to look to an existing standard rather than try to make
>>>>> your own. Of course small additions would need to be added to make it
>>>>> valuable for the Spark community, like a method similar to Gremlin's
>>>>> 'table' function, that produces an RDD instead.
>>>>> But there may be a lot of extra code and data structures that would
>>>>> need to be added to make it work, and those may not be directly applicable
>>>>> to all GraphX users. I think it would be best run as a separate
>>>>> module/project that builds directly on top of GraphX.
>>>>>
>>>>> Kyle
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon <
>>>>> Brennon.York@capitalone.com> wrote:
>>>>>
>>>>>> My personal 2c is that, since GraphX is just beginning to provide a
>>>>>> full featured graph API, I think it would be better to align with the
>>>>>> TinkerPop group rather than roll our own. In my mind the benefits outweigh
>>>>>> the detriments as follows:
>>>>>>
>>>>>> Benefits:
>>>>>> * GraphX gains the ability to become another core tenant within the
>>>>>> TinkerPop community allowing a more diverse group of users into the Spark
>>>>>> ecosystem.
>>>>>> * TinkerPop can continue to maintain and own a solid / feature-rich
>>>>>> graph API that has already been accepted by a wide audience, relieving the
>>>>>> pressure of “one off” API additions from the GraphX team.
>>>>>> * GraphX can demonstrate its ability to be a key player in the
>>>>>> GraphDB space sitting in line with other major distributions (Neo4j, Titan,
>>>>>> etc.).
>>>>>> * Allows for the abstract graph traversal logic (query API) to be
>>>>>> owned and maintained by a group already proven on the topic.
>>>>>>
>>>>>> Drawbacks:
>>>>>> * GraphX doesn’t own the API for its graph query capability. This
>>>>>> could be seen as good or bad, but it might make GraphX-specific
>>>>>> implementation additions more tricky (possibly). Also, GraphX will need to
>>>>>> maintain the features described within the TinkerPop API as that might
>>>>>> change in the future.
>>>>>>
>>>>>> From: Kushal Datta <kushal.datta@gmail.com>
>>>>>> Date: Thursday, November 6, 2014 at 4:00 PM
>>>>>> To: "York, Brennon" <brennon.york@capitalone.com>
>>>>>> Cc: Kyle Ellrott <kellrott@soe.ucsc.edu>, Reynold Xin <
>>>>>> rxin@databricks.com>, "dev@spark.apache.org" <dev@spark.apache.org>,
>>>>>> Matthias Broecheler <matthias@thinkaurelius.com>
>>>>>>
>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>
>>>>>> Before we dive into the implementation details, what are the high
>>>>>> level thoughts on Gremlin/GraphX? Scala already provides the procedural
>>>>>> way to query graphs in GraphX today. So, today I can run
>>>>>> g.vertices().filter().join() queries as OLAP in GraphX just like
>>>>>> Tinkerpop3 Gremlin, of course sans the useful operators that Gremlin
>>>>>> offers such as outE, inE, loop, as, dedup, etc. In that case, is mapping
>>>>>> Gremlin operators to GraphX APIs a better approach, or should we extend
>>>>>> the existing set of transformations/actions that GraphX already offers
>>>>>> with the useful operators from Gremlin? For example, we add as(), loop()
>>>>>> and dedup() methods in VertexRDD and EdgeRDD.
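
To make the second option concrete, here is a minimal Python sketch of two
Gremlin-style operators written as plain functions over vertex and edge
collections. Lists and dicts stand in for VertexRDD and EdgeRDD, and out_e and
dedup are hypothetical helpers for illustration only; they are not existing
GraphX methods.

```python
def out_e(vertex_ids, edges):
    """Return the edges whose source is one of the given vertices (like outE)."""
    ids = set(vertex_ids)
    return [e for e in edges if e[0] in ids]


def dedup(items):
    """Drop repeated items while keeping first-seen order (like dedup)."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Vertices map id -> label; edges are (source id, destination id, label).
vertices = {1: "person", 2: "person", 3: "software"}
edges = [(1, 3, "created"), (2, 3, "created"), (1, 2, "knows")]

# filter -> outE -> destination vertex -> dedup
people = [vid for vid, label in vertices.items() if label == "person"]
targets = [dst for (_, dst, _) in out_e(people, edges)]
print(dedup(targets))
```

Written this way the operators compose with ordinary filter and map calls,
which is the appeal of extending the collections rather than translating a
separate query language.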
>>>>>>
>>>>>> Either way we get a desperately needed graph query interface in
>>>>>> GraphX.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon <
>>>>>> Brennon.York@capitalone.com> wrote:
>>>>>>
>>>>>>> This was my thought exactly with the TinkerPop3 release. Looks like,
>>>>>>> to move this forward, we’d need to implement gremlin-core per
>>>>>>> <http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>>>>>> The real question lies in whether GraphX can only support the OLTP
>>>>>>> functionality, or if we can bake into it the OLAP requirements as well.
>>>>>>> At a first glance I believe we could create an entire OLAP system. If so,
>>>>>>> I believe we could do this in a set of parallel subtasks, those being the
>>>>>>> implementation of each of the individual APIs (Structure, Process, and,
>>>>>>> if OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>>>>>>>
>>>>>>>
>>>>>>> From: Kyle Ellrott <kellrott@soe.ucsc.edu>
>>>>>>> Date: Thursday, November 6, 2014 at 12:10 PM
>>>>>>> To: Kushal Datta <kushal.datta@gmail.com>
>>>>>>> Cc: Reynold Xin <rxin@databricks.com>, "York, Brennon"
>>>>>>> <brennon.york@capitalone.com>, "dev@spark.apache.org"
>>>>>>> <dev@spark.apache.org>, Matthias Broecheler
>>>>>>> <matthias@thinkaurelius.com>
>>>>>>> Subject: Re: Implementing TinkerPop on top of GraphX
>>>>>>>
>>>>>>> I still have to dig into the Tinkerpop3 internals (I started my work
>>>>>>> long before it had been released), but I can say that getting the
>>>>>>> Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The
>>>>>>> whole Tinkerpop2 Gremlin design was based around streaming pipes of
>>>>>>> data, rather than large distributed map-reduce operations. I had to hack
>>>>>>> the pipes to aggregate all of the data and pass a single object wrapping
>>>>>>> the GraphX RDDs down the pipes in a single go, rather than streaming it
>>>>>>> element by element.
>>>>>>> Just based on their description, Tinkerpop3 may be more amenable to
>>>>>>> the Spark platform.
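
The difference between the two pipe styles can be sketched in a few lines of
Python. This is an illustration of the idea only, not the sparkgraph code:
streaming_pipe, BulkPipe and run_bulk are hypothetical names, and a plain list
stands in for the object wrapping the GraphX RDDs.

```python
def streaming_pipe(step, source):
    """Streaming style: elements flow through the step one at a time."""
    for element in source:
        yield step(element)


class BulkPipe:
    """Bulk style: each pipe receives the whole dataset as a single object."""

    def __init__(self, transform):
        self._transform = transform

    def process(self, dataset):
        return self._transform(dataset)


def run_bulk(pipes, dataset):
    """Hand the complete dataset from one pipe to the next in a single go."""
    for pipe in pipes:
        dataset = pipe.process(dataset)
    return dataset


vertex_ids = [1, 2, 3]

# Streaming: the step sees one vertex id per call.
streamed = list(streaming_pipe(lambda v: v * 10, iter(vertex_ids)))

# Bulk: each pipe sees, and returns, the entire collection.
bulk = run_bulk(
    [
        BulkPipe(lambda ds: [v for v in ds if v > 1]),
        BulkPipe(lambda ds: [v * 10 for v in ds]),
    ],
    vertex_ids,
)
print(streamed, bulk)
```

In the bulk style each transform could be a distributed operation over the
whole collection, which is why it fits a map-reduce engine better than
element-by-element streaming.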
>>>>>>>
>>>>>>> Kyle
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta <
>>>>>>> kushal.datta@gmail.com> wrote:
>>>>>>>
>>>>>>>> What do you guys think about the Tinkerpop3 Gremlin interface?
>>>>>>>> It has MapReduce to run Gremlin operators in a distributed manner
>>>>>>>> and Giraph to execute vertex programs.
>>>>>>>>
>>>>>>>> Tinkerpop3 is better suited for GraphX.
>>>>>>>>
>>>>>>>> On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott <
>>>>>>>> kellrott@soe.ucsc.edu> wrote:
>>>>>>>>
>>>>>>>>> I've taken a crack at implementing the TinkerPop Blueprints API in
>>>>>>>>> GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented
>>>>>>>>> portions of the Gremlin Search Language and a Parquet based graph store.
>>>>>>>>> I've been working to finalize some code details and put together
>>>>>>>>> better code examples and documentation before I start telling people
>>>>>>>>> about it.
>>>>>>>>> But if you want to start looking at the code, I can answer any questions
>>>>>>>>> you have. And if you would like to contribute, I would really appreciate
>>>>>>>>> the help.
>>>>>>>>>
>>>>>>>>> Kyle
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin <rxin@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > cc Matthias
>>>>>>>>> >
>>>>>>>>> > In the past we talked with Matthias and there were some discussions about
>>>>>>>>> > this.
>>>>>>>>> >
>>>>>>>>> > On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon <
>>>>>>>>> > Brennon.York@capitalone.com>
>>>>>>>>> > wrote:
>>>>>>>>> >
>>>>>>>>> > > All, was wondering if there had been any discussion around this topic
>>>>>>>>> > > yet? TinkerPop <https://github.com/tinkerpop> is a great abstraction for
>>>>>>>>> > > graph databases and has been implemented across various graph database
>>>>>>>>> > > backends and is gaining traction. Has anyone thought about integrating the
>>>>>>>>> > > TinkerPop framework with GraphX to enable GraphX as another backend? Not
>>>>>>>>> > > sure if this has been brought up or not, but would certainly volunteer to
>>>>>>>>> > > spearhead this effort if the community thinks it to be a good idea!
>>>>>>>>> > >
>>>>>>>>> > > As an aside, wasn't sure if this discussion should happen on the board
>>>>>>>>> > > here or on JIRA, but I made a ticket as well for reference:
>>>>>>>>> > > https://issues.apache.org/jira/browse/SPARK-4279
>>>>>>>>> > >
>>>>>>>>> > > ________________________________________________________
>>>>>>>>> > >
>>>>>>>>> > > The information contained in this e-mail is confidential and/or
>>>>>>>>> > > proprietary to Capital One and/or its affiliates. The information
>>>>>>>>> > > transmitted herewith is intended only for use by the individual or entity
>>>>>>>>> > > to which it is addressed.  If the reader of this message is not the
>>>>>>>>> > > intended recipient, you are hereby notified that any review,
>>>>>>>>> > > retransmission, dissemination, distribution, copying or other use of, or
>>>>>>>>> > > taking of any action in reliance upon this information is strictly
>>>>>>>>> > > prohibited. If you have received this communication in error, please
>>>>>>>>> > > contact the sender and delete the material from your computer.
>>>>>>>>> > >
>>>>>>>>> > > ---------------------------------------------------------------------
>>>>>>>>> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>> > > For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>> > >
>>>>>>>>> > >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
