This addresses all the points we discussed:
https://reviews.apache.org/r/9732/
I went with Fastutil because it fares well in benchmarks, has a reasonable
API, and still allows for access to internals (for example, in the
arraybacked version, I was able to make removing a single edge O(1) by
simply replacing it with the rightmost one).
On 3/4/13 4:44 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>We could *shudder* write templates for basic abstract primitive Vertex
>classes (Abstract$KeyType$ValueTypeVertex.java.t ...) a la Mahout
>collections...
>
>
>On Mon, Mar 4, 2013 at 3:03 PM, Claudio Martella
><claudio.martella@gmail.com
>> wrote:
>
>> I guess we could do some crazy reflection code and get the types as
>> parameters... i also do feel that sometimes Long ids is overkill, for
>> example.
>>
>>
>> On Mon, Mar 4, 2013 at 11:52 PM, Alessandro Presta <alessandro@fb.com
>> >wrote:
>>
>> >
>> >
>> > On 3/4/13 2:47 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
>> > wrote:
>> >
>> > >The interface you propose sounds very reasonable. +1 for that.
>> > >
>> > >For floats vs double, they consume much less memory, O(E)
>>reduction,
>> and
>> > >there is basically never any need to have a double precision on edges
>> > >because they are immutable.
>> > >Anyway it's not a strong constraint, double edges would work as well.
>> >
>> > Sure, makes sense. The question here is whether we should be providing
>> all
>> > possible combinations (basically duplicating a lot of code), or let
>>the
>> > user copy our model implementations.
>> > If there is enough interest, we can always add all combinations of
>> > (Long/Int, Double/Float/Null, Array/HashMap), although it would be
>>tiring
>> > and still not comprehensive (what about String ids?).
>> >
>> > >
>> > >Cheers,
>> > >
>> > >
>> > >Gianmarco
>> > >
>> > >
>> > >On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <alessandro@fb.com>
>> > >wrote:
>> > >
>> > >> Maja and I have thought of a way we could include this type of
>> > >> functionality without it being burdensome of the user.
>> > >>
>> > >> We could add the following interfaces:
>> > >>
>> > >> interface StrictRandomAccesVertexEdges extends VertexEdges {
>> > >> E getEdgeValue(I)
>> > >> }
>> > >>
>> > >> interface MultiRandomAccessVertexEdges extends VertexEdges {
>> > >> Iterable<E> getAllEdgeValues(I)
>> > >> }
>> > >>
>> > >> We could add the following to the Vertex interface:
>> > >>
>> > >> // Returns the first edge value for a given target vertex id (or
>>null
>> if
>> > >> none)
>> > >> E getEdgeValue(I targetVertexId) {
>> > >> // if we have an instance of StrictRandomAccessVertexEdges, we
>> > >>delegate;
>> > >> // otherwise, we iterate over the edges to find the first one
>> > >> }
>> > >>
>> > >> // Returns an Iterable over all edge values for a given target
>>vertex
>> id
>> > >> (only useful for multigraphs)
>> > >> Iterable<E> getAllEdgeValues(I targetVertexId) {
>> > >> // if we have an instance of MultiRandomAccessVertexEdges, we
>> > >>delegate;
>> > >> // otherwise, we wrap the edges into an Iterable that filters for
>> the
>> > >> matching ones
>> > >> }
>> > >>
>> > >> This way, only VertexEdges implementations where it makes sense
>>would
>> > >> define the randomaccess methods, and a default (slow!)
>>implementation
>> > >> would be available in all other cases.
>> > >>
>> > >> How does this sound? I could include it in GIRAPH528.
>> > >>
>> > >> On 3/3/13 4:57 PM, "Alessandro Presta" <alessandro@fb.com> wrote:
>> > >>
>> > >> >I don't contest the graphmatrix duality, but I'm not yet
>>convinced
>> > >>that
>> > >> >you need random access to the edges for the scenarios you
>>mentioned.
>> > >> >PageRank and label propagation, two of the most common
>>applications
>> of
>> > >>our
>> > >> >framework, have indeed a linear algebraic formulation, the bulk
of
>> > >>which
>> > >> >is always matrix multiplication. The most natural and efficient
>>(if
>> not
>> > >> >the only) implementations in our model involve simply iterating
>>over
>> > >>the
>> > >> >edges.
>> > >> >As for your "dot products" example, maybe it would help if you
>>could
>> > >>be a
>> > >> >bit more specific: what are your vertices and edges? How do we
get
>> the
>> > >> >"input vectors"?. Everything should be expressed in terms of
>>vertex
>> > >> >values, edges, and messages, otherwise it doesn't fit the current
>> > >>Giraph
>> > >> >model anyway.
>> > >> >In any case, to compute a dot product efficiently you don't need
>> random
>> > >> >access to both vectors: if you have the nonzero coordinates of
>> vector
>> > >>A
>> > >> >(equivalent of the edge list), and random access to vector B, you
>> > >>iterate
>> > >> >over A and lookup in B. This is efficient.
>> > >> >
>> > >> >> I'm not sure what this would correspond to in graphland, but
I
>>can
>> > >> >> certainly see
>> > >> >> wanting to have a big inmemory matrix which you can compute
dot
>> > >> >>products
>> > >> >> of vectors with each row of it, and Giraph would do this very
>> > >> >>efficiently
>> > >> >> (but some
>> > >> >> of the implementations of this would assume you had random
>>access
>> to
>> > >>the
>> > >> >> edges (meaning edgeId and edgeWeight of each vertex).
>> > >> >
>> > >> >
>> > >> >Here we are talking about a method with signature
>> > >> >
>> > >> > EdgeValueType getEdgeValue(VertexIdType targetVertexId)
>> > >> >
>> > >> >or the multigraph version
>> > >> >
>> > >> > Iterable<EdgeValueType> getEdgeValues(VertexIdType
>> targetVertexId)
>> > >> >
>> > >> >I'm looking for examples of algorithms that can't be implemented
>>in
>> > >>Giraph
>> > >> >because of the lack of this method, and would otherwise be a good
>> fit.
>> > >> >
>> > >> >Having said that, I think there *is* still a case for
>>hashmapbacked
>> > >> >edges, and that is when we need to guarantee a "strict graph"
>>(i.e.
>> no
>> > >> >parallel edges) or when our algorithm uses remote edge removal
>> > >>requests.
>> > >> >So I'm not against having an optimized LongDoubleHashMapEdges (and
>> > >> >LongNullHashSetEdges for the unweighted case) included, provided
>>that
>> > >>it's
>> > >> >correct and efficient.
>> > >> >I will make sure I include it. Now it's a matter of picking a
>> primitive
>> > >> >collections library among HPPC, fastutil, Mahout, Trove, etc...
>> > >> >
>> > >> >On 3/1/13 4:12 PM, "Jake Mannix" <jake.mannix@gmail.com>
wrote:
>> > >> >
>> > >> >>Sorry to have missed out on some of this conversation  had
some
>> work
>> > >> >>stuff
>> > >> >>interrupt (how dare it!)
>> > >> >>
>> > >> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta
>> > >> >><alessandro@fb.com>wrote:
>> > >> >>
>> > >> >>> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com>
>>wrote:
>> > >> >>>
>> > >> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
>> > >> >>> ><alessandro@fb.com>wrote:
>> > >> >>> >
>> > >> >>> >> It's not like it causes problems, it's just that
it's a
>>pretty
>> > >>big
>> > >> >>> >> dependency to justify for a small example.
>> > >> >>> >>
>> > >> >>> >> As for the motivation, if your point is to prove
the
>> framework's
>> > >> >>> >> superiority in some context, then you can use
the simplest
>> > >>possible
>> > >> >>> >> implementation (ArrayList).
>> > >> >>> >>
>> > >> >>> >
>> > >> >>> >This takes up LOTS of memory. Primitives rule, objects
drool
>> > >> >>>(memory).
>> > >> >>>
>> > >> >>> Ok, but having to copy the keys and values to external
arrays
>> > >>(which is
>> > >> >>> what you have to do with Mahout's hash map) is even worse.
A
>>good
>> > >> >>> implementation of (long, double) edges (e.g. for
>>RandomWalkVertex)
>> > >>is
>> > >> >>> primitive arrays.
>> > >> >>>
>> > >> >>
>> > >> >>So it is certainly true that the *typical* usage of iteration
>>over
>> > >>Mahout
>> > >> >>maps
>> > >> >>is by doing this copy of internal values (which is not efficient,
>> no),
>> > >> >>it's
>> > >> >>not necessary:
>>OpenLongFloatHashMap.forEachPair(LongFloatProcedure
>> p)
>> > >> >>passes in a callback which operates directly on the underlying
>> > >>primitive
>> > >> >>arrays.
>> > >> >>
>> > >> >>
>> > >> >>>
>> > >> >>> >
>> > >> >>> >
>> > >> >>> >> The Giraph framework is all about iterating over
edges, so
>>an
>> > >> >>> >> implementation that doesn't support that with
reasonable
>> > >>efficiency
>> > >> >>> >> doesn't make a lot of sense to me.
>> > >> >>> >>
>> > >> >>> >> It also follows that hash mapbased implementations
only
>>make
>> > >>sense
>> > >> >>>for
>> > >> >>> >> algorithms that make use of mutations.
>> > >> >>> >>
>> > >> >>> >
>> > >> >>> >Agreed, maps aren't absolutely necessary for immutable
graphs.
>> But
>> > >> >>>random
>> > >> >>> >access to collections of integers _is_ necessary for
many
>> > >>algorithms.
>> > >> >>> >It's
>> > >> >>> >not all just iteration over lists.
>> > >> >>>
>> > >> >>> Can you give me some concrete examples of algorithms where
>>random
>> > >> >>>access
>> > >> >>> to the edges is required?
>> > >> >>> I'm really interested in this, because I'm considering
killing
>> > >> >>> getEdgeValue() if there are no use cases.
>> > >> >>>
>> > >> >>
>> > >> >>So I'm not a graph person, I think in terms of matrices, but
>>since
>> > >>every
>> > >> >>graph
>> > >> >>has an adjecency matrix, and every matrix is the adjacency
>>matrix of
>> > >>some
>> > >> >>(possibly
>> > >> >>bipartite) graph, lets pretend they're basically the same:
>> > >> >>
>> > >> >>If you load a graph into Giraph, and want to compute the matrix
>> > >>product
>> > >> >>of
>> > >> >>this
>> > >> >>graph with collection of input vectors, then you may want to
take
>> > >>each of
>> > >> >>these
>> > >> >>input vectors and compute their dot products with the vertices
of
>> the
>> > >> >>graph
>> > >> >>(to
>> > >> >>mix my metaphors: each vertex is essentially a row or column
of
>>the
>> > >> >>matrix),
>> > >> >>which can be done by either iterating over the input vector
and
>> > >>looking
>> > >> >>up
>> > >> >>randomly for nonzero entries in the vertex, or the reverse.
>> > >> >>
>> > >> >>I'm not sure what this would correspond to in graphland, but
I
>>can
>> > >> >>certainly see
>> > >> >>wanting to have a big inmemory matrix which you can compute
dot
>> > >>products
>> > >> >>of vectors with each row of it, and Giraph would do this very
>> > >>efficiently
>> > >> >>(but some
>> > >> >>of the implementations of this would assume you had random
>>access to
>> > >>the
>> > >> >>edges (meaning edgeId and edgeWeight of each vertex).
>> > >> >>
>> > >> >>
>> > >> >>>
>> > >> >>> >
>> > >> >>> >
>> > >> >>> >> That said, something like a Trove hash map would
probably be
>> more
>> > >> >>> >> appropriate (more efficient than the standard
Java HashMap,
>>at
>> > >>the
>> > >> >>> >>expense
>> > >> >>> >> of generality). That could be a good candidate
for a
>> > >> >>> >> LongDoubleHashMapEdges implementation.
>> > >> >>> >> I can give that a shot if it sounds good.
>> > >> >>> >>
>> > >> >>> >
>> > >> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache
projects.
>> > >> >>>
>> > >> >>> Whoops, totally missed that part.
>> > >> >>>
>> > >> >>> >
>> > >> >>> >What does Giraph depend on of Mahout now? Just
>> mahoutcollections?
>> > >> >>> >That's
>> > >> >>> >not a very big dependency, and has all sorts of
>> > >>primitivetoprimitive
>> > >> >>> >collections,
>> > >> >>> >and from what I've seen benchmarked, just as good
or better
>>than
>> > >> >>>Trove.
>> > >> >>> > Carrot2's
>> > >> >>> >hppc may be better yet, but I'm not sure if that is
stably
>> released
>> > >> >>>yet.
>> > >> >>>
>> > >> >>> I'll take a look at HPPC and Mahout collections then.
They all
>> seem
>> > >>to
>> > >> >>> provide the same stuff, so I'll consider benchmarks and
>> convenience
>> > >>of
>> > >> >>>the
>> > >> >>> API. Thanks for the pointers.
>> > >> >>>
>> > >> >>> >
>> > >> >>> >
>> > >> >>> >>
>> > >> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com>
>> wrote:
>> > >> >>> >>
>> > >> >>> >> >Is the mahout dependency causing problems?
>> > >> >>> >> >
>> > >> >>> >> >It would be nice if we could actually implement
some of the
>> > >> >>>algorithms
>> > >> >>> >> >that
>> > >> >>> >> >Mahout does via mapreduce in Giraph's BSP
formalism, to
>>show
>> > >>off
>> > >> >>>how
>> > >> >>> >>it
>> > >> >>> >> >improves things. Using the Mahout primitives
can show that
>> it's
>> > >> >>>not
>> > >> >>> >>about
>> > >> >>> >> >the inner loop implementation, but the framework
itself...
>> > >> >>> >> >
>> > >> >>> >> >
>> > >> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
>> > >> >>> >> ><apache.mailbox@gmail.com>wrote:
>> > >> >>> >> >
>> > >> >>> >> >> I like the idea of refactoring it into
something more
>> > >>appropriate
>> > >> >>> >>for us
>> > >> >>> >> >> and ditching the Mahout dep. Good looking
out.
>> > >> >>> >> >>
>> > >> >>> >> >>
>> > >> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio
Martella <
>> > >> >>> >> >> claudio.martella@gmail.com> wrote:
>> > >> >>> >> >>
>> > >> >>> >> >> > I agree, at this point we could
have a RandomWalkVertex
>> with
>> > >> >>>edge
>> > >> >>> >> >>values,
>> > >> >>> >> >> > and a "nulledged" vertex for the
PR benchmarks.
>> > >> >>> >> >> > We make everybody happy and avoid
code duplication.
>> > >> >>> >> >> >
>> > >> >>> >> >> >
>> > >> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM,
Alessandro Presta
>> > >> >>> >><alessandro@fb.com
>> > >> >>> >> >> > >wrote:
>> > >> >>> >> >> >
>> > >> >>> >> >> > > Hi Gianmarco,
>> > >> >>> >> >> > >
>> > >> >>> >> >> > > Yes, there will be more efficient
implementations.
>> > >> >>> >> >> > > In the redesign I'm working
on (GIRAPH528), there
>>will
>> be
>> > >> >>>only
>> > >> >>> >>one
>> > >> >>> >> >> > Vertex
>> > >> >>> >> >> > > class and edge storage is
delegated to a VertexEdges
>> > >>class.
>> > >> >>> >> >> > > So far I'm adding some generic
implementations
>> > >> >>>(ByteArrayEdges,
>> > >> >>> >> >> > > ArrayListEdges, HashMapEdges)
that work for all
>>types,
>> and
>> > >> >>>some
>> > >> >>> >> >> optimized
>> > >> >>> >> >> > > ones (LongDoubleArrayEdges,
LongNullArrayEdges).
>> > >> >>> >> >> > >
>> > >> >>> >> >> > > Do you specifically need edge
values to be float
>>while
>> the
>> > >> >>>other
>> > >> >>> >> >>types
>> > >> >>> >> >> > are
>> > >> >>> >> >> > > double?
>> > >> >>> >> >> > > It seems to me it would make
sense to change
>> > >>RandomWalkVertex
>> > >> >>>to
>> > >> >>> >>use
>> > >> >>> >> >> > > double edge values instead,
and avoid code
>>duplication
>> > >>(i.e.
>> > >> >>> >>adding
>> > >> >>> >> >>a
>> > >> >>> >> >> > > LongFloatArrayEdges that's
basically the same). We're
>> not
>> > >> >>>Trove
>> > >> >>> >> >>after
>> > >> >>> >> >> > all.
>> > >> >>> >> >> > > Makes sense?
>> > >> >>> >> >> > >
>> > >> >>> >> >> > > Thanks for the feedback,
>> > >> >>> >> >> > >
>> > >> >>> >> >> > > Alessandro
>> > >> >>> >> >> > >
>> > >> >>> >> >> > >
>> > >> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco
De Francisci Morales"
>> > >> >>> >> >><gdfm@apache.org>
>> > >> >>> >> >> > > wrote:
>> > >> >>> >> >> > >
>> > >> >>> >> >> > > >Hi,
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >Maybe the specific implementation
can be thrown
>>away,
>> but
>> > >> >>> >> >>personally I
>> > >> >>> >> >> > > >feel
>> > >> >>> >> >> > > >very strongly for the
need of a good
>> > >>LongDoubleFloatDouble
>> > >> >>> >>vertex.
>> > >> >>> >> >> > > >It's the base for any
serious random walk algorithm.
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >I would call for a refactoring
rather than a
>>removal.
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >Just my 2c.
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >Cheers,
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >Gianmarco
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >On Thu, Feb 28, 2013 at
7:54 AM, Alessandro Presta
>> > >> >>> >> >> > > ><alessandro@fb.com>wrote:
>> > >> >>> >> >> > > >
>> > >> >>> >> >> > > >> Hi all,
>> > >> >>> >> >> > > >>
>> > >> >>> >> >> > > >> Does anyone feel
strongly for
>> > >>LongDoubleFloatDoubleVertex?
>> > >> >>> >> >> > > >> Reasons why I think
it should be removed:
>> > >> >>> >> >> > > >>
>> > >> >>> >> >> > > >> 1. Right now it's
incorrect (returns target
>>vertex
>> > >>id
>> > >> >>>as
>> > >> >>> >>edge
>> > >> >>> >> >> > value).
>> > >> >>> >> >> > > >> 2. Iteration will
always be inefficient, since
>>the
>> > >> >>> >>underlying
>> > >> >>> >> >> > Mahout
>> > >> >>> >> >> > > >> openaddressing hash
map implementation doesn't
>> provide
>> > >> >>> >> >>iterators.
>> > >> >>> >> >> It
>> > >> >>> >> >> > > >> provides a way to
copy the keys and values to
>> external
>> > >> >>> >> >>arrays/lists.
>> > >> >>> >> >> > > >> 3. It's the only
reason why we have Mahout as a
>> > >> >>>dependency.
>> > >> >>> >> >> > > >>
>> > >> >>> >> >> > > >> I think we should
strive to provide model
>> > >>implementations
>> > >> >>>that
>> > >> >>> >> >>are
>> > >> >>> >> >> > > >>generic
>> > >> >>> >> >> > > >> and/or extremely
efficient. This one satisfies
>> neither.
>> > >> >>> >> >> > > >>
>> > >> >>> >> >> > > >> Thanks,
>> > >> >>> >> >> > > >>
>> > >> >>> >> >> > > >> Alessandro
>> > >> >>> >> >> > > >>
>> > >> >>> >> >> > >
>> > >> >>> >> >> > >
>> > >> >>> >> >> >
>> > >> >>> >> >> >
>> > >> >>> >> >> > 
>> > >> >>> >> >> > Claudio Martella
>> > >> >>> >> >> > claudio.martella@gmail.com
>> > >> >>> >> >> >
>> > >> >>> >> >>
>> > >> >>> >> >
>> > >> >>> >> >
>> > >> >>> >> >
>> > >> >>> >> >
>> > >> >>> >> >
>> > >> >>> >> > jake
>> > >> >>> >>
>> > >> >>> >>
>> > >> >>> >
>> > >> >>> >
>> > >> >>> >
>> > >> >>> >
>> > >> >>> > jake
>> > >> >>>
>> > >> >>>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >> jake
>> > >> >
>> > >>
>> > >>
>> >
>> >
>>
>>
>> 
>> Claudio Martella
>> claudio.martella@gmail.com
>>
>
>
>
>
>
> jake
