spark-user mailing list archives

From 孫澤恩 <gn00710...@gmail.com>
Subject Re: How to merge fragmented IDs into one cluster if one/more IDs are shared
Date Fri, 06 Oct 2017 02:32:02 GMT
Hi there,

As for GraphX, I think the graph processing step parses your data into the form
(VertexA) - [Edge1] - (VertexB).
As we can see, the Graph class in GraphX contains edges and vertices.

So the first line of your data would be parsed into:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4 as vertices.
(uuid_3_1,uuid_3_2),(uuid_3_2,uuid_3_3),(uuid_3_3,uuid_3_4) as edges.
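
To make that parsing step concrete, here is a minimal sketch in plain Python (no Spark needed to run it). The function name `line_to_graph` is only for illustration, and chaining each ID to the next one on the same line is an assumption about how the edges are built; linking every ID to the first ID on the line would connect them just as well.

```python
def line_to_graph(line):
    """Split one comma-separated line of IDs into vertices and chain edges.

    Hypothetical helper: assumes IDs on a line are separated by commas
    and that neighbouring IDs should be linked by an edge.
    """
    ids = [part.strip() for part in line.split(",") if part.strip()]
    # De-duplicate while keeping first-seen order.
    vertices = list(dict.fromkeys(ids))
    # Pair each ID with the one that follows it on the same line.
    edges = list(zip(ids, ids[1:]))
    return vertices, edges


vertices, edges = line_to_graph("uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4")
print(vertices)
print(edges)
```

Because uuid_3_2 also appears on the second line of the data, the edges from both lines end up in the same connected component.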

This could produce the single merged result you want, but I think there should be a better
way than GraphX.

If you do want to use GraphX as the solution, there is a method I have used to convert a
UUID to a Long.
You could use a hash-based encode and decode function to convert a string to a Long and
back.
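
A rough sketch of that idea in plain Python follows. The names `uuid_to_long` and `build_lookup` are made up for this example, and SHA-256 truncated to 8 bytes is just one possible choice of hash, not necessarily the one to use. Note that a hash by itself cannot be reversed, so the "decode" side here is a lookup table from Long back to the original string; with 64 bits the chance of a collision is small but not zero, so the sketch checks for it.

```python
import hashlib


def uuid_to_long(uuid_str):
    """Map a string ID to a signed 64-bit integer (fits a JVM Long).

    Assumption: the first 8 bytes of a SHA-256 digest serve as the ID.
    """
    digest = hashlib.sha256(uuid_str.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def build_lookup(uuids):
    """Build the Long -> string table used to convert results back.

    Raises ValueError if two different strings hash to the same Long.
    """
    lookup = {}
    for uuid_str in uuids:
        key = uuid_to_long(uuid_str)
        existing = lookup.get(key)
        if existing is not None and existing != uuid_str:
            raise ValueError(f"hash collision: {existing!r} vs {uuid_str!r}")
        lookup[key] = uuid_str
    return lookup


ids = ["uuid_3_1", "uuid_3_2", "uuid_3_5"]
lookup = build_lookup(ids)
# Round trip: encode with the hash, decode with the lookup table.
print(lookup[uuid_to_long("uuid_3_2")])
```

In Spark the same mapping would presumably live in a distributed dataset and be joined back onto the result, rather than held in a single in-memory dict.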

Hope that would help you.

Sean Sun


> On 6 Oct 2017, at 3:35 AM, Tushar Sudake <etushar89@gmail.com> wrote:
> 
> Hello Sparkans,
> 
> I want to merge following cluster / set of IDs into one if they have shared IDs.
> 
> For example:
> 
> uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4
> uuid_3_2,uuid_3_5,uuid_3_6
> uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9
> into single:
> 
> uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,uuid_3_9
> because they're linked through 'uuid_3_2' and 'uuid_3_5'.
> 
> How can I do this in Spark?
> 
> One solution I can think of is to use Graphx. Keep adding links between two IDs and Graphx
> will take care of creating clusters. But these are UUIDs and Graphx only supports Long for
> VertexID. Also, my input data is huge (50 M Unique IDs), so maintaining collision free map
> of UUID <-> Long will be tough.
> 
> Any suggestions?
> 
> Thanks!

