spark-user mailing list archives

From Tushar Sudake <etusha...@gmail.com>
Subject How to merge fragmented IDs into one cluster if one/more IDs are shared
Date Thu, 05 Oct 2017 19:35:44 GMT
Hello Sparkans,

I want to merge the following clusters (sets of IDs) into one whenever they
share at least one ID.

For example:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4
uuid_3_2,uuid_3_5,uuid_3_6
uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9

into single:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,uuid_3_9

because they're linked through 'uuid_3_2' and 'uuid_3_5'.
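
To make the intended merge rule concrete, here is a minimal single-machine sketch in plain Python. It uses union-find (disjoint sets), which is one standard way to group items that are transitively linked. It is not a Spark solution and would not scale to the full data set as written; the function name `merge_clusters` is hypothetical and only illustrates the expected result.

```python
def merge_clusters(clusters):
    """Merge ID lists that share at least one ID, using union-find."""
    parent = {}  # each ID points to its parent; a root points to itself

    def find(x):
        # Register unseen IDs as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for ids in clusters:
        if not ids:
            continue
        first = ids[0]
        find(first)  # make sure single-ID clusters are registered too
        for other in ids[1:]:
            union(first, other)

    # Group every ID under its root.
    merged = {}
    for x in list(parent):
        merged.setdefault(find(x), set()).add(x)
    return sorted(sorted(group) for group in merged.values())


lines = [
    "uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4",
    "uuid_3_2,uuid_3_5,uuid_3_6",
    "uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9",
]
result = merge_clusters([line.split(",") for line in lines])
```

For the three lines above, `result` holds a single list containing all nine IDs, because the first two lines share 'uuid_3_2' and the last two share 'uuid_3_5'.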

How can I do this in Spark?

One solution I can think of is to use GraphX: add an edge between each pair
of linked IDs and let GraphX compute the clusters (connected components). But
these are UUIDs, and GraphX only supports Long for VertexId. Also, my input
data is huge (50 million unique IDs), so maintaining a collision-free
UUID <-> Long map will be tough.
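
On the UUID <-> Long concern, the sketch below shows the idea in plain Python: if the Long values are handed out from a counter rather than derived by hashing, the mapping is collision-free by construction. Each input line is then turned into "star" edges from its first ID to the others, which is the kind of edge list a connected-components routine expects. The function name `to_long_edges` is hypothetical. In Spark, `RDD.zipWithUniqueId` applied to the distinct UUIDs could plausibly play the role of the counter, but that is an assumption about how one might adapt this, not something tested here at 50 million IDs.

```python
def to_long_edges(clusters):
    """Assign each distinct UUID a unique integer and emit star-shaped edges."""
    ids = {}    # UUID -> integer ID, assigned in order of first appearance
    edges = []  # (src, dst) pairs of integer IDs
    for cluster in clusters:
        # setdefault keeps an existing ID, or assigns the next free counter value.
        nums = [ids.setdefault(u, len(ids)) for u in cluster]
        # Link the first ID of the line to every other ID on that line.
        edges.extend((nums[0], n) for n in nums[1:])
    return ids, edges


lines = [
    "uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4",
    "uuid_3_2,uuid_3_5,uuid_3_6",
    "uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9",
]
ids, edges = to_long_edges([line.split(",") for line in lines])
```

Star edges keep the edge count linear in the line length (one edge per extra ID), whereas linking every pair on a line would grow quadratically. The `ids` dictionary can be inverted afterwards to translate the integer IDs in each component back to UUIDs.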

Any suggestions?

Thanks!
