spark-user mailing list archives

From Ameet Kini
Subject examples of map-side join of two hadoop sequence files
Date Fri, 18 Oct 2013 21:20:02 GMT
I've seen discussions where the suggestion is to do a map-side join, but I
haven't seen an example yet and could certainly use one. I have two sequence
files where the key is unique within each file, so the join is one-to-one
and should benefit from a map-side join. However, both sequence files can be
large, so reading one of them completely in the driver and broadcasting it
out would be expensive.

I don't think there is a map-side join implementation in Spark, but earlier
suggestions have been to write one using mapPartitions on one of the
operands as the outer loop. If that is the case, how would I fetch the split
of the other file that corresponds to the keys in the outer's partition? I'd
prefer to do a fetch-per-partition rather than a fetch-per-tuple.
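
To make the fetch-per-partition idea concrete, here is a minimal sketch in
plain Python rather than the Spark API. It assumes both inputs were split
with the same partitioner and the same number of partitions, so that
partition i of one file holds exactly the keys that can match partition i of
the other. The names partition_of, split, join_partition and map_side_join
are hypothetical helpers for illustration, not Spark or Hadoop functions.

```python
from typing import Iterator, List, Tuple

NUM_PARTITIONS = 4  # assumption: both files share this partition count

Record = Tuple[int, str]


def partition_of(key: int) -> int:
    # Hypothetical partitioner; both inputs must use the same one.
    return key % NUM_PARTITIONS


def split(records: List[Record]) -> List[List[Record]]:
    # Stand-in for two files that were written co-partitioned.
    parts: List[List[Record]] = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in records:
        parts[partition_of(key)].append((key, value))
    return parts


def join_partition(
    outer_part: List[Record], inner_part: List[Record]
) -> Iterator[Tuple[int, Tuple[str, str]]]:
    # Keys are unique within each file, so a dict lookup is a 1:1 join.
    lookup = dict(inner_part)
    for key, value in outer_part:
        if key in lookup:
            yield key, (value, lookup[key])


def map_side_join(outer_parts, inner_parts):
    fetches = 0
    joined = []
    for index, outer_part in enumerate(outer_parts):
        inner_part = inner_parts[index]  # one fetch per partition, not per tuple
        fetches += 1
        joined.extend(join_partition(outer_part, inner_part))
    return joined, fetches


left = [(1, "a"), (2, "b"), (5, "e"), (8, "h")]
right = [(1, "A"), (5, "E"), (8, "H"), (9, "I")]
joined, fetches = map_side_join(split(left), split(right))
print(sorted(joined), fetches)
```

The sketch only works because the partition boundaries of the two inputs
line up. Whether and how that can be arranged for two existing sequence
files is the part I am unsure about.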

In any case, some feedback and, preferably, an example of a map-side join
without broadcasting would help.
