spark-dev mailing list archives

From Alex Nastetsky
Subject Sort Merge Join from the filesystem
Date Wed, 04 Nov 2015 14:37:53 GMT
(this is kind of a cross-post from the user list)

Does Spark support a sort merge join on two datasets stored on the file
system that are already partitioned identically (same partitioning key, same
number of partitions) and sorted within each partition, without
repartitioning or re-sorting them?
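To make the request concrete, here is a minimal sketch in plain Python (not
Spark API) of the join I have in mind. It assumes each dataset is a list of
partitions, that partition i of one dataset holds the same keys as partition i
of the other, and that each partition is already sorted by key. The names
merge_join and join_partitioned are made up for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, each already sorted by key."""
    # Group consecutive rows that share a key; this relies on the input order.
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    end = (None, None)  # sentinel returned once a side is exhausted
    lk, lg = next(left_groups, end)
    rk, rg = next(right_groups, end)
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(left_groups, end)    # left key has no match, skip it
        elif lk > rk:
            rk, rg = next(right_groups, end)   # right key has no match, skip it
        else:
            # Same key on both sides: emit the cross product of the two groups.
            right_values = [v for _, v in rg]
            for _, lv in lg:
                for rv in right_values:
                    yield lk, (lv, rv)
            lk, lg = next(left_groups, end)
            rk, rg = next(right_groups, end)


def join_partitioned(left_parts, right_parts):
    """Join partition i of one dataset with partition i of the other; no shuffle, no sort."""
    if len(left_parts) != len(right_parts):
        raise ValueError("both datasets must have the same number of partitions")
    for lp, rp in zip(left_parts, right_parts):
        yield from merge_join(lp, rp)


# Toy data, partitioned by key % 2 and sorted by key within each partition.
a_parts = [[(2, "a2"), (2, "a2b")], [(1, "a1"), (3, "a3")]]
b_parts = [[(2, "b2")], [(1, "b1"), (5, "b5")]]
result = list(join_partitioned(a_parts, b_parts))
```

Each pair of partitions is read once, in order, and only the rows for the
current key need to be held in memory, which is why skipping the shuffle and
sort matters for large inputs.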

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does
the Spark architecture pose any particular difficulties for this feature?

It is very useful to have this ability, as you can prepare dataset A to be
joined with dataset B before B even exists, by pre-processing A with a

