spark-user mailing list archives

From Koert Kuipers <>
Subject Re: Spark join over sorted columns of dataset.
Date Fri, 03 Mar 2017 16:23:11 GMT
For RDDs the shuffle is already skipped, but the sort is not. In spark-sorted
we track both the partitioning and the sort order within partitions for
key-value RDDs, so the sort can be avoided. See:

For Dataset/DataFrame such optimizations are done automatically; however,
this currently does not always work for Dataset, see:

On Mar 3, 2017 11:06 AM, "Rohit Verma" <> wrote:

Sending this to the devs.
Could you please help me with some ideas for the question below?

> On Feb 23, 2017, at 3:47 PM, Rohit Verma <> wrote:
> Hi
> When joining two datasets on a column, how can the join be optimized if
> the join column is already sorted within each dataset,
> so that when Spark does a sort-merge join the sorting phase can be skipped?
> Regards
> Rohit
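
To illustrate why pre-sorted input makes the sort phase unnecessary, here is a minimal sketch of the merge step of a sort-merge join. It is plain Python, not Spark code or Spark's implementation; the function name merge_join and the sample rows are hypothetical. It assumes both inputs are already sorted by key, in which case a single linear pass over each side is enough.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs.

    Assumption: both inputs are already sorted by key, so no sort is
    performed here; each side is read exactly once, in order.
    """
    # Collapse consecutive rows with the same key into (key, [values]).
    lgroups = ((k, [v for _, v in g]) for k, g in groupby(left, key=itemgetter(0)))
    rgroups = ((k, [v for _, v in g]) for k, g in groupby(right, key=itemgetter(0)))

    done = object()  # sentinel marking an exhausted side
    l = next(lgroups, done)
    r = next(rgroups, done)
    while l is not done and r is not done:
        if l[0] < r[0]:
            l = next(lgroups, done)  # left key has no match yet, advance left
        elif l[0] > r[0]:
            r = next(rgroups, done)  # right key has no match yet, advance right
        else:
            # Keys match: emit every left/right combination for this key.
            for lv in l[1]:
                for rv in r[1]:
                    yield (l[0], lv, rv)
            l = next(lgroups, done)
            r = next(rgroups, done)


# Hypothetical sample data, both sides sorted by key.
left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
for row in merge_join(left, right):
    print(row)
```

If either input were not sorted by key, this pass would silently miss matches, which is why an engine has to know the sort order before it can drop the sort.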
