From Yong Zhang
Subject Re: can spark take advantage of ordered data?
Date Fri, 10 Mar 2017 15:45:05 GMT
I think it is an interesting requirement, but I am not familiar enough with Spark to say
whether it can be done in the latest Spark version.

From my understanding, you are looking for an API in Spark that reads the source
directly into a ShuffledRDD, which indeed needs K, V, and a Partitioner instance.

I don't think Spark provides that directly as of now, but in your case it makes sense to create
a JIRA asking Spark to support it in the future.

For right now, there may be ways to use Spark's developer API to do what you need, and I will
leave that to other Spark experts to confirm.


From: sourabh chaki
Sent: Friday, March 10, 2017 9:03 AM
To: Imran Rashid
Cc: Jonathan Coveney;
Subject: Re: can spark take advantage of ordered data?

My use case is also quite similar. I have 2 feeds, one 3TB and another 100GB. Both feeds
are generated by a Hadoop reduce operation and partitioned by the Hadoop HashPartitioner. The
3TB feed has 10K partitions, whereas the 100GB feed has 200 partitions.

Now when I do a join between these two feeds using Spark, Spark shuffles both RDDs and
the join takes a long time to complete. Can we do something so that Spark recognises the
existing partitions of the 3TB feed and shuffles only the 100GB feed?
Could it be a map-side scan for the bigger RDD and a shuffle read for the smaller RDD?

I have looked at the spark-sorted project, but that project does not utilise the pre-existing
partitions in the feed.
Any pointers would be helpful.
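
As an illustration of why the existing layout looks exploitable, here is a small sketch in
plain Python (not Spark API). It assumes both feeds were partitioned on the same join key with
the same hash function and a Hadoop-style `(hash & Integer.MAX_VALUE) % numPartitions` rule.
Under that assumption, every key in partition i of the 10K-partition feed must sit in
partition i % 200 of the 200-partition feed, because 200 divides 10,000. The function names,
the sample keys, and the use of `zlib.crc32` as a stand-in for Java's `hashCode` are
assumptions made for this example only.

```python
import zlib

BIG_PARTITIONS = 10_000   # partition count of the 3TB feed
SMALL_PARTITIONS = 200    # partition count of the 100GB feed


def hadoop_style_partition(key: str, num_partitions: int) -> int:
    """Hadoop-style rule: (hash & Integer.MAX_VALUE) % numPartitions.

    zlib.crc32 is only a deterministic stand-in for Java's hashCode.
    """
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_partitions


def matching_small_partition(big_index: int) -> int:
    """Small-feed partition that holds the keys of a given big-feed partition.

    Valid only because SMALL_PARTITIONS divides BIG_PARTITIONS evenly.
    """
    return big_index % SMALL_PARTITIONS


# Check the claim on hypothetical sample keys.
keys = [f"user-{i}" for i in range(50_000)]
mismatches = [
    k for k in keys
    if hadoop_style_partition(k, SMALL_PARTITIONS)
    != matching_small_partition(hadoop_style_partition(k, BIG_PARTITIONS))
]
print(len(mismatches))  # prints 0
```

If the assumption holds, each partition of the larger feed needs records from exactly one
partition of the smaller feed. Whether Spark can be told about this layout without a custom
RDD or Partitioner is not settled in this thread.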


On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid wrote:
Hi Jonathan,

you might be interested in (not yet available)
and (not part of spark, but it is available right
now).  Hopefully that's what you are looking for.  To the best of my knowledge, that covers
what is available now / what is being worked on.


On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney wrote:
Hello all,

I am wondering if spark already has support for optimizations on sorted data and/or if such
support could be added (I am comfortable dropping to a lower level if necessary to implement
this, but I'm not sure if it is possible at all).

Context: we have a number of data sets which are essentially already sorted on a key. With
our current systems, we can take advantage of this to do a lot of analysis in a very efficient
fashion: merges and joins, for example, can be done very efficiently, as can folds on a
secondary key and so on.

I was wondering if spark would be a fit for implementing these sorts of optimizations? Obviously
it is sort of a niche case, but would this be achievable? Any pointers on where I should look?
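
To make the kind of optimization being asked about concrete, here is a minimal sketch of a
merge join over two inputs that are already sorted by key. It is plain Python rather than
Spark code, and `merge_join` and the sample data are hypothetical names chosen for this
example. The point it shows is that sorted inputs can be joined in a single streaming pass,
with no hashing or re-sorting, holding only one key's worth of right-hand values in memory.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, each already sorted by key."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    done = (None, None)  # sentinel returned when a side is exhausted
    lk, lg = next(left_groups, done)
    rk, rg = next(right_groups, done)
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(left_groups, done)    # left key has no match yet
        elif lk > rk:
            rk, rg = next(right_groups, done)   # right key has no match yet
        else:
            # Keys match: buffer the right-hand values for this key only.
            right_values = [v for _, v in rg]
            for _, lv in lg:
                for rv in right_values:
                    yield lk, (lv, rv)
            lk, lg = next(left_groups, done)
            rk, rg = next(right_groups, done)


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = list(merge_join(left, right))
print(joined)
# prints [(2, ('b', 'x')), (2, ('c', 'x')), (4, ('d', 'z')), (4, ('d', 'w'))]
```

The same single-pass structure is what makes folds over a secondary key cheap on sorted data:
consecutive records with the same key arrive together, so they can be aggregated as they stream by.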

