spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bharath Ravi Kumar <reachb...@gmail.com>
Subject Re: Implementing percentile through top Vs take
Date Thu, 31 Jul 2014 10:17:52 GMT
Thanks. Turns out I needed the RDD sorted for another purpose, so keeping a
sorted pair rdd anyway made sense. And apologies for not reading the source
for top before asking the question (/*poor attempt to save time*/).


On Thu, Jul 31, 2014 at 12:34 AM, Sean Owen <sowen@cloudera.com> wrote:

> No, it's definitely not done on the driver. It works as you say. Look
> at the source code for RDD.takeOrdered, which is what top calls.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
>
> On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar <reachbach@gmail.com>
> wrote:
> > I'm looking to select the top n records (by rank) from a data set of a
> few
> > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
> > entirely a driver-side operation in that all records are sorted in the
> > driver's memory. I prefer an approach where the records are sorted on the
> > cluster and only the top ones sent to the driver. I'm hence leaning
> towards
> > creating a  JavaPairRDD on a key, then sorting the rdd by key and
> invoking
> > take(N). I'd like to know if rdd.top achieves the same result (while
> being
> > executed on the cluster) as take or if my assumption that it's a driver
> side
> > operation is correct.
> >
> > Thanks,
> > Bharath
>

Mime
View raw message