spark-user mailing list archives

From Anwar Rizal <anriza...@gmail.com>
Subject Re: Selecting first ten values in a RDD/partition
Date Thu, 29 May 2014 13:37:07 GMT
Can you clarify what you're trying to achieve here?

If you only want the top 10 of each RDD, why not sort each RDD and then call
take(10) on it?

Or do you want the top 10 over the five-minute window?
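
If it is the second case, here is a rough sketch of the per-partition idea
from your message: keep only the 10 largest counts in each partition, then
merge those small candidate lists. It is plain Python with no Spark
dependency, so the partitions are just lists of (hashtag, count) pairs, and
the names top_n_in_partition and top_n_overall are made up for illustration.
It assumes the counts have already been summed per hashtag, so that each
hashtag appears in exactly one partition; otherwise a per-partition cut could
drop a hashtag whose total is large but spread across partitions.

```python
import heapq
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]  # (hashtag, count)


def top_n_in_partition(pairs: Iterable[Pair], n: int = 10) -> List[Pair]:
    """Return the n pairs with the largest counts from one partition.

    heapq.nlargest keeps only n items in memory, so the partition is
    never fully sorted.
    """
    return heapq.nlargest(n, pairs, key=lambda kv: kv[1])


def top_n_overall(partitions: Iterable[Iterable[Pair]], n: int = 10) -> List[Pair]:
    """Merge the per-partition candidates into one global top-n list.

    Assumption: each hashtag lives in exactly one partition, so the
    global top n is guaranteed to be among the per-partition top n.
    """
    candidates = [
        kv for part in partitions for kv in top_n_in_partition(part, n)
    ]
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])


if __name__ == "__main__":
    # Two hypothetical partitions of already-aggregated hashtag counts.
    partitions = [
        [("#spark", 5), ("#scala", 1), ("#bigdata", 9)],
        [("#hadoop", 7), ("#yarn", 2)],
    ]
    print(top_n_overall(partitions, n=2))
```

In Spark itself the same shape would probably be a mapPartitions step
followed by a small merge on the driver. As far as I know, the RDD top(n)
method already works roughly this way, so calling it on each RDD of the
DStream may be all you need.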

Cheers,



On Thu, May 29, 2014 at 2:04 PM, nilmish <nilmish.iit@gmail.com> wrote:

> I have a DStream which consists of RDDs partitioned every 2 sec. I have
> sorted each RDD and want to retain only the top 10 values and discard the
> rest. How can I retain only the top 10 values?
>
> I am trying to get the top 10 hashtags. Instead of sorting all of the
> 5-minute counts (thereby incurring the cost of a data shuffle), I am
> trying to get the top 10 hashtags in each partition. I am stuck on how to
> retain the top 10 hashtags in each partition.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
