spark-user mailing list archives

From Imran Rashid <iras...@cloudera.com>
Subject Re: Top, takeOrdered, sortByKey
Date Thu, 12 Mar 2015 00:23:41 GMT
I am not entirely sure I understand your question -- are you saying:

* scoring a sample of 50k events is fast
* taking the top N scores of 77M events is slow, no matter what N is

?

If so, this shouldn't come as a huge surprise.  You can't find the
top-scoring elements (no matter how small N is) unless you score all 77M of
them.  Very naively, you would expect scoring 77M events to take ~1500
times as long as scoring 50k events (77M / 50k = 1540), right?  The fact
that it doesn't take that much longer is probably because of the overhead of
just launching the jobs.
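
To make the point concrete, here is a minimal single-machine sketch in plain Java (not Spark code). The score function and the lowestN helper are hypothetical names made up for illustration; the real scoring logic is not shown in this thread. It keeps only the N best scores in a small heap, yet the scoring call still runs once per template, whatever N is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {
    // Hypothetical scoring function (an assumption, not the poster's code):
    // absolute energy difference, so 0 is a perfect match.
    static double score(double inputEnergy, double templateEnergy) {
        return Math.abs(inputEnergy - templateEnergy);
    }

    // Returns the n lowest scores in ascending order.
    static List<Double> lowestN(double inputEnergy, double[] templates, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap holding the best n scores seen so far; its head is the worst of them.
        PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (double t : templates) {
            double s = score(inputEnergy, t);   // runs once per template, regardless of n
            if (heap.size() < n) {
                heap.add(s);
            } else if (s < heap.peek()) {
                heap.poll();                    // drop the current worst of the best n
                heap.add(s);
            }
        }
        List<Double> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        double[] templates = {5.0, 1.0, 9.0, 3.5, 3.0};
        System.out.println(lowestN(3.0, templates, 2));   // prints [0.0, 0.5]
    }
}
```

Changing n only changes the size of the heap; the loop over all templates, and hence the number of score calls, stays the same.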



On Mon, Mar 9, 2015 at 4:21 PM, Saba Sehrish <ssehrish@fnal.gov> wrote:

>
>
>  *From:* Saba Sehrish <ssehrish@fnal.gov>
> *Date:* March 9, 2015 at 4:11:07 PM CDT
> *To:* <user-faq@spark.apache.org>
> *Subject:* *Using top, takeOrdered, sortByKey*
>
>   I am using Spark for a template matching problem. We have 77 million
> events in the template library, and we compare the energy of each input
> event with that of each template event and return a score. In the end we
> return the best 10000 matches with the lowest scores. A score of 0 is a
> perfect match.
>
>  I downsampled the problem to use only 50k events. For a single event
> matched against all the events in the template library (50k), I see
> 150-200 ms for score calculation on 25 cores (using a YARN cluster), but
> after that, when I perform either a top or takeOrdered or even sortByKey,
> the time rises to 25-50 s.
> So far I have not been able to figure out why there is such a huge gap
> going from a list of scores to a list of the top 1000 scores, and why
> sorting or getting the best X matches dominates by such a large factor.
> One thing I have noticed is that it doesn’t matter how many elements I
> return: the time range is the same 25-50 s for 10 - 10000 elements.
>
>  Any suggestions? Am I not using the API properly?
>
>  scores is a JavaPairRDD<Integer, Double>, and I do something like the
> following, where numbestmatches is 10, 100, 10000 or any number:
>
>   List <Tuple2<Integer, Double>> bestscores_list =
> scores.takeOrdered(numbestmatches, new TupleComparator());
>  Or
>  List <Tuple2<Integer, Double>> bestscores_list =
> scores.top(numbestmatches, new TupleComparator());
>  Or
>  // sortByKey returns a JavaPairRDD, not a List
>  JavaPairRDD<Integer, Double> sorted_scores = scores.sortByKey();
>
>
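
Since TupleComparator is not shown in the message, here is a hedged guess at what a score-ordering comparator could look like, again in plain Java. ScoreComparator and lowestFirst are hypothetical names, and Map.Entry<Integer, Double> stands in for Spark's Tuple2<Integer, Double> so that the sketch runs without Spark on the classpath. The comparator is marked Serializable because Spark needs to ship it to the executors. Note that takeOrdered returns the smallest elements under the comparator while top returns the largest, so with an ascending score comparator like this one, takeOrdered is the call that yields the lowest scores.

```java
import java.io.Serializable;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ScoreComparatorSketch {
    // Hypothetical stand-in for TupleComparator: orders (eventId, score) pairs
    // by ascending score, so the best match (score 0) comes first.
    static class ScoreComparator
            implements Comparator<Map.Entry<Integer, Double>>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public int compare(Map.Entry<Integer, Double> a, Map.Entry<Integer, Double> b) {
            return Double.compare(a.getValue(), b.getValue());
        }
    }

    // Local analogue of takeOrdered(n, comparator): the n smallest pairs, in order.
    static List<Map.Entry<Integer, Double>> lowestFirst(
            List<Map.Entry<Integer, Double>> scores, int n) {
        List<Map.Entry<Integer, Double>> sorted = new ArrayList<>(scores);
        sorted.sort(new ScoreComparator());
        int limit = Math.min(Math.max(n, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Double>> scores = new ArrayList<>();
        scores.add(new SimpleEntry<>(1, 2.5));
        scores.add(new SimpleEntry<>(2, 0.0));
        scores.add(new SimpleEntry<>(3, 1.25));
        // The two best matches: event 2 (score 0.0), then event 3 (score 1.25).
        for (Map.Entry<Integer, Double> e : lowestFirst(scores, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```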
