spark-user mailing list archives

From bchazalet <bchaza...@companywatch.net>
Subject sort algorithm using sortBy
Date Tue, 02 Dec 2014 17:19:59 GMT
I am trying to understand the sort algorithm that is used in RDD#sortBy. I
have read this post from Matei
<http://apache-spark-user-list.1001560.n3.nabble.com/Complexity-Efficiency-of-SortByKey-tp14328p14332.html>
and it already helps a little.

I'd like to understand the distributed merge sort better because, in my case,
the sort takes 10 times longer on a field whose values are poorly distributed
(the field's value is 0 for many of the items) than on a field whose values
are better distributed.

In particular, I am wondering whether the sort algorithm can be modified, or
replaced with one that would better fit the first distribution (given that
the distribution is known in advance).
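
To make the question concrete, here is a toy model in plain Python, not
Spark code. It assumes the sort first range-partitions the records using
boundaries taken from a sample of the keys and then sorts each partition
locally. The names range_partition_sizes, plain and salted are made up for
this sketch, and pairing each value with a unique tiebreaker is only a
hypothetical workaround, not something confirmed to help in Spark.

```python
import bisect
import random


def range_partition_sizes(keys, num_partitions, sample_size=200, seed=0):
    """Count how many keys land in each range partition.

    Boundaries are picked from a sorted random sample of the keys, which is
    an assumption about how a distributed sort might split its input.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # num_partitions - 1 evenly spaced boundaries taken from the sample.
    bounds = [sample[(i * len(sample)) // num_partitions]
              for i in range(1, num_partitions)]
    sizes = [0] * num_partitions
    for key in keys:
        # Keys less than or equal to bounds[i] go to partition i or earlier.
        sizes[bisect.bisect_left(bounds, key)] += 1
    return sizes


# 9000 items with value 0, plus 1000 items with distinct values.
values = [0] * 9000 + list(range(1, 1001))

# Partitioning on the raw value: every 0 compares equal, so all of them
# fall on the same side of every boundary.
plain = range_partition_sizes(values, num_partitions=8)

# Partitioning on (value, unique index): ties are broken, so the zeros can
# be split across several partitions while the overall order is preserved.
salted = range_partition_sizes(
    [(v, i) for i, v in enumerate(values)], num_partitions=8)
```

In this model the raw key sends every zero to a single partition, so one
task would do almost all of the sorting, while the composite key spreads the
same records over all partitions.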

I'll be happy to look at the code myself, if someone could provide me with a
pointer to the file(s) I should have a look at.





