spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Darabos <daniel.dara...@lynxanalytics.com>
Subject Re: Different Sorting RDD methods in Apache Spark
Date Tue, 09 Jun 2015 10:05:30 GMT
It would be even faster to load the data on the driver and sort it there
without using Spark :). Using reduce() is cheating, because it only works
as long as the data fits on one machine. That is not the targeted use case
of a distributed computation system. You can repeat your test with more
data (that doesn't fit on one machine) to see what I mean.

On Tue, Jun 9, 2015 at 8:30 AM, raggy <raghav0110.cs@gmail.com> wrote:

> For a research project, I tried sorting the elements in an RDD. I did this
> in
> two different approaches.
>
> In the first method, I applied a mapPartitions() function on the RDD, so
> that it would sort the contents of the RDD, and provide a result RDD that
> contains the sorted list as the only record in the RDD. Then, I applied a
> reduce function which basically merges sorted lists.
>
> I ran these experiments on an EC2 cluster containing 30 nodes. I set it up
> using the spark ec2 script. The data file was stored in HDFS.
>
> In the second approach I used the sortBy method in Spark.
>
> I performed these operation on the US census data(100MB) found here
>
> A single lines looks like this
>
> 9, Not in universe, 0, 0, Children, 0, Not in universe, Never married, Not
> in universe or children, Not in universe, White, All other, Female, Not in
> universe, Not in universe, Children or Armed Forces, 0, 0, 0, Nonfiler, Not
> in universe, Not in universe, Child <18 never marr not in subfamily, Child
> under 18 never married, 1758.14, Nonmover, Nonmover, Nonmover, Yes, Not in
> universe, 0, Both parents present, United-States, United-States,
> United-States, Native- Born in the United States, 0, Not in universe, 0, 0,
> 94, - 50000.
> I sorted based on the 25th value in the CSV. In this line that is 1758.14.
>
> I noticed that sortBy performs worse than the other method. Is this the
> expected scenario? If it is, why wouldn't the mapPartitions() and reduce()
> be the default sorting approach?
>
> Here is my implementation
>
> public static void sortBy(JavaSparkContext sc){
>         JavaRDD<String> rdd = sc.textFile("/data.txt",32);
>         long start = System.currentTimeMillis();
>         rdd.sortBy(new Function<String, Double>(){
>
>             @Override
>                 public Double call(String v1) throws Exception {
>                       // TODO Auto-generated method stub
>                   String [] arr = v1.split(",");
>                   return Double.parseDouble(arr[24]);
>                 }
>         }, true, 9).collect();
>         long end = System.currentTimeMillis();
>         System.out.println("SortBy: " + (end - start));
>   }
>
> public static void sortList(JavaSparkContext sc){
>         JavaRDD<String> rdd = sc.textFile("/data.txt",32); //parallelize(l,
> 8);
>         long start = System.currentTimeMillis();
>         JavaRDD<LinkedList&lt;Tuple2&lt;Double, String>>> rdd3 =
> rdd.mapPartitions(new FlatMapFunction<Iterator&lt;String>,
> LinkedList<Tuple2&lt;Double, String>>>(){
>
>         @Override
>         public Iterable<LinkedList&lt;Tuple2&lt;Double, String>>>
> call(Iterator<String> t)
>             throws Exception {
>           // TODO Auto-generated method stub
>           LinkedList<Tuple2&lt;Double, String>> lines = new
> LinkedList<Tuple2&lt;Double, String>>();
>           while(t.hasNext()){
>             String s = t.next();
>             String arr1[] = s.split(",");
>             Tuple2<Double, String> t1 = new Tuple2<Double,
> String>(Double.parseDouble(arr1[24]),s);
>             lines.add(t1);
>           }
>           Collections.sort(lines, new IncomeComparator());
>           LinkedList<LinkedList&lt;Tuple2&lt;Double, String>>> list
= new
> LinkedList<LinkedList&lt;Tuple2&lt;Double, String>>>();
>           list.add(lines);
>           return list;
>         }
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Different-Sorting-RDD-methods-in-Apache-Spark-tp23214.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Mime
View raw message