spark-user mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: parallel Reduce within a key
Date Fri, 20 Jun 2014 17:55:51 GMT
Currently, the reduce operation combines the results from the mappers
sequentially, so it's O(n).

Xiangrui is working on treeReduce, which is O(log(n)). Based on the
benchmarks, it dramatically increases performance. You can test the
code in his branch:
https://github.com/apache/spark/pull/1110
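To see why the depth drops from O(n) to O(log(n)), here is a minimal plain-Python sketch (not Spark code, and not the actual treeReduce implementation) of combining per-partition results sequentially versus pairwise in a tree, assuming an associative combine function. With n partial results, the sequential fold needs n-1 dependent steps, while the tree needs only ceil(log2(n)) rounds, and the combines within each round are independent and could run in parallel:

```python
import math

def sequential_reduce(parts, combine):
    # Fold the partial results one at a time: n-1 dependent steps (O(n) depth).
    acc = parts[0]
    for p in parts[1:]:
        acc = combine(acc, p)
    return acc

def tree_reduce(parts, combine):
    # Combine adjacent pairs each round: ceil(log2(n)) rounds (O(log n) depth).
    # An odd element at the end of a round is carried forward unchanged.
    rounds = 0
    while len(parts) > 1:
        parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
        rounds += 1
    return parts[0], rounds

# e.g. one partial result per partition, 64 partitions as in the question below
parts = list(range(64))
total, rounds = tree_reduce(parts, lambda a, b: a + b)
assert total == sequential_reduce(parts, lambda a, b: a + b) == sum(range(64))
assert rounds == math.ceil(math.log2(len(parts)))  # 6 rounds instead of 63 steps
```

The wall-clock win comes from the depth reduction: with enough executors, each tree round runs its combines concurrently, so 64 partial results take 6 rounds instead of 63 serial steps.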

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Jun 20, 2014 at 6:57 AM, ansriniv <ansriniv@gmail.com> wrote:
> Hi,
>
> I am on Spark 0.9.0
>
> I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32
> cores in the cluster).
> I have an input rdd with 64 partitions.
>
> I am running "sc.mapPartitions(...).reduce(...)"
>
> I can see that I get full parallelism on the mapper (all my 32 cores are
> busy simultaneously). However, when it comes to reduce(), the outputs of the
> mappers are all reduced SERIALLY. Further, all the reduce processing happens
> only on 1 of the workers.
>
> I was expecting that the outputs of the 16 mappers on node 1 would be
> reduced in parallel in node 1 while the outputs of the 16 mappers on node 2
> would be reduced in parallel on node 2 and there would be 1 final inter-node
> reduce (of node 1 reduced result and node 2 reduced result).
>
> Isn't parallel reduce supported WITHIN a key (in this case I have no key)?
> (I know that there is parallelism in reduce across keys)
>
> Best Regards
> Anand
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallel-Reduce-within-a-key-tp7998.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
