spark-issues mailing list archives

From "Al M (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account
Date Thu, 08 Jan 2015 08:25:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268969#comment-14268969 ]

Al M commented on SPARK-5137:
-----------------------------

Yes I do mean subtractByKey.  Sorry for not being clear.

I'm new to Spark and it could be that I just don't understand something correctly.  I've put
a more detailed description of the results I saw below.  I have default parallelism set to
160 since I am limited on memory and I am working with a lot of data.

* Map is run [11 tasks]
* Filter is run [2 tasks]
* Join with another RDD and run map [160 tasks]
* Join with another RDD and Map again [160 tasks]
* SubtractByKey is run [11 tasks]

In the last step I run out of memory because subtractByKey was only split into 11 tasks. 
If I override the partitions to 160 then it works fine.  I thought that subtractByKey would
use the default parallelism just like the other tasks after the join.

If the expected solution is that I override the partitions in my call, I'm fine with that.
So far I have managed to avoid setting it in any calls and just set the default parallelism
instead.  I was concerned that the behavior I observed was part of an actual issue.
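For reference, a minimal sketch of the workaround described above: reading the effective default parallelism from the SparkContext and passing it explicitly to subtractByKey via its numPartitions overload. The RDD names (left, right, toRemove) are hypothetical placeholders, not from the original report.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val conf = new SparkConf()  // assumes spark.default.parallelism=160 is set
val sc = new SparkContext(conf)

// sc.defaultParallelism reflects spark.default.parallelism when it is set
val parallelism = sc.defaultParallelism

// Hypothetical pair RDDs standing in for the pipeline described above
val joined = left.join(right)  // picks up 160 tasks, as observed

// Without the explicit argument, subtractByKey falls back to the
// partitioning of the largest parent RDD (11 tasks here); passing
// numPartitions forces the desired split.
val result = joined.subtractByKey(toRemove, parallelism)
```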

> subtract does not take the spark.default.parallelism into account
> -----------------------------------------------------------------
>
>                 Key: SPARK-5137
>                 URL: https://issues.apache.org/jira/browse/SPARK-5137
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>         Environment: CENTOS 6; scala
>            Reporter: Al M
>            Priority: Trivial
>
> The 'subtract' function (PairRDDFunctions.scala) in Scala does not use the default parallelism
> value set in the config (spark.default.parallelism).  This is easy enough to work around:
> I can just load the property and pass it in as an argument.
> It would be great if subtract used the default value, just like all the other PairRDDFunctions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

