spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: repartition combined with zipWithIndex get stuck
Date Sun, 16 Nov 2014 04:50:01 GMT
PR: https://github.com/apache/spark/pull/3291. For now, here is a workaround:

val a = sc.parallelize(1 to 10).zipWithIndex()
a.partitions // call .partitions explicitly
a.repartition(10).count()

Thanks for reporting the bug! -Xiangrui
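
For context, here is a minimal sketch of the offset arithmetic that zipWithIndex performs (plain Scala, no Spark dependency; the object and method names are illustrative, not Spark's internals). zipWithIndex first runs a job to count the elements in each partition, then assigns each element a global index equal to its partition's cumulative start offset plus its local index. It is this extra job, launched lazily from within getPartitions, that interacts badly with repartition:

```scala
// Hedged sketch: how global indices can be derived from per-partition
// counts. Partition i's start offset is the sum of the sizes of
// partitions 0..i-1; each element's global index = offset + local index.
object ZipWithIndexSketch {
  def indexPartitions[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    // scanLeft produces the running start offset for each partition
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).map { case (part, start) =>
      part.zipWithIndex.map { case (x, i) => (x, start + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e"))
    println(indexPartitions(parts))
  }
}
```

Calling a.partitions up front forces that counting job to run eagerly, which is why the workaround above unblocks the subsequent repartition.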



On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> I think I understand where the bug is now. I created a JIRA
> (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR
> soon. -Xiangrui
>
> On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>> This is a bug. Could you make a JIRA? -Xiangrui
>>
>> On Sat, Nov 15, 2014 at 3:27 AM, lev <katzav@gmail.com> wrote:
>>> Hi,
>>>
>>> I'm having trouble using zipWithIndex and repartition together. When I use
>>> them both, the following action gets stuck and never returns.
>>> I'm using spark 1.1.0.
>>>
>>>
>>> Those 2 lines work as expected:
>>>
>>> scala> sc.parallelize(1 to 10).repartition(10).count()
>>> res0: Long = 10
>>>
>>> scala> sc.parallelize(1 to 10).zipWithIndex.count()
>>> res1: Long = 10
>>>
>>>
>>> But this statement gets stuck and doesn't return:
>>>
>>> scala> sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
>>> 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at
>>> Option.scala:120
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at
>>> Option.scala:120) with 3 output partitions (allowLocal=false)
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at
>>> Option.scala:120)
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage:
>>> List()
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List()
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4
>>> (ParallelCollectionRDD[7] at parallelize at <console>:13), which has no
>>> missing parents
>>> 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called
>>> with curMem=7616, maxMem=138938941
>>> 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as
>>> values in memory (estimated size 1096.0 B, free 132.5 MB)
>>>
>>>
>>> Am I doing something wrong here, or is it a bug?
>>> Is there some workaround?
>>>
>>> Thanks,
>>> Lev.
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>

