spark-user mailing list archives

From Sun Rui <sunrise_...@163.com>
Subject Re: SparkR error when repartition is called
Date Tue, 09 Aug 2016 06:52:37 GMT
I can’t reproduce your issue with len=10000 in local mode.
Could you provide more information about your environment?
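
For reference, a minimal sketch of the kind of details that would help (the
exact commands are my suggestion, not a required checklist):

    # Run these in the same R session that hits the error.
    library(SparkR)
    packageVersion("SparkR")   # SparkR package version
    R.version.string           # R interpreter version
    Sys.getenv("SPARK_HOME")   # which Spark installation is in use
    sessionInfo()              # OS, locale, and loaded packages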
> On Aug 9, 2016, at 11:35, Shane Lee <shane_y_lee@yahoo.com.INVALID> wrote:
> 
> Hi All,
> 
> I am trying out SparkR 2.0 and have run into an issue with repartition. 
> 
> Here is the R code (essentially a port of the pi-calculating Scala example in the Spark package) that can reproduce the behavior:
> 
> schema <- structType(structField("input", "integer"),
>                      structField("output", "integer"))
> 
> library(magrittr)
> 
> len <- 3000
> data.frame(n = 1:len) %>%
>     as.DataFrame %>%
>     SparkR:::repartition(10L) %>%
>     dapply(., function(df) {
>         library(plyr)
>         # For each row, draw a random point in the square [-1, 1] x [-1, 1]
>         # and record 1L if it falls inside the unit circle, 0L otherwise.
>         ddply(df, .(n), function(y) {
>             data.frame(z = {
>                 x1 <- runif(1) * 2 - 1
>                 y1 <- runif(1) * 2 - 1
>                 z <- x1 * x1 + y1 * y1
>                 if (z < 1) 1L else 0L
>             })
>         })
>     }, schema) %>%
>     # The fraction of points inside the circle, times 4, estimates pi.
>     SparkR:::summarize(total = sum(.$output)) %>% collect * 4 / len
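> 
> For reference, a plain non-distributed R version of the same Monte Carlo
> estimator (my own sketch, not part of the failing pipeline) returns roughly
> 3.14 for large len, which is the result I expect from the code above:
> 
>     x1 <- runif(len) * 2 - 1
>     y1 <- runif(len) * 2 - 1
>     4 * sum(x1 * x1 + y1 * y1 < 1) / len   # Monte Carlo estimate of pi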
> 
> For me it runs fine as long as len is less than 5000; otherwise it errors out with the following message:
> 
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 56.0 failed 4 times, most recent failure: Lost task 6.3 in stage 56.0 (TID 899, LARBGDV-VM02): org.apache.spark.SparkException: R computation failed with
>  Error in readBin(con, raw(), stringLen, endian = "big") : 
>   invalid 'n' argument
> Calls: <Anonymous> -> readBin
> Execution halted
> 	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> 	at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
> 	at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
> 	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
> 	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$
> 
> If the repartition call is removed, it runs fine again, even with very large len.
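> 
> To isolate whether repartition itself is the trigger, one could compare the
> two calls below (a diagnostic sketch I have not run; the identity function
> stands in for the plyr body above):
> 
>     df <- as.DataFrame(data.frame(n = 1:6000))
>     idSchema <- structType(structField("n", "integer"))
>     collect(dapply(df, function(d) d, idSchema))                              # no repartition
>     collect(dapply(SparkR:::repartition(df, 10L), function(d) d, idSchema))   # with repartition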
> 
> After looking through the documentation and searching the web, I can't seem to find any clues about how to fix this. Has anybody seen a similar problem?
> 
> Thanks in advance for your help.
> 
> Shane
> 

