spark-user mailing list archives

From Balakumar iyer S <bala93ku...@gmail.com>
Subject Re: Spark 2.3 Dataframe Grouby operation throws IllegalArgumentException on Large dataset
Date Wed, 24 Jul 2019 04:15:10 GMT
Hi Bobby Evans,

I apologise for the delayed response. Yes, you are right, I missed pasting
the complete stack trace. I have attached the complete YARN log for the
same.

Thank you. It would be helpful if you could assist me with this error.

-----------------------------------------------------------------------------------------------------------------------------------------
Regards
Balakumar Seetharaman


On Mon, Jul 22, 2019 at 7:05 PM Bobby Evans <bobby@apache.org> wrote:

> You are missing a lot of the stack trace that could explain the
> exception.  All it shows is that an exception happened while writing out
> the ORC file, not what the underlying exception is; there should be at
> least one more "Caused by" under the one you included.
>
> Thanks,
>
> Bobby
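A minimal sketch of one way to surface the full cause chain at the call
site, assuming the wind_2, SaveMode, and args names from the snippet quoted
below:

    // Print every "Caused by" level before rethrowing, so the root
    // cause stays visible even if the driver log gets truncated.
    try {
      wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
    } catch {
      case e: Throwable =>
        var cause: Throwable = e
        while (cause != null) {
          println(s"Caused by: ${cause.getClass.getName}: ${cause.getMessage}")
          cause = cause.getCause
        }
        throw e
    }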
>
> On Mon, Jul 22, 2019 at 5:58 AM Balakumar iyer S <bala93kumar@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to perform a groupBy followed by an aggregate collect_set
>> operation on a two-column dataset with schema (LeftData int, RightData
>> int).
>> Code snippet:
>>
>>   import org.apache.spark.sql.SaveMode
>>   import org.apache.spark.sql.functions.{array, collect_set}
>>
>>   val wind_2 =
>>     dframe.groupBy("LeftData").agg(collect_set(array("RightData")))
>>   wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
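As an aside on the snippet above: collect_set(array("RightData")) builds a
set of single-element arrays per key. If plain values are all that is
needed, collecting the column directly produces smaller rows (a sketch
assuming the same dframe; wind_alt and RightValues are hypothetical names):

    import org.apache.spark.sql.functions.{col, collect_set}

    // Hypothetical alternative: aggregate the ints themselves rather
    // than wrapping each one in a single-element array.
    val wind_alt = dframe.groupBy("LeftData")
      .agg(collect_set(col("RightData")).as("RightValues"))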
>>
>> The above code works fine on a smaller dataset but throws the following
>> error on a large dataset (where each key in the LeftData column needs to
>> be grouped with approximately 64k values).
>>
>> Could someone assist me with this? Should I set any configuration to
>> accommodate such large values?
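For reference, a minimal sketch of settings that are commonly raised for
wide aggregations such as ~64k values per key; the specific values here are
assumptions and need tuning for the actual cluster and data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("large-collect-set")                    // hypothetical app name
      .config("spark.sql.shuffle.partitions", "2000")  // spread the shuffle wider
      .config("spark.executor.memory", "8g")           // heap for large aggregation buffers
      .config("spark.executor.memoryOverhead", "2g")   // off-heap headroom on YARN (Spark 2.3+)
      .getOrCreate()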
>>
>> ERROR
>> ---------------------------------
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at scala.Option.foreach(Option.scala:257)
>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>>
>> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>>
>> --
>> REGARDS
>> BALAKUMAR SEETHARAMAN
>>
>>

-- 
REGARDS
BALAKUMAR SEETHARAMAN
