spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Writing to HDFS
Date Tue, 04 Aug 2015 09:07:55 GMT
Just to add: rdd.take(1) won't trigger the entire computation; it only pulls
out the first record, evaluating as few partitions as needed. You need to do
an rdd.count() or one of the rdd.saveAs*File actions to trigger the complete
pipeline. How many partitions do you see in the last stage?
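
As a quick illustration (a minimal sketch, assuming a running SparkContext
`sc`; the HDFS paths and the map step are hypothetical), the difference in
how much work each action triggers looks like this:

```scala
import org.apache.hadoop.io.compress.SnappyCodec

// Hypothetical input path and transformation, for illustration only.
val rdd = sc.textFile("hdfs:///path/to/input")
  .map(line => line.toUpperCase)

// take(1) is cheap: Spark evaluates only enough partitions
// to return a single record, so it can succeed even when a
// full materialization would run out of heap.
rdd.take(1)

// count() and saveAsTextFile() are full actions: they run the
// complete pipeline across every partition, which is where
// memory pressure during the write would actually show up.
rdd.count()
rdd.saveAsTextFile("hdfs:///path/to/output", classOf[SnappyCodec])
```

This is why a job can pass an rdd.take(1) smoke test and still fail at the
save step: the save is the first action that touches all partitions.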

Thanks
Best Regards

On Tue, Aug 4, 2015 at 7:10 AM, ayan guha <guha.ayan@gmail.com> wrote:

> Is your data skewed? What happens if you do rdd.count()?
> On 4 Aug 2015 05:49, "Jasleen Kaur" <jasleenkaur1291@gmail.com> wrote:
>
>> I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster
>> is not an option due to permission issues).
>>
>>    - num-executors 800
>>    - spark.akka.frameSize=1024
>>    - spark.default.parallelism=25600
>>    - driver-memory=4G
>>    - executor-memory=32G
>>    - My input size is around 1.5TB.
>>
>> My problem is that when I execute rdd.saveAsTextFile(outputPath,
>> classOf[org.apache.hadoop.io.compress.SnappyCodec]), I get a heap space
>> error (saving as Avro is also not an option; I have tried saveAsSequenceFile
>> with GZIP and saveAsNewAPIHadoopFile with the same result). On the other
>> hand, if I execute rdd.take(1), I get no such issue, so I assume the
>> problem occurs during the write.
>>
>
