spark-user mailing list archives

From Sabarish Sasidharan <sabarish.sasidha...@manthan.com>
Subject Re: .cache() changes contents of RDD
Date Sat, 27 Feb 2016 15:02:48 GMT
This is because Hadoop Writables are reused across records: the input format returns the same mutable object for every row. Map each record to a custom (immutable) type first, and then do further operations, including cache(), on that.
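The effect is easy to reproduce without Spark. The sketch below (plain Scala, with an assumed `Record` class standing in for a Hadoop Writable) mimics an input format that mutates one shared object per row. Storing the records themselves keeps N references to that single object, so every "cached" row shows the last value read; copying out an immutable value first, as suggested above, gives the expected result.

```scala
// `Record` plays the role of a reused Hadoop Writable (hypothetical name).
class Record(var value: String)

val data = Seq("a", "b", "c")
val shared = new Record("")

// Like newAPIHadoopRDD's record reader: the SAME object is returned for
// every row, with its contents overwritten in place.
def readRows(): Iterator[Record] =
  data.iterator.map { v => shared.value = v; shared }

// Caching the records themselves stores three references to one object;
// by the time we read them, all hold the final value.
val cachedRefs = readRows().toList.map(_.value)

// Copying to an immutable value (mapping to a plain type) before
// materializing is safe, because each element is read as it is produced.
val cachedCopies = readRows().map(_.value).toList

println(cachedRefs)    // every element identical (the last value read)
println(cachedCopies)  // the three distinct values, as expected
```

Without cache(), Spark recomputes the map lazily per consumer, so the value is usually extracted before the Writable is overwritten; with cache(), the shared references are materialized and the duplication becomes visible.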

Regards
Sab
On 27-Feb-2016 9:11 am, "Yan Yang" <yan@wealthfront.com> wrote:

> Hi
>
> I am pretty new to Spark, and after experimenting with our pipelines, I
> ran into this weird issue.
>
> The Scala code is as below:
>
> val input = sc.newAPIHadoopRDD(...)
> val rdd = input.map(...)
> rdd.cache()
> rdd.saveAsTextFile(...)
>
> I found rdd to consist of 80K+ identical rows: the row count is correct,
> but every row has the same content.
>
> The truly weird part is if I remove rdd.cache(), everything works just
> fine. I have encountered this issue on a few occasions.
>
> Thanks
> Yan
