spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Distributed dictionary building
Date Sat, 20 Sep 2014 20:44:16 GMT
From an offline question: zipWithIndex is being used to assign IDs. From a
recent JIRA discussion I understand that the assignment is not deterministic
within a partition, so an element's index can differ when the RDD is
reevaluated. If you need the IDs fixed, persist the zipped RDD on disk or in
memory.
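
A minimal sketch of that suggestion, assuming the dictionary is built from an RDD[String] of product names. The object name DictionarySketch, the helper buildDictionary, and the choice of MEMORY_AND_DISK are illustrative assumptions, not taken from the original code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object DictionarySketch {
  // Hypothetical helper: assigns a Long ID to each distinct product and
  // persists the zipped RDD so later actions reuse the same assignment
  // instead of re-running zipWithIndex.
  def buildDictionary(products: RDD[String]): RDD[(String, Long)] = {
    val dictionary = products.distinct().zipWithIndex()

    // Assumption: MEMORY_AND_DISK, so partitions that do not fit in memory
    // are written to disk rather than dropped and recomputed.
    dictionary.persist(StorageLevel.MEMORY_AND_DISK)

    // persist() is lazy; run one action so the IDs are materialized
    // before any lookup or save is done against the dictionary.
    dictionary.count()

    dictionary
  }
}
```

Both the filter lookup and the saveAsTextFile call would then run against the RDD returned here. Persisting is not an absolute guarantee: if a persisted partition is lost (for example, when an executor dies), Spark recomputes it from lineage and the IDs in that partition may change.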
On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.das83@gmail.com> wrote:

> Hi,
>
> I am building a dictionary of RDD[(String, Long)] and after the dictionary
> is built and cached, I find key "almonds" at value 5187 using:
>
> rdd.filter{case(product, index) => product == "almonds"}.collect
>
> Output:
>
> Debug product almonds index 5187
>
> Now I take the same dictionary and write it out as:
>
> dictionary.map{case(product, index) => product + "," + index}
> .saveAsTextFile(outputPath)
>
> Inside the map I also print the product at index 5187, and I get a
> different product:
>
> Debug Index 5187 userOrProduct cardigans
>
> Is this expected behavior from map?
>
> By the way, "almonds" and "apparel-cardigans" are only one apart in the
> index.
>
> I am using spark-1.1, but it's a snapshot build.
>
> Thanks.
> Deb
>
>
>
