spark-user mailing list archives

From Debasish Das <debasish.da...@gmail.com>
Subject Re: Distributed dictionary building
Date Sun, 21 Sep 2014 01:13:40 GMT
Some more debugging revealed that, as Sean said, I have to keep the
dictionaries persisted until I am done with the RDD manipulation.
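
A minimal sketch of what "keep the dictionary persisted" could look like, assuming a local SparkContext and a tiny made-up product list. The names PersistedDictionary and buildDictionary are hypothetical, and MEMORY_AND_DISK is only one possible storage level. The point is that the lookup and the write both run against the same persisted RDD, and unpersist is called only after the last use:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistedDictionary {
  // Assign ids once and persist the zipped RDD, so that later actions reuse
  // the stored (product, index) pairs instead of re-running zipWithIndex.
  def buildDictionary(products: RDD[String]): RDD[(String, Long)] =
    products.distinct().zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK)

  def main(args: Array[String]): Unit = {
    // Local master and sample data are assumptions for illustration only.
    val conf = new SparkConf().setAppName("dictionary-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val products = sc.parallelize(Seq("almonds", "cardigans", "almonds", "walnuts"), 2)
      val dictionary = buildDictionary(products)

      // First action: look up one key.
      val almonds = dictionary.filter { case (product, _) => product == "almonds" }.collect()

      // Second action on the same persisted RDD: format every entry for output.
      val lines = dictionary.map { case (product, index) => product + "," + index }.collect()

      // The id seen by the lookup should match the id in the formatted line.
      almonds.foreach { case (product, index) => assert(lines.contains(product + "," + index)) }

      // Release the cached data only after every dependent action has run.
      dictionary.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

In a real job the second action would be the saveAsTextFile call from the original message rather than a collect.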

Thanks, Sean, for the pointer. Would it be possible to point me to the JIRA
as well?

Are there plans to make this more transparent for users?

Is it possible for the DAG to speculate about such things, similar to branch
prediction ideas from computer architecture?



On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.das83@gmail.com>
wrote:

> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
>
> What's the difference between zipWithIndex and zipWithUniqueId?
>
> For zipWithUniqueId we don't need to run the count to compute the offset,
> which is needed for zipWithIndex, so is zipWithUniqueId more efficient? It's
> not very clear from the docs...
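
On how the two differ: as far as the Spark API docs describe it, zipWithIndex gives consecutive indices and has to run a job to learn the partition sizes when there is more than one partition. zipWithUniqueId gives the items of partition k the ids k, n+k, 2n+k, ... (n being the number of partitions) and needs no such job. Below is a plain-Scala sketch of the two numbering schemes, not Spark code. ZipIdSketch and its helper names are hypothetical, and each inner Seq stands in for one partition:

```scala
object ZipIdSketch {
  // Each inner Seq stands in for one partition of an RDD[String].
  type Partitions = Seq[Seq[String]]

  // zipWithIndex-style ids: consecutive across partitions. Computing the
  // starting offset of a partition requires the sizes of all earlier ones.
  def withIndex(parts: Partitions): Seq[(String, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zip(offsets).flatMap { case (part, offset) =>
      part.zipWithIndex.map { case (item, i) => (item, offset + i) }
    }
  }

  // zipWithUniqueId-style ids: the i-th item of partition k gets i * n + k,
  // where n is the number of partitions. No sizes are needed, but the ids
  // may have gaps when partitions are uneven.
  def withUniqueId(parts: Partitions): Seq[(String, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }
}
```

For partitions (a, b, c), (d), (e, f), withIndex yields ids 0 through 5 in order, while withUniqueId yields a=0, b=3, c=6, d=1, e=2, f=5, leaving 4 unused. In both schemes the id depends on an item's position inside its partition, so neither stays stable if that order changes when the RDD is recomputed.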
>
>
> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.das83@gmail.com>
> wrote:
>
>> I did not persist or cache it, as I assumed zipWithIndex would preserve
>> order...
>>
>> There is also zipWithUniqueId...I am trying that...If that also shows the
>> same issue, we should make it clear in the docs...
>>
>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> From offline question - zipWithIndex is being used to assign IDs. From a
>>> recent JIRA discussion I understand this is not deterministic within a
>>> partition so the index can be different when the RDD is reevaluated. If you
>>> need it fixed, persist the zipped RDD on disk or in memory.
>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.das83@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am building a dictionary of RDD[(String, Long)] and after the
>>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
>>>>
>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
>>>>
>>>> Output:
>>>>
>>>> Debug product almonds index 5187
>>>>
>>>> Now I take the same dictionary and write it out as:
>>>>
>>>> dictionary.map{case(product, index) => product + "," + index}
>>>> .saveAsTextFile(outputPath)
>>>>
>>>> Inside the map I also print which product is at index 5187, and I get
>>>> a different product:
>>>>
>>>> Debug Index 5187 userOrProduct cardigans
>>>>
>>>> Is this expected behavior from map?
>>>>
>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
>>>> index...
>>>>
>>>> I am using spark-1.1 but it's a snapshot..
>>>>
>>>> Thanks.
>>>> Deb
>>>>
>>>>
>>>>
>>
>
