spark-user mailing list archives

From Nan Zhu <zhunanmcg...@gmail.com>
Subject Re: Distributed dictionary building
Date Tue, 23 Sep 2014 14:10:12 GMT
Shall we document this in the API docs?

Best, 

-- 
Nan Zhu


On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:

> zipWithUniqueId is also affected...
> 
> I had to persist the dictionaries to make use of the indices lower down in the flow...
> 
> On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <sowen@cloudera.com> wrote:
> > Reference - https://issues.apache.org/jira/browse/SPARK-3098
> > I imagine zipWithUniqueId is also affected, but the problem may not have
> > shown up in your test.
> > 
> > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.das83@gmail.com> wrote:
> > > Some more debugging revealed that, as Sean said, I have to keep the
> > > dictionaries persisted till I am done with the RDD manipulation...
> > >
> > > Thanks Sean for the pointer...would it be possible to point me to the JIRA
> > > as well ?
> > >
> > > Are there plans to make it more transparent for the users ?
> > >
> > > Is it possible for the DAG to speculate such things...similar to branch
> > > prediction ideas from comp arch...
> > >
> > >
> > >
> > > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.das83@gmail.com> wrote:
> > >>
> > >> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
> > >>
> > >> What's the difference between zipWithIndex vs zipWithUniqueId ?
> > >>
> > >> For zipWithIndex we don't need to run the count to compute the offset
> > >> which is needed for zipWithUniqueId and so zipWithIndex is efficient ? It's
> > >> not very clear from docs...
> > >>
> > >>
> > >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.das83@gmail.com> wrote:
> > >>>
> > >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> > >>> order...
> > >>>
> > >>> There is also zipWithUniqueId...I am trying that...If that also shows the
> > >>> same issue, we should make it clear in the docs...
> > >>>
> > >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <sowen@cloudera.com> wrote:
> > >>>>
> > >>>> From offline question - zipWithIndex is being used to assign IDs. From a
> > >>>> recent JIRA discussion I understand this is not deterministic within a
> > >>>> partition so the index can be different when the RDD is reevaluated. If you
> > >>>> need it fixed, persist the zipped RDD on disk or in memory.
> > >>>>
> > >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.das83@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> > >>>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
> > >>>>>
> > >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> > >>>>>
> > >>>>> Output:
> > >>>>>
> > >>>>> Debug product almonds index 5187
> > >>>>>
> > >>>>> Now I take the same dictionary and write it out as:
> > >>>>>
> > >>>>> dictionary.map{case(product, index) => product + "," + index}
> > >>>>> .saveAsTextFile(outputPath)
> > >>>>>
> > >>>>> Inside the map I also print what's the product at index 5187 and I get
> > >>>>> a different product:
> > >>>>>
> > >>>>> Debug Index 5187 userOrProduct cardigans
> > >>>>>
> > >>>>> Is this an expected behavior from map ?
> > >>>>>
> > >>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
> > >>>>> index...
> > >>>>>
> > >>>>> I am using spark-1.1 but it's a snapshot..
> > >>>>>
> > >>>>> Thanks.
> > >>>>> Deb
> > >>>>>
> > >>>>>
> > >>>
> > >>
> > >
> 
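
On the zipWithIndex vs zipWithUniqueId question quoted above: going by the RDD
API docs, zipWithIndex assigns consecutive indices and therefore needs the size
of every earlier partition (which is why it runs a job when the RDD has more
than one partition), while zipWithUniqueId gives item i of partition k the id
i * n + k for n partitions, with no counting but possible gaps. The sketch below
models both schemes without Spark. It is an illustration only, not Spark's
implementation: partitions are plain Vectors, and the names ZipIdSketch,
zipWithIndexSketch and zipWithUniqueIdSketch are made up here.

```scala
object ZipIdSketch {
  // Assumption: each partition is modelled as an in-memory Vector; no Spark involved.
  type Partitions[T] = Vector[Vector[T]]

  // zipWithIndex-style numbering: consecutive indices across partitions.
  // The start offset of partition k is the total size of partitions 0..k-1,
  // so all earlier partition sizes must be known first.
  def zipWithIndexSketch[T](parts: Partitions[T]): Partitions[(T, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, offsets(k) + i) }
    }
  }

  // zipWithUniqueId-style numbering: item i of partition k gets i * n + k,
  // where n is the number of partitions. No sizes are needed, gaps may appear.
  def zipWithUniqueIdSketch[T](parts: Partitions[T]): Partitions[(T, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i.toLong * n + k) }
    }
  }
}
```

With partitions Vector(Vector("almonds", "cardigans"), Vector("socks")), the
first scheme yields ids 0, 1, 2 and the second yields 0, 2, 1. Swapping the
first two items swaps their ids under both schemes, so either method gives
different ids if the order within a partition changes on re-evaluation.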

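The workaround described in the thread (persist the zipped RDD before reusing
its indices) might look roughly like the sketch below. The input data, the
local[2] master, the MEMORY_AND_DISK storage level and the names
DictionarySketch and buildDictionary are assumptions for illustration and are
not taken from the original job. As far as I understand, persisting only avoids
recomputation while the stored partitions are available; if they are lost and
recomputed, the indices could change again.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object DictionarySketch {
  // Hypothetical helper: index every distinct product and pin the result.
  def buildDictionary(products: RDD[String]): RDD[(String, Long)] = {
    val dictionary = products
      .distinct()       // shuffle: order within a partition is not guaranteed
      .zipWithIndex()   // indices depend on that order
      .persist(StorageLevel.MEMORY_AND_DISK)
    dictionary.count()  // force evaluation so later actions read the stored pairs
    dictionary
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("dictionary-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val products = sc.parallelize(Seq("almonds", "cardigans", "socks", "almonds"), 2)
      val dictionary = buildDictionary(products)

      // Both actions below reuse the persisted (product, index) pairs.
      val almonds = dictionary.filter { case (product, _) => product == "almonds" }.collect()
      val lines = dictionary.map { case (product, index) => product + "," + index }.collect()

      almonds.foreach { case (product, index) => println(s"$product -> $index") }
      lines.foreach(println)
      dictionary.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

The count() call is there only to materialize the persisted RDD before any
lookup or save runs against it.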
