spark-user mailing list archives

From Nan Zhu <zhunanmcg...@gmail.com>
Subject Re: Distributed dictionary building
Date Tue, 23 Sep 2014 14:14:15 GMT
great, thanks 

-- 
Nan Zhu


On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote:

> Yes, Matei made a JIRA last week and I just suggested a PR:
> https://github.com/apache/spark/pull/2508 
> On Sep 23, 2014 2:55 PM, "Nan Zhu" <zhunanmcgill@gmail.com> wrote:
> > shall we document this in the API doc? 
> > 
> > Best, 
> > 
> > -- 
> > Nan Zhu
> > 
> > 
> > On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
> > 
> > > zipWithUniqueId is also affected...
> > > 
> > > I had to persist the dictionaries to make use of the indices lower down in the flow...
> > > 
> > > On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <sowen@cloudera.com> wrote:
> > > > Reference - https://issues.apache.org/jira/browse/SPARK-3098
> > > > I imagine zipWithUniqueId is also affected, but the problem may just not
> > > > have shown up in your test.
> > > > 
> > > > > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.das83@gmail.com> wrote:
> > > > > > Some more debugging revealed that, as Sean said, I have to keep the dictionaries
> > > > > > persisted until I am done with the RDD manipulation...
> > > > > >
> > > > > > Thanks Sean for the pointer... would it be possible to point me to the JIRA
> > > > > > as well?
> > > > > >
> > > > > > Are there plans to make it more transparent for the users?
> > > > > >
> > > > > > Is it possible for the DAG to speculate on such things, similar to the branch
> > > > > > prediction ideas from computer architecture?
> > > > >
> > > > >
> > > > >
> > > > > > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.das83@gmail.com> wrote:
> > > > > >>
> > > > > >> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
> > > > > >>
> > > > > >> What's the difference between zipWithIndex and zipWithUniqueId?
> > > > > >>
> > > > > >> Is it that zipWithIndex doesn't need to run the count to compute the offset,
> > > > > >> which zipWithUniqueId needs, and so zipWithIndex is more efficient? It's
> > > > > >> not very clear from the docs...
> > > > >>
> > > > >>
> > > > > >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.das83@gmail.com> wrote:
> > > > > >>>
> > > > > >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> > > > > >>> order...
> > > > > >>>
> > > > > >>> There is also zipWithUniqueId...I am trying that...If that also shows the
> > > > > >>> same issue, we should make it clear in the docs...
> > > > >>>
> > > > > >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <sowen@cloudera.com> wrote:
> > > > > >>>>
> > > > > >>>> From offline question - zipWithIndex is being used to assign IDs. From a
> > > > > >>>> recent JIRA discussion I understand this is not deterministic within a
> > > > > >>>> partition so the index can be different when the RDD is reevaluated. If you
> > > > > >>>> need it fixed, persist the zipped RDD on disk or in memory.
> > > > >>>>
> > > > > >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.das83@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > > >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> > > > > >>>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
> > > > > >>>>>
> > > > > >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> > > > >>>>>
> > > > >>>>> Output:
> > > > >>>>>
> > > > >>>>> Debug product almonds index 5187
> > > > >>>>>
> > > > >>>>> Now I take the same dictionary and write it out as:
> > > > >>>>>
> > > > > >>>>> dictionary.map{case(product, index) => product + "," + index}
> > > > > >>>>> .saveAsTextFile(outputPath)
> > > > > >>>>>
> > > > > >>>>> Inside the map I also print what's the product at index 5187 and I get
> > > > > >>>>> a different product:
> > > > >>>>>
> > > > >>>>> Debug Index 5187 userOrProduct cardigans
> > > > >>>>>
> > > > > >>>>> Is this expected behavior from map?
> > > > >>>>>
> > > > > >>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
> > > > > >>>>> index...
> > > > >>>>>
> > > > >>>>> I am using spark-1.1 but it's a snapshot..
> > > > >>>>>
> > > > >>>>> Thanks.
> > > > >>>>> Deb
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > > >
> > > 
> > 
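
On the zipWithIndex vs zipWithUniqueId question quoted above: as far as the RDD API docs describe it, it is zipWithIndex that needs the per-partition counts (it triggers a Spark job when the RDD has more than one partition), while zipWithUniqueId gives items in the k-th of n partitions the ids k, n+k, 2n+k, ... without counting anything. Below is a rough sketch of both schemes on plain Scala collections, not Spark itself. The object name, the helper names and the sample products are made up for illustration. It also shows why both are sensitive to the order of items inside a partition, which is the effect discussed in this thread.

```scala
// Plain-Scala model of the two ID schemes; no Spark dependency.
// `Partitions` stands in for the partitions of an RDD (hypothetical names).
object ZipIdSketch {
  type Partitions = Vector[Vector[String]]

  // zipWithIndex style: partition k starts at the combined size of
  // partitions 0..k-1, so the partition sizes must be known up front.
  def withIndex(partitions: Partitions): Vector[(String, Long)] = {
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, offsets(k) + i) }
    }
  }

  // zipWithUniqueId style: item i of partition k gets i * n + k,
  // where n is the number of partitions. No sizes are needed.
  def withUniqueId(partitions: Partitions): Vector[(String, Long)] = {
    val n = partitions.size.toLong
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }
}
```

If a recomputation yields the first partition as ("cardigans", "almonds") instead of ("almonds", "cardigans"), "almonds" gets a different id under either scheme. That matches Sean's advice: persist the zipped RDD (for example with persist() or cache()) so that later stages read the ids that were assigned once rather than recomputing them.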
