spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Distributed dictionary building
Date Tue, 23 Sep 2014 13:58:46 GMT
Yes, Matei made a JIRA last week and I just suggested a PR:
https://github.com/apache/spark/pull/2508
On Sep 23, 2014 2:55 PM, "Nan Zhu" <zhunanmcgill@gmail.com> wrote:

> Shall we document this in the API doc?
>
> Best,
>
> --
> Nan Zhu
>
> On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
>
> zipWithUniqueId is also affected...
>
> I had to persist the dictionaries to make use of the indices lower down in
> the flow...
>
> On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <sowen@cloudera.com> wrote:
>
> Reference - https://issues.apache.org/jira/browse/SPARK-3098
> I imagine zipWithUniqueId is also affected, but it may just not have
> shown up in your test.
>
> On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.das83@gmail.com>
> wrote:
> > Some more debugging revealed that, as Sean said, I have to keep the
> > dictionaries persisted till I am done with the RDD manipulation...
> >
> > Thanks Sean for the pointer...would it be possible to point me to the
> > JIRA as well?
> >
> > Are there plans to make it more transparent for the users ?
> >
> > Is it possible for the DAG to speculate such things...similar to branch
> > prediction ideas from comp arch...
> >
> >
> >
> > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.das83@gmail.com>
> > wrote:
> >>
> >> I changed zipWithIndex to zipWithUniqueId and that seems to be
> >> working...
> >>
> >> What's the difference between zipWithIndex vs zipWithUniqueId ?
> >>
> >> For zipWithIndex we don't need to run the count to compute the offset
> >> which is needed for zipWithUniqueId and so zipWithIndex is efficient?
> >> It's not very clear from the docs...
> >>
> >>
> >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das
> >> <debasish.das83@gmail.com> wrote:
> >>>
> >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> >>> order...
> >>>
> >>> There is also zipWithUniqueId...I am trying that...If that also shows
> >>> the same issue, we should make it clear in the docs...
> >>>
> >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <sowen@cloudera.com> wrote:
> >>>>
> >>>> From offline question - zipWithIndex is being used to assign IDs.
> >>>> From a recent JIRA discussion I understand this is not deterministic
> >>>> within a partition so the index can be different when the RDD is
> >>>> reevaluated. If you need it fixed, persist the zipped RDD on disk or
> >>>> in memory.
> >>>>
> >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.das83@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> >>>>> dictionary is built and cached, I find key "almonds" at value 5187
> >>>>> using:
> >>>>>
> >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> >>>>>
> >>>>> Output:
> >>>>>
> >>>>> Debug product almonds index 5187
> >>>>>
> >>>>> Now I take the same dictionary and write it out as:
> >>>>>
> >>>>> dictionary.map{case(product, index) => product + "," + index}
> >>>>> .saveAsTextFile(outputPath)
> >>>>>
> >>>>> Inside the map I also print what's the product at index 5187 and I
> >>>>> get a different product:
> >>>>>
> >>>>> Debug Index 5187 userOrProduct cardigans
> >>>>>
> >>>>> Is this an expected behavior from map ?
> >>>>>
> >>>>> By the way "almonds" and "apparel-cardigans" are just one off in
> >>>>> the index...
> >>>>>
> >>>>> I am using spark-1.1 but it's a snapshot..
> >>>>>
> >>>>> Thanks.
> >>>>> Deb
> >>>>>
> >>>>>
> >>>
> >>
> >
>
>
>
>
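
On the zipWithIndex vs zipWithUniqueId question raised in the thread: as the Spark API docs describe it, zipWithIndex assigns consecutive indices and needs the size of every earlier partition first (which triggers a Spark job when there is more than one partition), while zipWithUniqueId gives items in partition k the ids k, n+k, 2n+k, ... for n partitions and does not need that pass. The sketch below only imitates the two numbering schemes on plain Scala collections, no Spark involved; the object and function names are made up for illustration.

```scala
object ZipIdSchemes {
  // Each inner Seq stands in for one partition of an RDD (plain collections, no Spark).
  type Partitions[T] = Seq[Seq[T]]

  // zipWithIndex-style numbering: consecutive indices across partitions.
  // The start offset of a partition is the total size of all earlier partitions.
  def withIndex[T](parts: Partitions[T]): Partitions[(T, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zip(offsets).map { case (part, start) =>
      part.zipWithIndex.map { case (item, i) => (item, start + i) }
    }
  }

  // zipWithUniqueId-style numbering: item i of partition k gets i * n + k,
  // where n is the number of partitions. No partition sizes are needed.
  def withUniqueId[T](parts: Partitions[T]): Partitions[(T, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: two partitions of product names.
    val parts: Partitions[String] = Seq(Seq("almonds", "cardigans"), Seq("socks"))
    println(withIndex(parts).flatten)
    println(withUniqueId(parts).flatten)
  }
}
```

With this sample, the index scheme numbers almonds, cardigans, socks as 0, 1, 2, while the unique-id scheme gives 0, 2, 1. In both schemes the id depends on an item's position inside its partition, so either one can change if the partition contents are recomputed in a different order.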

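
The workaround discussed in the thread (persist the zipped RDD before reusing its indices) might look roughly like the sketch below. It assumes Spark is on the classpath; the object name, the buildDictionary helper, the sample data and the local master setting are all hypothetical, not taken from the original job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistDictionary {
  // Hypothetical helper: assign an index to each distinct product and persist the result.
  def buildDictionary(products: RDD[String]): RDD[(String, Long)] = {
    // distinct() shuffles, so the order of elements inside a partition
    // is not guaranteed to be the same if the RDD is evaluated again.
    val dictionary = products.distinct().zipWithIndex()
      .persist(StorageLevel.MEMORY_AND_DISK)
    // Materialize once so later actions can read the stored partitions
    // (as long as they remain available) instead of recomputing them.
    dictionary.count()
    dictionary
  }

  def main(args: Array[String]): Unit = {
    // Local master and sample data are placeholders for illustration only.
    val conf = new SparkConf().setAppName("persist-dictionary").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val products = sc.parallelize(Seq("almonds", "cardigans", "socks", "almonds"), 2)
      val dictionary = buildDictionary(products)

      // Both uses below read the same persisted dictionary.
      val almonds = dictionary.filter { case (product, _) => product == "almonds" }.collect()
      val lines = dictionary.map { case (product, index) => s"$product,$index" }.collect()

      almonds.foreach { case (product, index) => println(s"Debug product $product index $index") }
      lines.foreach(println)

      dictionary.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Persisting does not make the assignment itself deterministic; it only avoids re-running it. If persisted partitions are lost (for example with a failed executor), Spark recomputes them and the indices may differ again, so writing the dictionary out once and reading it back is the more conservative option when the ids must stay fixed.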