mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: ItemSimilarityJob
Date Tue, 05 Jun 2012 07:30:31 GMT
It's nothing to do with Hadoop. Hadoop happily partitions and merges without
ordering. No, the input does not need to be sorted.

I was really referring to the fact that some similarity functions need to
perform an intersection and this is much faster when keys are ordered.
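
To illustrate why ordered keys help, here is a minimal sketch of intersecting two ID lists that are already sorted ascending. The class and method names are hypothetical and this is not Mahout's own code; it only shows the general merge-style technique, which walks each list once instead of searching one list for every element of the other.

```java
public class SortedIntersection {

    // Counts the IDs that appear in both arrays.
    // Assumption: both arrays are sorted ascending and contain no duplicates.
    static int intersectionSize(long[] a, long[] b) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;            // a[i] cannot occur later in b, skip it
            } else if (a[i] > b[j]) {
                j++;            // b[j] cannot occur later in a, skip it
            } else {
                count++;        // same ID in both lists
                i++;
                j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] x = {1L, 4L, 7L, 9L};
        long[] y = {2L, 4L, 9L, 12L};
        System.out.println(intersectionSize(x, y)); // prints 2
    }
}
```

With unordered keys the same job needs a hash set or a nested scan, which costs extra memory or extra comparisons.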

Moot point now. All keys are integers.
Sean
 On Jun 5, 2012 12:21 AM, "Lance Norskog" <goksron@gmail.com> wrote:

> It needs a complete "ordering", meaning code that takes any two values
> and says "this one before that one". This lets Hadoop do global
> sorting. If they're strings you would sort on the strings.
>
> On Mon, Jun 4, 2012 at 4:00 PM, Something Something
> <mailinglists19@gmail.com> wrote:
> > Fair enough.  Just one more question:
> >
> > 1)  >>it just needs to have an ordering
> > The input data doesn't need to be in any particular sequence, correct?  Not
> > sure what you mean by 'needs to have an ordering'.
> >
> >
> > On Mon, Jun 4, 2012 at 3:29 PM, Sean Owen <srowen@gmail.com> wrote:
> >
> >> That's how it used to work but it was restricted to integers a long time
> >> ago purely for speed and memory. It makes a big difference. Many (most?)
> >> use cases have some numeric ID for these guys already.  Otherwise there is
> >> no reason it needs to be an integer; it just needs to have an ordering.
> >>
> >> You can retain the mapping how you like. All you really need are the
> >> original ID values to recreate the mapping, as it is just based on MD5. So a
> >> file is sufficient, for example. But to do the mapping on the fly it has to
> >> be in memory, yes, or else it is too slow.
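
A rough sketch of the MD5-based mapping described above, assuming the long ID is taken from the first 8 bytes of the digest. The class and method names here are hypothetical, and the byte order is an assumption for illustration; check the MemoryIDMigrator source for the exact scheme before relying on matching values.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class StringIdHasher {

    // Derives a long ID from a string ID by hashing with MD5 and packing
    // the first 8 digest bytes into a long (assumed layout, big-endian).
    static long toLongID(String stringID) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(stringID.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                hash = (hash << 8) | (digest[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The hash is one-way, so the reverse mapping has to be stored
        // somewhere: in memory for on-the-fly lookups, or rebuilt from a
        // file of the original IDs.
        Map<Long, String> longToString = new HashMap<>();
        for (String id : new String[] {"alice", "bob"}) {
            longToString.put(toLongID(id), id);
        }
        System.out.println(longToString.get(toLongID("alice")));
    }
}
```

Because the forward direction is a pure function of the string, only the list of original IDs needs to be kept to rebuild the reverse map later.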
> >>
> >> Best is to find a numeric ID to use in your model if you can.
> >>
> >> Myrrix works this way too, if desired, but almost as a feature as the
> >> 'real' IDs need never be sent into the hosted recommender in the cloud,
> >> just a hashed numeric ID. That's nice from a security or privacy
> >> standpoint.
> >>  On Jun 4, 2012 11:05 PM, "Something Something" <mailinglists19@gmail.com>
> >> wrote:
> >>
> >> > Hmm.. that's a bit weird.  Looking at the algorithm, I don't understand why
> >> > UserID has to be Long.  It's just an Identifier of a row, isn't it?  The
> >> > algorithm really only works with Item IDs and even with ItemIDs I would
> >> > argue they don't need to be Numeric.  Am I missing something?
> >> >
> >> > We have over a billion user IDs.  So for each ID I need to create a
> >> > corresponding 'long' value in memory?  Is that what this class is doing?
> >> >
> >> > On Mon, Jun 4, 2012 at 2:50 PM, Manuel Blechschmidt <
> >> > Manuel.Blechschmidt@gmx.de> wrote:
> >> >
> >> > > Hi Something,
> >> > > actually this is correct.
> >> > >
> >> > > You can use the MemoryIDMigrator
> >> > >
> >> > > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/model/MemoryIDMigrator.html
> >> > > to create Longs from your strings.
> >> > >
> >> > > /Manuel
> >> > >
> >> > > On 04.06.2012, at 23:47, Something Something wrote:
> >> > >
> >> > > > Trying to use this class.  Noticed that 'UserID' must be Long.  That
> >> > > > doesn't sound right.  Isn't there a way to tell this class that the
> >> > > > 'UserID' is String?  Please let me know.  Thanks.
> >> > >
> >> > > --
> >> > > Manuel Blechschmidt
> >> > > M.Sc. IT Systems Engineering
> >> > > Dortustr. 57
> >> > > 14467 Potsdam
> >> > > Mobil: 0173/6322621
> >> > > Twitter: http://twitter.com/Manuel_B
> >> > >
> >> > >
> >> >
> >>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
