mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: ItemSimilarityJob
Date Thu, 12 Aug 2010 20:06:14 GMT
LSH and minhash clustering both find cases of very large overlaps.  We need
more subtle overlaps to be retained.

Thus, my inclination, which is as yet unsupported by any evidence other
than what has worked for me in the past, would be to optimize the system we
have.

On Thu, Aug 12, 2010 at 12:45 PM, alex lin <alexlin.us@gmail.com> wrote:

> Hi,
>
> The paper below seems interesting:
> SIGIR 2010 http://www.lsdsir.org/wp-content/uploads/2010/05/lsdsir10-2.pdf
> which uses LSH + Hadoop
>
> Alex
>
> On Thu, Aug 12, 2010 at 3:32 PM, Gökhan Çapan <gkhncpn@gmail.com> wrote:
>
> > Hi Sebastian,
> >
> > I remember the birth of this "co-occurrence based distributed
> > recommendation" idea, I was there :). That's why I know the general
> > concepts behind the idea. However, I haven't looked at the Mahout code
> > lately, and I don't know the details of how the co-occurrence
> > computation is implemented. Choosing either the pairs or the stripes
> > approach is a design choice when implementing a co-occurrence
> > computation (and other similar ones). Although the two differ only
> > slightly in implementation, the size of the intermediate data and the
> > time complexity of the algorithms are very different. If the problem is
> > the size of the intermediate data produced while computing the
> > co-occurrence matrix, it's worth trying the "other" approach.
> >
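For concreteness, here is a minimal sketch of the "stripes" pattern with
in-mapper combining that Gökhan describes. It is not Mahout's actual code;
the input format (one record per user whose value holds a comma-separated
list of item IDs) and the class name are assumptions made purely for
illustration.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One input record per user: key = user ID, value = comma-separated item IDs.
    public class StripesCooccurrenceMapper
        extends Mapper<LongWritable, Text, LongWritable, MapWritable> {

      // In-mapper combining: accumulate one partial stripe per item across all
      // records seen by this mapper, and emit the stripes once in cleanup().
      private final Map<Long, MapWritable> stripes = new HashMap<Long, MapWritable>();

      @Override
      protected void map(LongWritable userID, Text itemList, Context ctx) {
        String[] tokens = itemList.toString().split(",");
        long[] items = new long[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
          items[i] = Long.parseLong(tokens[i].trim());
        }
        for (long a : items) {
          MapWritable stripe = stripes.get(a);
          if (stripe == null) {
            stripe = new MapWritable();
            stripes.put(a, stripe);
          }
          for (long b : items) {
            if (a == b) {
              continue;
            }
            LongWritable other = new LongWritable(b);
            IntWritable count = (IntWritable) stripe.get(other);
            stripe.put(other, new IntWritable(count == null ? 1 : count.get() + 1));
          }
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit each item's partial co-occurrence row as a single stripe.
        for (Map.Entry<Long, MapWritable> e : stripes.entrySet()) {
          ctx.write(new LongWritable(e.getKey()), e.getValue());
        }
      }
    }

A matching reducer would sum the stripes element-wise, so each item's
co-occurrence row is assembled from a handful of partial maps rather than
from one record per item pair, which is where the savings in intermediate
data come from.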
> > In addition, MinHashing may be another option to produce all candidate
> > similar pairs according to Jaccard similarity without examining all
> > pairs. It can be implemented approximately using MapReduce. I have
> > actually implemented one, to detect near-duplicate documents in a large
> > document set as well as to find similar items according to users'
> > binary ratings, and it yields very good results in terms of both
> > accuracy and performance. This method is also part of Google's news
> > recommendation engine.
> >
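For readers who have not met MinHashing before, a small sketch of how a
per-item signature could be computed from the set of users who rated the
item; the hash family and class name are illustrative and not taken from
Gökhan's implementation or from Mahout.

    import java.util.Arrays;
    import java.util.Collection;
    import java.util.Random;

    // Computes a MinHash signature for an item from the user IDs who rated it.
    // Items that agree on many signature positions are likely to have a high
    // Jaccard similarity, so they become candidate pairs without all-pairs work.
    public class MinHashSignature {

      private static final long PRIME = 2147483647L; // 2^31 - 1

      private final long[] seedsA;
      private final long[] seedsB;

      public MinHashSignature(int numHashes, long randomSeed) {
        Random rng = new Random(randomSeed);
        seedsA = new long[numHashes];
        seedsB = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
          seedsA[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1);
          seedsB[i] = rng.nextInt(Integer.MAX_VALUE);
        }
      }

      public long[] signature(Collection<Long> userIDs) {
        long[] mins = new long[seedsA.length];
        Arrays.fill(mins, Long.MAX_VALUE);
        for (long user : userIDs) {
          for (int i = 0; i < seedsA.length; i++) {
            // Keep the minimum of a simple linear hash over all users of the item.
            long h = Math.floorMod(seedsA[i] * user + seedsB[i], PRIME);
            if (h < mins[i]) {
              mins[i] = h;
            }
          }
        }
        return mins;
      }
    }

In a MapReduce setting a mapper can then emit (signature band, item ID)
pairs and a reducer only compares items that collide in at least one band;
the number of hashes and bands controls the precision/recall trade-off.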
> > On Thu, Aug 12, 2010 at 9:11 PM, Sebastian Schelter <
> > ssc.open@googlemail.com> wrote:
> >
> > > Hi Gökhan,
> > >
> > > I had a quick look at the paper, and the "stripes" approach looks
> > > interesting, though I'm not sure whether we can apply it to the
> > > RowSimilarityJob.
> > >
> > > I think when we want to improve the performance of the
> > > ItemSimilarityJob, we should go down another path that follows some
> > > thoughts by Ted Dunning. IIRC he said that you don't really learn
> > > anything new about an item after seeing a certain number of
> > > preferences, and thus it should be sufficient to look at a fixed
> > > maximum number of them per item. RecommenderJob already limits the
> > > number of preferences considered per item with MaybePruneRowsMapper,
> > > and ItemSimilarityJob should adopt that option too.
> > >
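The capping Sebastian describes can be illustrated with a few lines of
reservoir sampling. This is only a sketch of the idea, not the actual
MaybePruneRowsMapper; the class and parameter names are made up.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Keeps a uniform random subset of at most maxPrefsPerItem preferences,
    // so wildly popular items no longer dominate the pairwise computation.
    public class PreferenceCapper {

      public static <T> List<T> cap(Iterable<T> prefsForItem, int maxPrefsPerItem, Random rng) {
        List<T> reservoir = new ArrayList<T>(maxPrefsPerItem);
        int seen = 0;
        for (T pref : prefsForItem) {
          seen++;
          if (reservoir.size() < maxPrefsPerItem) {
            reservoir.add(pref);
          } else {
            // Replace an existing entry with probability maxPrefsPerItem / seen.
            int slot = rng.nextInt(seen);
            if (slot < maxPrefsPerItem) {
              reservoir.set(slot, pref);
            }
          }
        }
        return reservoir;
      }
    }

With such a cap, the number of pairs a single item can generate is bounded
by maxPrefsPerItem squared, no matter how popular the item is.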
> > > The algorithm is based on the paper "Elsayed et al.: Pairwise Document
> > > Similarity in Large Collections with MapReduce"
> > > (http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf),
> > > which seems to use the "pairs" approach described in your paper, as
> > > far as I can judge. IIRC the authors of that paper remove frequently
> > > occurring words to achieve a linear runtime, which would correspond to
> > > the "maxNumberOfPreferencesPerItem" approach that MAHOUT-460 suggests.
> > >
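For comparison with the "stripes" sketch above, a rough sketch of the
"pairs" step from the Elsayed et al. paper; the posting value format
("docId:weight") is invented for illustration, and this is not the
RowSimilarityJob code.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // The key is a term (or item), the values are its postings (document ID and
    // weight). Every pair of postings contributes one partial similarity, keyed
    // by the document pair; a follow-up job sums the partials. The number of
    // emitted pairs is quadratic in the posting-list length, which is exactly
    // why very frequent terms / very popular items hurt so much.
    public class PairwisePartialScoreReducer
        extends Reducer<Text, Text, Text, DoubleWritable> {

      @Override
      protected void reduce(Text term, Iterable<Text> postings, Context ctx)
          throws IOException, InterruptedException {
        List<String> docs = new ArrayList<String>();
        List<Double> weights = new ArrayList<Double>();
        for (Text posting : postings) {
          String[] parts = posting.toString().split(":");
          docs.add(parts[0]);
          weights.add(Double.parseDouble(parts[1]));
        }
        for (int i = 0; i < docs.size(); i++) {
          for (int j = i + 1; j < docs.size(); j++) {
            Text pairKey = new Text(docs.get(i) + "," + docs.get(j));
            ctx.write(pairKey, new DoubleWritable(weights.get(i) * weights.get(j)));
          }
        }
      }
    }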
> > > --sebastian
> > >
> > >
> > >
> > > On 12.08.2010 16:45, Gökhan Çapan wrote:
> > > > Hi,
> > > > I haven't seen the code, but maybe Mahout needs some optimization
> > > > while computing item-item co-occurrences. If it is not already, it
> > > > could be re-implemented using the "stripes" approach with in-mapper
> > > > combining, which is described in:
> > > >
> > > >    1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> > > >
> > > > If it already is, sorry for the post.
> > > >
> > > > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> > > > charly.lizarralde@gmail.com> wrote:
> > > >
> > > >
> > > >> Sebastian, thanks for the reply. The step name is
> > > >> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each
> > > >> map task takes around 10 hours to finish.
> > > >>
> > > >> Reduce task dir
> > > >> (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > > >> has map output files (files like map_2.out) and each one is 5 GB
> > > >> in size.
> > > >>
> > > >> I have been looking at the code and saw what you describe in the
> > > >> e-mail. It makes sense. But 160 GB of intermediate data from a
> > > >> 2.6 GB input file still makes me wonder if something is wrong.
> > > >>
> > > >> Should I just wait for the patch?
> > > >> Thanks again!
> > > >> Charly
> > > >>
> > > >> On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> > > >> ssc.open@googlemail.com> wrote:
> > > >>
> > > >>> Hi Charly,
> > > >>>
> > > >>> Can you tell which Map/Reduce step was executed last before you
> > > >>> ran out of disk space?
> > > >>>
> > > >>> I'm not familiar with the Netflix dataset and can only guess what
> > > >>> happened, but I would say that you ran out of disk space because
> > > >>> ItemSimilarityJob currently uses all preferences to compute the
> > > >>> similarities. This makes it scale with the square of the number of
> > > >>> occurrences of the most popular item, which is a bad thing if that
> > > >>> number is huge. We need a way to limit the number of preferences
> > > >>> considered per item; there is already a ticket for this
> > > >>> (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
> > > >>> provide a patch in the next few days.
> > > >>>
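To put a number on that quadratic scaling (illustrative figures, not
measurements from this thread): an item with n preferences contributes
n * (n - 1) / 2 co-occurring pairs, so a single item rated by 200,000
users accounts for roughly 2 * 10^10 pairs on its own, while capping it
at, say, 500 preferences reduces its contribution to about 125,000 pairs.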
> > > >>> --sebastian
> > > >>>
> > > >>>
> > > >>>
> > > >>> On 12.08.2010 00:15, Charly Lizarralde wrote:
> > > >>>
> > > >>>> Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB)
> > > >>>> and I have just run out of disk space (160 GB) in my
> > > >>>> mapred.local.dir when running RowSimilarityJob.
> > > >>>>
> > > >>>> Is this normal behaviour? How can I improve this?
> > > >>>>
> > > >>>> Thanks!
> > > >>>> Charly
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > Gökhan Çapan
> >
>
