mahout-user mailing list archives

From Johannes Schulte <johannes.schu...@gmail.com>
Subject Re: Mix of Content Based and Collaborative Filtering
Date Tue, 06 Nov 2012 05:16:26 GMT
Good Morning!

Is it possible you are mixing up payloads and stored fields? The latter
are not indexed and can only be used for the top n results. Maybe
we're talking about different things.

With the question of how to include the similarities, I was actually asking
how to include scores such as an LLR value in an index. Do you
just take the top x related items and throw the similarity score away?
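
To make the question concrete, here is a minimal sketch of the two options. It is only an illustration: the item ids and LLR values are made up, the function names are hypothetical, and the `term|score` layout and 4-byte float encoding are assumptions about how one might feed such data to a delimiter-based payload analyzer, not a description of any existing setup.

```python
import struct


def top_related(scores, x):
    """Return the x highest-scoring (item, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:x]


def as_plain_terms(pairs):
    """Option 1: index only the item ids; the similarity score is dropped."""
    return " ".join(item for item, _ in pairs)


def as_delimited_terms(pairs, delimiter="|"):
    """Option 2: keep the score next to each term, e.g. 'item7|12.5'."""
    return " ".join(f"{item}{delimiter}{score}" for item, score in pairs)


def encode_score(score):
    """Pack a score into 4 bytes (big-endian IEEE 754 single precision)."""
    return struct.pack(">f", score)


def decode_score(payload):
    """Inverse of encode_score."""
    return struct.unpack(">f", payload)[0]


# Made-up LLR values for the items related to one source item.
llr = {"item7": 12.5, "item3": 0.25, "item9": 4.0}
pairs = top_related(llr, 2)

plain_field = as_plain_terms(pairs)          # score thrown away
payload_field = as_delimited_terms(pairs)    # score kept per term
```

With option 1 the engine's native scoring does all the weighting; with option 2 the score travels with each term position and could be read back during scoring.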

As for the performance: Yes, sorry, that was a little bragging and not
really informative :) .




On Mon, Nov 5, 2012 at 11:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> On Mon, Nov 5, 2012 at 12:06 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
> >
> > Do you really mean payloads? Because I consider them part of the index, as
> > they are stored per position and can be accessed during scoring.
> >
>
> I had the impression that they were not indexed.  They are definitely
> available if you pull the document, but for high speed scoring, you should
> not do that if you possibly can avoid it.
>
>
> > How would you then incorporate the similarities in an index? With a faked
> > term frequency?
> >
>
> You don't actually need to fake the term frequency.  You can do that if you
> really want to adjust the weightings, but the native scoring in most
> retrieval engines is close enough to what you want that the benefits of
> coherent integration of multiple kinds of data outweigh the defects
> introduced (and it isn't clear that they actually are defects).
>
>
>
> > I always felt that payloads are a very natural and fast way of storing big
> > item-to-item relationships with additional content. You don't have to load
> > everything into memory or use something like a database like you have to do
> > with the current Mahout DataModel.
>
>
> I agree that databases are disasters for this.
>
> But can you access the payload without cracking open the document store?
>
>
> > Instead you have the caching goodness of
> > the Lucene mmap directories without having to worry about heap. At least
> > we're encountering sub-millisecond response times this way...
> >
>
> This is impressive.  At what scale?
>
