mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Mix of Content Based and Collaborative Filtering
Date Mon, 05 Nov 2012 22:38:17 GMT
On Mon, Nov 5, 2012 at 12:06 PM, Johannes Schulte <> wrote:

> Do you really mean payloads? I consider them part of the index, since
> they are stored per position and can be accessed during scoring.

I had the impression that they were not indexed.  They are definitely
available if you pull the document, but for high speed scoring, you should
not do that if you possibly can avoid it.

> How would you then incorporate the similarities in an index? With a faked
> term frequency?

You don't actually need to fake the term frequency.  You can do that if you
really want to adjust the weightings, but the native scoring in most
retrieval engines is close enough to what you want that the benefits of
coherent integration of multiple kinds of data outweigh any defects
introduced (and it isn't clear that they actually are defects).
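
To make the point about native scoring concrete, here is a minimal sketch in
plain Python (not Lucene).  It assumes each item is indexed as one document
with a "text" field and an "indicators" field holding the IDs of related
items, and it uses a simplified tf*idf sum as a stand-in for a retrieval
engine's built-in scoring.  The data, field names, and the functions
build_index and score are all hypothetical and only illustrate the idea.

```python
import math
from collections import defaultdict

# Hypothetical data: each item document carries content terms plus
# "indicator" terms, i.e. the IDs of items found to be related to it.
ITEMS = {
    "item1": {"text": ["lucene", "search"], "indicators": ["item2", "item3"]},
    "item2": {"text": ["mahout", "recommender"], "indicators": ["item1"]},
    "item3": {"text": ["lucene", "payload"], "indicators": ["item1", "item2"]},
}


def build_index(items):
    """Build an inverted index: "field:term" -> {doc_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, fields in items.items():
        for field, terms in fields.items():
            for term in terms:
                index[f"{field}:{term}"][doc_id] += 1
    return index


def score(index, n_docs, query_terms):
    """Rank documents by a simplified tf*idf sum (an assumed stand-in for
    an engine's native scoring, not Lucene's actual formula)."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = 1.0 + math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += math.sqrt(tf) * idf
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    index = build_index(ITEMS)
    # One query mixes behavioral history (item1) with a content term.
    query = ["indicators:item1", "text:lucene"]
    for doc_id, s in score(index, len(ITEMS), query):
        print(doc_id, round(s, 3))
```

The indicator term and the content term go through the same scoring path, so
no faked term frequency is needed; an item that matches both simply
accumulates more score.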

> I always felt that payloads are a very natural and fast way of storing big
> item-to-item relationships with additional content. You don't have to load
> everything into memory or use something like a database, as you have to do
> with the current Mahout DataModel.

I agree that databases are disasters for this.

But can you access the payload without cracking open the document store?

> Instead you have the caching goodness of
> the Lucene mmap directories without having to worry about heap. At least
> we're encountering sub-millisecond response times this way...

This is impressive.  At what scale?
