mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Solr-recommender for Mahout 0.9
Date Fri, 08 Nov 2013 17:54:17 GMT
For recommendation work, I suggest that it would be better to simply code
out an explicit OR query.
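
A minimal sketch of what an explicit OR query could look like, assuming the similar-item ids are indexed in a field called `similar_items` and that the classic Lucene query syntax is in use. The field name, the class name `OrQueryBuilder`, and the method `buildOrQuery` are hypothetical; the thread does not name any of them.

```java
import java.util.List;
import java.util.stream.Collectors;

public class OrQueryBuilder {

    // Builds a query string of the form field:("id1" OR "id2" OR "id3").
    // Each id is quoted so it is treated as a single term, and any
    // backslash or double quote inside an id is escaped.
    public static String buildOrQuery(String field, List<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("need at least one id");
        }
        return ids.stream()
                .map(id -> "\"" + id.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" OR ", field + ":(", ")"));
    }

    public static void main(String[] args) {
        // "similar_items" and the ids are placeholder values for illustration.
        System.out.println(buildOrQuery("similar_items", List.of("v101", "v202", "v303")));
        // prints: similar_items:("v101" OR "v202" OR "v303")
    }
}
```

Spelling out the OR operators means the result does not depend on the configured default operator (`q.op`) or on a dismax `mm` setting, which is what caused the empty result sets discussed further down the thread.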




On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:

> Hi Pat,
>
> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
> > Another approach would be to weight the terms in the docs by their
> Mahout similarity strength. But that will be for another day.
> >
> > My current question is whether Lucene looks at word proximity. I see the
> query syntax supports proximity but I don’t see that it is default so
> that’s good.
>
> Based on your description of what you do (generate an OR query of N terms)
> then no, you shouldn't be getting a boost from proximity.
>
> Note that with edismax you can specify a phrase boost, but it will be on
> the entire set of terms being searched, so unlikely to come into play even
> if you were using that.
>
> -- Ken
>
>
> >
> >
> > On Nov 7, 2013, at 12:41 PM, Dyer, James <James.Dyer@ingramcontent.com>
> wrote:
> >
> > To the best of my knowledge, Lucene does not care about the position of a
> keyword within a document.
> >
> > You could bucket the ids into several fields.  Then use a dismax query
> to boost the top-tier ids more than the second, etc.
> >
> > A more fine-grained approach would probably involve a custom Similarity
> class that scales the score based on its position in the document.  If we
> did this, it might be simpler to index as 1 single-valued field so each id
> was position+1 rather than position+100, etc.
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -----Original Message-----
> > From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> > Sent: Thursday, November 07, 2013 1:46 PM
> > To: user@mahout.apache.org
> > Subject: Re: Solr-recommender for Mahout 0.9
> >
> > Interesting to think about ordering and adjacentness. The index ids are
> sorted by Mahout strength so the first id is the most similar to the row
> key and so forth. But the query is ordered by recency. In both cases the
> first id is in some sense the most important. Does Solr/Lucene care about
> closeness to the top of doc for queries or indexed docs? I don't recall any
> mention of this.
> >
> > However adjacentness has no meaning in recommendations though I think
> it's used in default queries so I may have to account for that.
> >
> > The object returned is an ordered list of ids. I use only the IDs now
> but there are cases when the contents are also of interest; shopping
> cart/watchlist queries for example.
> >
> > On Nov 7, 2013, at 10:00 AM, Dyer, James <James.Dyer@ingramcontent.com>
> wrote:
> >
> > The multivalued field will obey the "positionIncrementGap" value you
> specify (default=100).  So for querying purposes, those ids will be 100
> (or whatever you specified) positions apart.  So a phrase search for
> adjacent ids would not match, unless you set the slop for >=
> positionIncrementGap.  Other than this, both scenarios index the same.
> >
> > For stored fields, solr returns an array of values for multivalued
> fields, which is convenient when writing a UI.
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -----Original Message-----
> > From: Dominik Hübner [mailto:contact@dhuebner.com]
> > Sent: Thursday, November 07, 2013 11:23 AM
> > To: user@mahout.apache.org
> > Subject: Re: Solr-recommender for Mahout 0.9
> >
> > Does anyone know what the difference is between keeping the ids in a
> space delimited string and indexing a multivalued field of ids? I recently
> tried the latter since ... it felt right; however, I am not sure which of
> the two has which advantages.
> >
> > On 07 Nov 2013, at 18:18, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> >
> >> I have dismax (no edismax) but am not using it yet, using the default
> query, which does use 'AND'. I had much the same thought as I slept on it.
> Changing to OR is now working much much better. So obvious it almost bit
> me, not good in this case...
> >>
> >> With only a trivially small amount of testing I'd say we have a new
> recommender on the block.
> >>
> >> If anyone would like to help eyeball test the thing let me know
> off-list. There are a few instructions I'll need to give. And it can't
> handle much load right now due to intentional design limits.
> >>
> >>
> >> On Nov 7, 2013, at 6:11 AM, Dyer, James <James.Dyer@ingramcontent.com>
> wrote:
> >>
> >> Pat,
> >>
> >> Can you give us the query it generates when you enter "vampire werewolf
> zombie", q/qt/defType ?
> >>
> >> My guess is you're using the default query parser with "q.op=AND" , or,
> you're using dismax/edismax with a high "mm" (min-must-match) value.
> >>
> >> James Dyer
> >> Ingram Content Group
> >> (615) 213-4311
> >>
> >>
> >> -----Original Message-----
> >> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> >> Sent: Wednesday, November 06, 2013 5:53 PM
> >> To: ssc@apache.org Schelter; user@mahout.apache.org
> >> Subject: Re: Solr-recommender for Mahout 0.9
> >>
> >> Done,
> >>
> >> BTW I have the thing running on a demo site but am getting very poor
> results that I think are related to the Solr setup. I'd appreciate any
> ideas.
> >>
> >> The sample data has 27,000 items and something like 4000 users. The
> preference data is fairly dense since the users are professional reviewers
> and the items videos.
> >>
> >> 1) The number of item-item similarities that are kept is 100. Is this a
> good starting point? Ted, do you recall how many you used before?
> >> 2) The query is a simple text query made of space delimited video id
> strings. These are the same ids as are stored in the item-item similarity
> docs that Solr indexes.
> >>
> >> Hit thumbs up on one video and you get several recommendations. Hit
> thumbs up on several videos you get no recs. I'm either using the wrong
> query type or have it set up to be too restrictive. As I read through the
> docs if someone has a suggestion or pointer I'd appreciate it.
> >>
> >> BTW the same sort of thing happens with Title search. Search for
> "vampire werewolf zombie" you get no results, search for "zombie" you get
> several.
> >>
> >> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ssc@apache.org> wrote:
> >>
> >> Hi Pat,
> >>
> >> can you create issues for 1) and 2) ? Then I will try to get this into
> >> trunk asap.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 06.11.2013 19:13, Pat Ferrel wrote:
> >>> Trying to integrate the Solr-recommender with the latest Mahout
> snapshot. The project uses a modified RecommenderJob because it needs
> SequenceFile output and to get the location of the preparePreferenceMatrix
> directory. If #1 and #2 are addressed I can remove the modified Mahout code
> from the project and rely on the default implementations in Mahout 0.9. #3
> is a longer term issue related to the creation of a CrossRowSimilarityJob.
> >>>
> >>> I have dropped the modified code from the Solr-recommender project and
> have a modified build of the current Mahout 0.9 snapshot. If the following
> changes are made to Mahout I can test and release a Mahout 0.9 version of
> the Solr-recommender.
> >>>
> >>> 1. Option to change RecommenderJob output format
> >>>
> >>> Can someone add an option to output a SequenceFile? I modified the
> code to do the following. Note the SequenceFileOutputFormat.class as the
> last parameter, but this should really be determined with an option I think.
> >>>
> >>> Job aggregateAndRecommend = prepareJob(
> >>>         new Path(aggregateAndRecommendInput), outputPath,
> SequenceFileInputFormat.class,
> >>>         PartialMultiplyMapper.class, VarLongWritable.class,
> PrefAndSimilarityColumnWritable.class,
> >>>         AggregateAndRecommendReducer.class, VarLongWritable.class,
> RecommendedItemsWritable.class,
> >>>         SequenceFileOutputFormat.class);
> >>>
> >>> 2. Visibility of preparePreferenceMatrix directory location
> >>>
> >>> The Solr-recommender needs to find where the RecommenderJob is putting
> its output.
> >>>
> >>> Mahout 0.8 RecommenderJob code was:
> >>> public static final String DEFAULT_PREPARE_DIR =
> "preparePreferenceMatrix";
> >>>
> >>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
> inline in the code:
> >>> Path prepPath = getTempPath("preparePreferenceMatrix");
> >>>
> >>> This change to Mahout 0.9 works:
> >>> public static final String DEFAULT_PREPARE_DIR =
> "preparePreferenceMatrix";
> >>> and
> >>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
> >>>
> >>> You could also make this a getter method on the RecommenderJob Class
> instead of using a public constant.
> >>>
> >>> 3. Downsampling
> >>>
> >>> The downsampling for maximum prefs per user has been moved from
> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
> matrix math instead of RSJ so it will no longer support downsampling until
> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
> >>>
> >>>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
> >
> >
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr