mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Solr-recommender for Mahout 0.9
Date Sun, 17 Nov 2013 23:36:07 GMT
Eventually I’d like to get MAP built into the solr-recommender. I used it at a client who had
good data, and it was very helpful for exploring which data was useful and which wasn’t. We’d
run MAP with and without detail-view data, for instance, and take the MAP as a measure of how
predictive the data was. In our case the MAP@k numbers went down with purchase and detail-view
mixed together. That was why I got interested in the cross-action recommender, as a way to
scrub less predictive actions. Unfortunately I didn’t finish it before I lost access to the data.

What form of precision calc will you use? We used mean average precision at different
numbers of recommendations, which had the effect of producing a fall-off curve. We took the
curve as a measure of how well our ranking was working.
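
For concreteness, a minimal sketch of the calc (hypothetical types, not solr-recommender code): per user, average the precision at each rank where a held-out item appears, then average over all users.

import java.util.List;
import java.util.Set;

// MAP@k sketch: recs is a user's ranked recommendation list, heldOut the
// preferences hidden from training; both are hypothetical stand-ins.
public class MapAtK {
  static double apAtK(List<String> recs, Set<String> heldOut, int k) {
    double hits = 0.0, sum = 0.0;
    for (int i = 0; i < Math.min(k, recs.size()); i++) {
      if (heldOut.contains(recs.get(i))) {
        hits++;
        sum += hits / (i + 1);  // precision at rank i+1, counted only at hits
      }
    }
    return heldOut.isEmpty() ? 0.0 : sum / Math.min(k, heldOut.size());
  }

  static double mapAtK(List<List<String>> recs, List<Set<String>> heldOut, int k) {
    double total = 0.0;
    for (int u = 0; u < recs.size(); u++) {
      total += apAtK(recs.get(u), heldOut.get(u), k);
    }
    return recs.isEmpty() ? 0.0 : total / recs.size();
  }
}

Running MAP@5, MAP@10, MAP@20 and so on is what produces the fall-off curve.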

On Nov 17, 2013, at 10:47 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:

Hi Pat,

On Nov 13, 2013, at 4:43pm, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> Ever done an offline precision calc?

No, sorry.

I do (finally) have one client with some data that could be used to calculate precision, and
a willingness to pay for the work, so I'm hoping to include details on that in my next blog
post about text feature selection.

-- Ken


>> On Nov 13, 2013, at 1:39 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>> 
>> Hi Pat,
>> 
>>> On Nov 13, 2013, at 9:21am, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> 
>>> A version is now checked in that uses Mahout 0.9. Haven’t tested it on a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1, which takes some time.
>>> 
>>> Saw the Strata slides from Ted touting dithering of results, which I’ll implement.
>>> 
>>> Ken, did you have anything specific for "And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates”?
>> 
>> If I'm looking for the top N best matches, I'll do a Solr query with rows=N*4.
>> 
>> Then I use all of the data from these potential matches, and calculate a more sophisticated similarity score (e.g. adding a weighting based on the user's activity level) between my target and these candidates.
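>> 
>> In code the shape is something like this (Candidate, querySolr() and rescore() are hypothetical stand-ins for the real scoring):
>> 
>> import java.util.Collections;
>> import java.util.Comparator;
>> import java.util.List;
>> 
>> // Stage 1: over-fetch from Solr (rows = n*4); stage 2: re-rank with the richer score.
>> static List<Candidate> topN(String query, int n) {
>>   List<Candidate> candidates = querySolr(query, n * 4);
>>   Collections.sort(candidates, new Comparator<Candidate>() {
>>     public int compare(Candidate a, Candidate b) {
>>       return Double.compare(rescore(b), rescore(a));  // descending by re-score
>>     }
>>   });
>>   return candidates.subList(0, Math.min(n, candidates.size()));
>> }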
>> 
>> Regards,
>> 
>> -- Ken
>> 
>>> 
>>> Was planning to try boosting by something like genre/category in the recs query. For instance, in the demo data, each item will soon have a set of tags (actually genre names) so these could be a field being queried along with the item-item links. The query for recs would then include the user history against the item-item links, and the average genre tags preferred by the user against item genre tags. This would return recs skewed towards the user’s genre preference.
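>>> 
>>> Concretely, with made-up field names and ids, such a query might look like:
>>> 
>>>   q=item_links:(id21 OR id98 OR id305) genre_tags:(drama^1.5 OR comedy^0.7)
>>> 
>>> where item_links carries the user-history terms and the genre clause only skews the ranking rather than filtering anything out.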
>>> 
>>> Another way this could be used is when showing similar items. You’d have the tags for the item being viewed and so could use them to skew towards items with similar tags. I think this works but would turn similar items from a lookup (they are pre-calculated by Mahout) into another Solr query.
>>> 
>>> 
>>> 
>>> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> 
>>> Not planning to do anything with weights at present. An ORed query plus Solr weights should suffice for the time being. There is a good list of ways to do this later if it warrants an experiment. Thanks.
>>> 
>>> Have similar items as input, recommendations from user “likes”, and just got recs from recently viewed working. Once you have online recs from the pre-calculated model, experimenting is super easy. The next step will be to get more metadata ingested so we can try boosting by context genre, or recently viewed genre, which is sort of in line with "more specific scoring to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary the choices you see.
>>> 
>>> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>>> 
>>> One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.
>>> 
>>> E.g. "the_kings_speech^1.5 OR skyfalll^0.5 OR looper^3.0…"
>>> 
>>> If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.
>>> 
>>> But in my experience the gain from applying weights to terms in the index isn't very significant.
>>> 
>>> And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.
>>> 
>>> -- Ken
>>> 
>>>> On Nov 8, 2013, at 9:54am, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> 
>>>> For recommendation work, I suggest that it would be better to simply code out an explicit OR query.
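>>>> 
>>>> For example (ids made up): q=item_links:(movie12 OR movie34 OR movie56), where every matching term adds to the score but none is required.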
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>>>> 
>>>>> Hi Pat,
>>>>> 
>>>>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>>>>> 
>>>>>> Another approach would be to weight the terms in the docs by their Mahout similarity strength. But that will be for another day.
>>>>>> 
>>>>>> My current question is whether Lucene looks at word proximity. I see the query syntax supports proximity, but I don't see that it is the default, so that's good.
>>>>> 
>>>>> Based on your description of what you do (generate an OR query of N terms), then no, you shouldn't be getting a boost from proximity.
>>>>> 
>>>>> Note that with edismax you can specify a phrase boost, but it will be on the entire set of terms being searched, so unlikely to come into play even if you were using that.
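>>>>> 
>>>>> (That would be something like defType=edismax&qf=item_links&pf=item_links^2, with a made-up field name; pf applies the phrase boost across the whole field.)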
>>>>> 
>>>>> -- Ken
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>>>>>> 
>>>>>> Best to my knowledge, Lucene does not care about the position of a keyword within a document.
>>>>>> 
>>>>>> You could bucket the ids into several fields.  Then use a dismax query to boost the top-tier ids more than the second, etc.
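>>>>>> 
>>>>>> The bucketed version might then be queried with something like defType=dismax&qf=ids_top^4 ids_mid^2 ids_tail&q=id21 id98 (field names made up).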
>>>>>> 
>>>>>> A more fine-grained approach would probably involve a custom Similarity class that scales the score based on its position in the document.  If we did this, it might be simpler to index as 1 single-valued field so each id was position+1 rather than position+100, etc.
>>>>>> 
>>>>>> James Dyer
>>>>>> Ingram Content Group
>>>>>> (615) 213-4311
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>>>> Sent: Thursday, November 07, 2013 1:46 PM
>>>>>> To: user@mahout.apache.org
>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>>> 
>>>>>> Interesting to think about ordering and adjacentness. The index ids are sorted by Mahout strength so the first id is the most similar to the row key and so forth. But the query is ordered by recency. In both cases the first id is in some sense the most important. Does Solr/Lucene care about closeness to the top of the doc for queries or indexed docs? I don't recall any mention of this.
>>>>>> 
>>>>>> However adjacentness has no meaning in recommendations, though I think it's used in default queries, so I may have to account for that.
>>>>>> 
>>>>>> The object returned is an ordered list of ids. I use only the IDs now but there are cases when the contents are also of interest; shopping cart/watchlist queries for example.
>>>>>> 
>>>>>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>>>>>> 
>>>>>> The multivalued field will obey the "positionIncrementGap" value you specify (default=100).  So for querying purposes, those id's will be 100 (or whatever you specified) positions apart.  So a phrase search for adjacent ids would not match, unless you set the slop to >= positionIncrementGap.  Other than this, both scenarios index the same.
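>>>>>> 
>>>>>> (The gap is set on the field type, e.g. <fieldType name="ids" class="solr.TextField" positionIncrementGap="100">, and a sloppy phrase such as "id21 id98"~100 could then match across adjacent values.)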
>>>>>> 
>>>>>> For stored fields, Solr returns an array of values for multivalued fields, which is convenient when writing a UI.
>>>>>> 
>>>>>> James Dyer
>>>>>> Ingram Content Group
>>>>>> (615) 213-4311
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>>>>> Sent: Thursday, November 07, 2013 11:23 AM
>>>>>> To: user@mahout.apache.org
>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>>> 
>>>>>> Does anyone know what the difference is between keeping the ids in a space-delimited string and indexing a multivalued field of ids? I recently tried the latter since ... it felt right; however, I am not sure which of the two has which advantages.
>>>>>> 
>>>>>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>>>>>> 
>>>>>>> I have dismax (not edismax) but am not using it yet, using the default query, which does use 'AND'. I had much the same thought as I slept on it. Changing to OR is now working much, much better. So obvious it almost bit me, not good in this case...
>>>>>>> 
>>>>>>> With only a trivially small amount of testing I'd say we have a new recommender on the block.
>>>>>>> 
>>>>>>> If anyone would like to help eyeball-test the thing let me know off-list. There are a few instructions I'll need to give. And it can't handle much load right now due to intentional design limits.
>>>>>>> 
>>>>>>> 
>>>>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>>>>>>> 
>>>>>>> Pat,
>>>>>>> 
>>>>>>> Can you give us the query it generates when you enter "vampire werewolf zombie", q/qt/defType?
>>>>>>> 
>>>>>>> My guess is you're using the default query parser with "q.op=AND", or you're using dismax/edismax with a high "mm" (min-must-match) value.
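>>>>>>> 
>>>>>>> (With q.op=AND, "vampire werewolf zombie" parses to +vampire +werewolf +zombie, so one missing term kills the match; with OR, each term is optional and just adds score.)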
>>>>>>> 
>>>>>>> James Dyer
>>>>>>> Ingram Content Group
>>>>>>> (615) 213-4311
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>>>>> To: Sebastian Schelter <ssc@apache.org>; user@mahout.apache.org
>>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>>>> 
>>>>>>> Done,
>>>>>>> 
>>>>>>> BTW I have the thing running on a demo site but am getting very poor results that I think are related to the Solr setup. I'd appreciate any ideas.
>>>>>>> 
>>>>>>> The sample data has 27,000 items and something like 4,000 users. The preference data is fairly dense since the users are professional reviewers and the items are videos.
>>>>>>> 
>>>>>>> 1) The number of item-item similarities that are kept is 100. Is this a good starting point? Ted, do you recall how many you used before?
>>>>>>> 2) The query is a simple text query made of space-delimited video id strings. These are the same ids as are stored in the item-item similarity docs that Solr indexes.
>>>>>>> 
>>>>>>> Hit thumbs up on one video and you get several recommendations. Hit thumbs up on several videos and you get no recs. I'm either using the wrong query type or have it set up to be too restrictive. As I read through the docs, if someone has a suggestion or pointer I'd appreciate it.
>>>>>>> 
>>>>>>> BTW the same sort of thing happens with title search. Search for "vampire werewolf zombie" and you get no results; search for "zombie" and you get several.
>>>>>>> 
>>>>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ssc@apache.org> wrote:
>>>>>>> 
>>>>>>> Hi Pat,
>>>>>>> 
>>>>>>> Can you create issues for 1) and 2)? Then I will try to get this into trunk ASAP.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>> 
>>>>>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>>>>> Trying to integrate the Solr-recommender with the latest Mahout snapshot. The project uses a modified RecommenderJob because it needs SequenceFile output and to get the location of the preparePreferenceMatrix directory. If #1 and #2 are addressed I can remove the modified Mahout code from the project and rely on the default implementations in Mahout 0.9. #3 is a longer-term issue related to the creation of a CrossRowSimilarityJob.
>>>>>>>> 
>>>>>>>> I have dropped the modified code from the Solr-recommender project and have a modified build of the current Mahout 0.9 snapshot. If the following changes are made to Mahout I can test and release a Mahout 0.9 version of the Solr-recommender.
>>>>>>>> 
>>>>>>>> 1. Option to change RecommenderJob output format
>>>>>>>> 
>>>>>>>> Can someone add an option to output a SequenceFile? I modified the code to do the following; note the SequenceFileOutputFormat.class as the last parameter, but this should really be determined with an option, I think.
>>>>>>>> 
>>>>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>>>>     new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>>>>>>>>     PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>>>>>>>>     AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>>>>>>>>     SequenceFileOutputFormat.class);
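>>>>>>>> 
>>>>>>>> One possible shape for that option, using AbstractJob's option helpers (the flag name here is made up, not an existing Mahout option):
>>>>>>>> 
>>>>>>>> addOption("outputFormat", "of", "output format: 'text' or 'sequencefile'", "text");
>>>>>>>> // ... then when building the job:
>>>>>>>> Class<?> outputFormat = "sequencefile".equals(getOption("outputFormat"))
>>>>>>>>     ? SequenceFileOutputFormat.class : TextOutputFormat.class;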
>>>>>>>> 
>>>>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>>>>> 
>>>>>>>> The Solr-recommender needs to find where the RecommenderJob is putting its output.
>>>>>>>> 
>>>>>>>> Mahout 0.8 RecommenderJob code was:
>>>>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>>>>> 
>>>>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix" inline in the code:
>>>>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>>>>> 
>>>>>>>> This change to Mahout 0.9 works:
>>>>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>>>>> and
>>>>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>>>>> 
>>>>>>>> You could also make this a getter method on the RecommenderJob class instead of using a public constant.
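>>>>>>>> 
>>>>>>>> E.g. something like (hypothetical, not in Mahout today):
>>>>>>>> 
>>>>>>>> public static String getPrepareDir() { return "preparePreferenceMatrix"; }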
>>>>>>>> 
>>>>>>>> 3. Downsampling
>>>>>>>> 
>>>>>>>> The downsampling for maximum prefs per user has been moved from PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses matrix math instead of RSJ so it will no longer support downsampling until there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>> 
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://www.scaleunlimited.com
>>>>> custom big data solutions & training
>>>>> Hadoop, Cascading, Cassandra & Solr
>>> 
>>> --------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Cassandra & Solr
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr