mahout-user mailing list archives

From Johannes Schulte <johannes.schu...@gmail.com>
Subject Re: Implicit preferences
Date Mon, 11 Feb 2013 19:59:04 GMT
@Ted: Thanks for the input!

@Ken
I think the boosting is buried in getSpanScore() calling super.score(), but
a quick look at the code didn't give me the exact reason. I remember
setting includeSpanScore to false and the boost disappearing. The use of
payloads, however, was only necessary because I put the similarity scores
in there and then summed them up. In the "new" system I am not going to
use payloads (so far).

Thanks for the hint with the int field. I am using plain Lucene, but the
semantics should be the same. I didn't know you could turn off the bonus
features.
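
(A minimal sketch of the plain-Lucene 3.x equivalent, as I understand it; the
field name and index value are made up. A NumericField with a precision step
of at least 32 bits should index a single term per int value, without the
extra trie terms:)

    // org.apache.lucene.document.{Document, Field, NumericField}
    Document doc = new Document();
    int hashedIndex = 12345;      // hypothetical hashed feature index
    NumericField f = new NumericField("feature", Integer.MAX_VALUE,
        Field.Store.NO, true);    // precisionStep >= 32 bits: one term, no trie terms
    f.setIntValue(hashedIndex);
    doc.add(f);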

On Mon, Feb 11, 2013 at 7:27 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:

>
> On Feb 11, 2013, at 1:57am, Johannes Schulte wrote:
>
> > @Ken
> > Thanks for the hints...
> > I am coming from a payload-based system, so I am aware of them; however,
> > in the Lucene 3.6 branch boosting and payloads didn't work together (if
> > you set PayloadTermQuery.setIncludeSpanScore to false, boosts were
> > ignored)
>
> I assume you're talking about passing false for the includeSpanScore
> parameter in the PayloadTermQuery constructor, yes?
>
> Anyway, I'm surprised you ran into this issue. In the 3.6.0 source for
> PayloadTermQuery, the score() method is:
>
>       @Override
>       public float score() throws IOException {
>         return includeSpanScore ? getSpanScore() * getPayloadScore()
>             : getPayloadScore();
>       }
>
> So I would assume that you'd get the payload score (as expected). But I
> haven't actually tried to validate this.
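>
> (For reference, a minimal sketch of the construction being discussed, using
> the Lucene 3.6 API; the field and term values here are made up:)
>
>     // org.apache.lucene.index.Term
>     // org.apache.lucene.search.payloads.{PayloadTermQuery, AveragePayloadFunction}
>     PayloadTermQuery q = new PayloadTermQuery(
>         new Term("features", "12345"),   // hashed feature index as a term
>         new AveragePayloadFunction(),    // combines payloads across matches
>         false);                          // includeSpanScore=false: payload score only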
>
> > Besides that, there is no performance issue here so far, so it's probably
> > a fine way to go; I was just curious... As for the IntField / TrieIntField,
> > all the range query / ordering benefits of it are overhead, since the
> > integers just represent random indices into a vector. I might look into
> > indexing the integer bytes rather than the string representation…
>
> I was proposing you use:
>
>     <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
>         positionIncrementGap="0"/>
>
> Which doesn't generate the extra values that make range queries faster,
> but should store the data more efficiently.
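>
> A field using that type might then be declared as (field name hypothetical):
>
>     <field name="features" type="int" indexed="true" stored="false"
>         multiValued="true"/>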
>
> -- Ken
>
> > @Ted
> > You are probably right with choosing 1 as term frequency; I forgot that
> > the most interesting information probably comes from the idf, and using
> > cooccurrence counts as term frequency might make the combination with
> > text searches infeasible since the values lie in some totally different
> > range. Also, I forgot that idf is per field, so I might go for separating
> > the hashed values into their originating fields (search term, item_id,
> > category_id). This would still allow recombining them later when a user
> > profile has to be constructed.
> >
> >> I like to threshold with LLR. That gives me a binary matrix. Then I
> >> directly index that.
> >>
> >> The search engine provides very nice weights at this point. I don't feel
> >> the need to adjust those weights because they have roughly the same form
> >> as learned weights are likely to have, and because learning those weights
> >> would almost certainly result in over-fitting unless I go to quite a lot
> >> of trouble.
> >>
> >> Also, I have heard that at least one head-to-head test found that the
> >> native Solr term weighting actually out-performed several more intricate
> >> and explicit weighting schemes. That can't be taken as evidence that
> >> Solr's weightings would perform better than whatever you have in mind,
> >> but it does provide interesting meta-evidence that even a reasonably
> >> smart dev team is not guaranteed to beat Solr's weighting by a large
> >> margin. When you sit down to architect your system, you need to make
> >> decisions about where to spend your time, and evidence like that is
> >> helpful for guessing how much effort it would take to achieve different
> >> levels of performance.
> >
> >
> >
> > I am also thresholding the counts with LLR. Every time I do this I take
> > a threshold of 10, since I loosely remember it being about the 99%
> > confidence level of the chi-square distribution. I have no clue, however,
> > whether anybody wants something like 99% for recommendations or if 50%
> > might be a better value. What's your experience on that?
> >
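> > (For reference, assuming one degree of freedom: the 99% quantile of the
> > chi-square distribution is about 6.63, so a threshold of 10 is somewhat
> > stricter, roughly the 99.8% level.)
> >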
> > And do you apply a limit on the total number of docs per term, since
> > there could be big boolean queries dragging down performance?
> >
> > Thanks for all the input!
> >
> >
> >
> > On Mon, Feb 11, 2013 at 7:20 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> >> On Sun, Feb 10, 2013 at 3:39 PM, Johannes Schulte <
> >> johannes.schulte@gmail.com> wrote:
> >>
> >>> ...
> >>> I am currently implementing a system of the same kind, LLR-sparsified
> >>> "term" cooccurrence vectors in Lucene (since not a day goes by where I
> >>> don't see Ted praising this).
> >>>
> >>
> >> (turns red)
> >>
> >>
> >>> There are not only views and purchases, but also search terms, facets,
> >>> and a lot more textual information to be included in the cooccurrence
> >>> matrix (as "input").
> >>> That's why I went with the feature hashing framework in Mahout. This
> >>> gives small (hd/mem) user profiles and allows for reusing the vectors
> >>> for click prediction and/or clustering.
> >>
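> >>> A minimal sketch of that kind of hashed encoding (Mahout's encoder API;
> >>> the field names, tokens, and vector size here are made up):
> >>>
> >>>     // org.apache.mahout.math.{RandomAccessSparseVector, Vector}
> >>>     // org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder
> >>>     Vector profile = new RandomAccessSparseVector(1 << 18);
> >>>     StaticWordValueEncoder terms = new StaticWordValueEncoder("search_term");
> >>>     StaticWordValueEncoder items = new StaticWordValueEncoder("item_id");
> >>>     terms.addToVector("red shoes", 1.0, profile);  // hashed into the vector
> >>>     items.addToVector("item_4711", 1.0, profile);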
> >>
> >> This is a reasonable choice.  For recommendations, you might want to use
> >> direct encoding since it can be simpler to build a search index for
> >> recommending.
> >>
> >>
> >>> The main difference is that there are only two fields in Lucene with
> >>> a lot of terms (numbers), representing the features. Two fields because
> >>> I think predicting views (besides purchases) might in some cases be
> >>> better than predicting nothing.
> >>>
> >>
> >> OK.
> >>
> >>
> >>> I don't think it should make a big difference in scoring, because in
> >>> a vector space model used by most engines it's just, well, a vector
> >>> space, and I don't know if the field norms make sense after stripping
> >>> values from the term vectors with the LLR threshold.
> >>>
> >>
> >> Having separate fields is going to give separate total term counts.
> >> That seems better to me, but I have to confess I have never rigorously
> >> tested that.
> >>
> >>
> >>> @Ted
> >>>> It is handy to simply use the binary values of the sparsified
> >>>> versions of these and let the search engine handle the weighting of
> >>>> different components at query time.
> >>>
> >>> Do you really want to omit the cooccurrence counts, which would become
> >>> the term frequencies? How would the engine then weight different inputs
> >>> against each other?
> >>>
> >>
> >> I like to threshold with LLR.  That gives me a binary matrix.  Then I
> >> directly index that.
> >>
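> >> (A sketch of that thresholding step, using Mahout's LogLikelihood class;
> >> the counts and the cutoff of 10 are illustrative:)
> >>
> >>     // org.apache.mahout.math.stats.LogLikelihood
> >>     // k11: A and B together; k12: A without B; k21: B without A; k22: neither
> >>     long k11 = 120, k12 = 380, k21 = 4300, k22 = 995200;   // made-up counts
> >>     double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
> >>     boolean keep = llr > 10.0;   // keep as a binary 1 only above the threshold
> >>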
> >> The search engine provides very nice weights at this point. I don't feel
> >> the need to adjust those weights because they have roughly the same form
> >> as learned weights are likely to have, and because learning those weights
> >> would almost certainly result in over-fitting unless I go to quite a lot
> >> of trouble.
> >>
> >> Also, I have heard that at least one head-to-head test found that the
> >> native Solr term weighting actually out-performed several more intricate
> >> and explicit weighting schemes. That can't be taken as evidence that
> >> Solr's weightings would perform better than whatever you have in mind,
> >> but it does provide interesting meta-evidence that even a reasonably
> >> smart dev team is not guaranteed to beat Solr's weighting by a large
> >> margin. When you sit down to architect your system, you need to make
> >> decisions about where to spend your time, and evidence like that is
> >> helpful for guessing how much effort it would take to achieve different
> >> levels of performance.
> >>
> >>> And, if anyone knows a
> >>> 1. smarter way to index the cooccurrence counts in Lucene than a
> >>> tokenstream that emits a word k times for a cooccurrence count of k
> >>>
> >>
> >> You can use payloads or you can boost individual terms.
> >>
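> >> (For concreteness, the repeat-a-token stream mentioned in the question
> >> might look like this; Lucene 3.x attribute API, class name made up:)
> >>
> >>     // org.apache.lucene.analysis.TokenStream
> >>     // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
> >>
> >>     // Emits the same term exactly `count` times, so that the indexed
> >>     // term frequency equals the cooccurrence count.
> >>     final class RepeatTokenStream extends TokenStream {
> >>       private final CharTermAttribute termAtt =
> >>           addAttribute(CharTermAttribute.class);
> >>       private final String term;
> >>       private final int count;
> >>       private int emitted = 0;
> >>
> >>       RepeatTokenStream(String term, int count) {
> >>         this.term = term;
> >>         this.count = count;
> >>       }
> >>
> >>       @Override
> >>       public boolean incrementToken() {
> >>         if (emitted == count) return false;
> >>         clearAttributes();
> >>         termAtt.setEmpty().append(term);
> >>         emitted++;
> >>         return true;
> >>       }
> >>     }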
> >>
> >>> 2. way to avoid treating the (hashed) vector column indices as terms,
> >>> but reusing them? It's a bit weird hashing to an int and then having
> >>> the Lucene term dictionary treat them as strings, mapping to another
> >>> int
> >>
> >> Why do we care about this? These tokens get put onto documents that
> >> have additional data to help them make sense, but why do we care if the
> >> tokens look like numbers?
> >>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
