lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Bennett <>
Subject Revisiting IDF Problems and Index Slices
Date Thu, 06 Aug 2009 18:27:52 GMT
I'm investigating a problem I bet some of you have hit before, and exploring
several options to address it.  I suspect that this specific IDF scenario is
common enough that it even has a name, though I'm not what it would be

The scenario:

Suppose you have a search application focused on the products that you sell:
* You initially launch with just one product area for your "widget".
* You add a stop words list for really common words, say "a", "the", "and",

Looking at the search logs you discover:
* Other common English words being used, such as "i", "how" and "can".
These are automatically getting low IDF scores, since they are also common
in your documents, so normally you wouldn't worry.
* But you also notice that the words "widget" is getting a very low score,
since it is mentioned in virtually every document.  In fact, since widget is
more common than the word "can", it is actually getting a *lower* IDF score
than the word "can", giving some very unsatisfying search results.

Some queries include other "good" words like "replace" and "battery", so
those searches are doing OK; those terms are relatively rare, so they
properly get the most weight.

But some searches have nothing but "bad" words, for example "how can I get a
widget ?"
* The word "a" is stopped out.
* The words "how", "can", "i" and "get" are common English words, so they
get a low score.
* And the word "widget", your flagship product, is also getting a low score,
since it's in every document.
So some queries are ENTIRELY composed of "low quality" words, and the search
engine gives somewhat random results.

Some ideas to address this:

Idea 1: Do nothing to IDF and just deploy the content for OTHER products
ASAP; the IDF logic will automatically fix the problem.

With content only about widgets, 100% of the docs have that score.

But if you deploy content for 9 more products, giving a total of 10
products, then only 1/10th of them will tend to have the word "widget".  But
common English words will continue, on average, to appear in almost all
documents, across all product areas.  Therefore they will continue to get
low scores.  So relative to the words "i" and "can", the word "widget" will
automatically rise up out of the IDF muck.

And when content for all 100 of your products are deployed, "widget" will
again have been boosted even further, compared to other common English

Though long term I have faith in this strategy, suppose I only had content
ready for two other products, say "cogs" and "hookas", so the term "widget"
will get a bit of a boost, but not by as much as I'd like.

Idea 2: Compare terms in product "slices" and artificially boost them.

In the Widget content there's references to related accessories like the
wBracket and wCase.

Whereas in the Hooka content, I rarely see mention of Widgets or wCases, but
I do still see the English terms "i" and "can", and lots of stuff about

2a: So I do a histogram for each of the 3 products, and look for terms that
are common in one but not in the others.

2b: Or I do a histogram of all terms, and then a histogram of terms for each
of the 3 product slices.  And I compare those lists.

Of course we don't want to go overboard.  If somebody is in the Widget forum
and asks "how do I change the widget battery?" then, although the term
Widget is more important than "how" or "do", it is NOT as important as
"battery" or "change", so I would not want to boost "widget" too much.

BTW, Solr facets makes this term info relatively easy to gather, thanks for
the pointer Yonik!  1.4 also speeds up Facets.

Idea 3: (possibly in conjunction with 2) Compare terms by product slice, and
do "complementary" IDF boosting.

In other words, it's very common for content on the Widget forum to have the
terms "widget", "wBracket" and "wcase".

But if a document in the "Cog" forum mentions "widget", that's actually
rather interesting.  Mentions of "widgets" in the Cog and Hooka content
should be treated as "important", regardless of how common "widget" is in
the Widget forum.

So we compare the common terms in the Widget, Hooka and Cog areas.  Terms
that are specific to Widgets get a bigger boost in the Cog and Hooka areas.
Cog terms get bigger boosts in Hooka and Widget areas, and Hooka terms get
extra credit in Cogs and Widgets.

This extra boosting could be done in addition to the minor boosting widget
terms get in the Widget area, just to keep them above the common English

Implementation of ideas 2 and 3:

This "diffing" of IDF by product slices would seem to be a good idea, though
a bit of coding to implement.  I'd think somebody would have done this
already looked into this more formally?

In terms of implementation, you could do an arbitrary cutoff at 100 terms
for each product area, and then just do a Boolean for whether it appears in
the other lists or not.  But after running some tests, common terms are
likely to mentioned in many lists, and it's more a question of where on the
list they were.  For example, if you can interface Cogs and Widgets, then
Widget content will occasionally mention Cogs, and vice versa.

To me this seems like more of a Bayesian or Contingency Table issue, with
some weight given to the actual number of occurrences or the relative
position on lists.  Any of you Brainiacs have specific insights?

There's also a question of scaling the Histogram.  If I compare
term-in-document counts in the Widget area to the entire corpus, then since
there are fewer documents in the Widget area (it's a subset of the total),
then by definition the number of documents with the word "i" in that area
will be lower.  You'd think the relative order wouldn't be affected, on
average, but any calculation using the occurrence counts would need to take
that into account.  In other words, if the word "can" appears more times in
the overall documents than it does in the Widget documents, well is that
just because there are more documents in the overall set, or because "can"
is really more dense.  Simple scaling can probably handle this issue...

Even if I compare Widget to Cog documents, both of which are subsets of the
total, there's likely to be a different number of documents in each area,
and therefore the counts would be different.  So I'll get differences in
"how" and "can", in addition to "widget" and "cog", just because the sample
sizes are different.  Again, scaling or limiting the sample size might fix
that too.  Presumably the ranking of the words "i" and "can", which one is
more popular than the other, would stay the same regardless of sample size
and absolute counts, if I were looking at random English text.  If those
terms change position, then that would be more interesting.

However, I think randomness also plays a role here.  If I do a histogram of
100 Widget documents and only 5 Cog documents, and I notice that the
relative positions of "i" and "can" have changed places, I might be
skeptical that it's a random fluctuation because I only had 5 Cog
documents.  If I added 50 more, would the difference still be there?

Sample size it certainly a well established topic in statistics, though I
haven't see it applied to histograms.

And another odd thing about histograms taken from different sample sizes are
the low frequency terms.  In all search engine histograms you'll see terms
that appear only 1 time, and there'll be a bunch of them.  But presumably if
I doubled the number of documents in the sample, some of those terms might
then appear twice, whereas others would still only appear once.  There are
other breaks in histogram counts, like when it does from 2 to 1, that would
presumably also shift around a bit.  But I'm not sure you can reliably use
sample size ration to scale these things with confidence.  This feels like
Poisson distribution territory...

I suspect all of these questions have also been examined in detail before; I
need to dig up my copy of that "Long Tail" book...  Wondering if anybody has
specific links or verbiage / author names to suggest?

There are certainly other ideas for indirectly addressing IDF problem by
using other methods to compensate:

4: You could overlay user reviews / popularity or click-through rates or
other social factors

5: Use flexible proximity searches, in addition to phrase matching, to boost
certain matches.  Verity used to call this the "near" operator, Lucene/Solr
talk about span queries and "phrase slop".

So in a question containing "how can I", documents with that exact phrase
get a substantial boost.  But documents where "how", "can" and "I" appear
within 5 words of each other in any order get also get a small boost.  It's
been suggested that this can help with Q&A type applications.

6: FAST's doc mentions that verbs can be more important in question and
answer type applications.  Normally nouns are said to contain 60-70% of the
relevance of a query (reference?), but their linguistics guide suggests
verbs are more important in Q and A apps.

7: Another idea would be to boost word pairs using composite N-Gram tokens,
what Lucene and Solr call "shingles".

Using my previous examples:
Question: "How can I get a widget ?"
    Normal index: "how", "can", "i", "get", "widget"
    Shingles: "how_can", "can_i", "i_get", "get_widget"
Question: "How do I change the widget battery?"
    Normal index: "how", "do", "i", "change", "widget", "battery"
    Shingles: "how_do", "do_i", "i_change", "change_widget",

And both sets of tokens would be added to the index, in different fields,
and with possibly different boosts.

The idea being that tokens like "can_i" might be somewhat more statistically
significant than "can" and "i" by themselves.  So at least in the first
question, which has all low quality words, the "can_i" might help a bit.

I'd appreciate any comments on these ideas from y'all, or perhaps names of
specific algorithms / authors.

Mark Bennett / New Idea Engineering, Inc. /
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message