lucene-solr-user mailing list archives

From Grant Ingersoll <>
Subject Re: Facets with an IDF concept
Date Wed, 24 Jun 2009 04:10:16 GMT

On Jun 23, 2009, at 6:23 PM, Chris Hostetter wrote:

> : Regardless of the semantics, it doesn't sound like DF would give you what
> : you want.  It could be entirely possible that in some short timespan the
> : number of docs on Iran could match up w/ the number on Obama (maybe not
> : for that particular example) in which case your "hot" item would no
> : longer appear hot.
>
> but if the numbers match up in that timespan then the "hot" item isn't as
> "hot" anymore.

Not necessarily true.  Consider the case where over the year there are  
50 stories about Obama.  Then, in the span of 5 days, there are 50  
stories about Iran.  Iran, in my view, is still hotter than Obama.  In  
Asif's case, he was suggesting comparing against the global DF.
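
To put rough numbers on that, here is a minimal sketch of the comparison as publication rates. The 365-day year and the variable names are assumptions made for illustration; only the story counts and the 5-day span come from the example above.

```python
# Stories per day in each window. Counts are from the example above;
# treating "the year" as 365 days is an assumption.
obama_rate = 50 / 365  # 50 stories spread over a year
iran_rate = 50 / 5     # 50 stories in a span of 5 days

print(f"Obama: {obama_rate:.2f} stories/day")
print(f"Iran:  {iran_rate:.2f} stories/day")
print(f"Iran is running about {iran_rate / obama_rate:.0f}x hotter")
```

Equal totals, very different rates, which is why a global DF comparison alone would hide the spike.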

Not to worry, though, your proposal is much the same as mine, namely  
take a baseline based on some set of docs (I chose *:*, you chose past  
month) and then compare.

> Maybe i'm misunderstanding: but it sounds like Asif's question essentially
> boils down to getting facet constraints sorted after using some
> normalizing fraction ... the simplest case being the inverse ratio (this
> is where i think Asif is comparing it to IDF) of the number of matches for
> that facet in some larger docset to the size of the docset -- typically
> that docset could be the entire index, but it could also be the same
> search over a large window of time.
>
> So if i was doing a news search for all docs in the last 24 hours, I could
> multiply each of those facet counts by the inverse of the ratio of the
> corresponding counts from the past month to the number of articles from
> the past month to see how much "hotter" they are in my smaller result set...
> current result set facet counts (X)...
>  News:1100
>  Obama:1000
>  Iran:800
>  Miley Cyrus:700
>  iPod:500
> facet counts from the past month (Y), during which time 9000 (Z)
> documents were published...
>  News:9000
>  Obama:7000
>  Iran:1000
>  Miley Cyrus:4000
>  iPod:5000
> X*(Z/Y)...
>  Iran:7200
>  Miley Cyrus:1575
>  Obama:1285.7
>  News:1100
>  iPod:900
> Doing this in a Solr plugin would be the best way to do this -- because
> otherwise your "hot" terms might not even show up in the facet lists.
> any attempt to do it on the client would just be an approximation, and
> could easily miss the "hottest" item if it was just below the cutoff for
> the number of constraints to be returned.
> -Hoss
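
For reference, the X*(Z/Y) arithmetic quoted above can be sketched in a few lines of Python. This is only an illustration of the normalization, not a Solr plugin: the function name `hotness` and the choice to skip terms that have no baseline count are assumptions, and the counts are the ones from the quoted example.

```python
def hotness(current, baseline, baseline_total):
    """Scale each current facet count X by Z / Y and sort descending.

    current: facet counts in the small result set (X)
    baseline: facet counts in the larger docset (Y)
    baseline_total: number of docs in the larger docset (Z)
    """
    scores = {}
    for term, x in current.items():
        y = baseline.get(term, 0)
        if y == 0:
            # Assumption: a term with no baseline count is skipped
            # rather than treated as infinitely hot.
            continue
        scores[term] = x * baseline_total / y
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Counts from the quoted example: last 24 hours (X) vs. past month (Y).
current = {"News": 1100, "Obama": 1000, "Iran": 800,
           "Miley Cyrus": 700, "iPod": 500}
baseline = {"News": 9000, "Obama": 7000, "Iran": 1000,
            "Miley Cyrus": 4000, "iPod": 5000}
Z = 9000  # documents published in the past month

ranked = hotness(current, baseline, Z)
for term, score in ranked:
    print(f"{term}: {score:.1f}")
```

Run on the full counts, this gives the same ordering as the quoted list (Iran, Miley Cyrus, Obama, News, iPod). Run on the client against a truncated facet list, it can only rank the terms that made the cutoff, which is the limitation Hoss describes.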

Grant Ingersoll
