lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject RE: Highest frequency terms for a subset of documents
Date Wed, 20 Apr 2011 23:11:56 GMT
I think faceting is probably the best way to do that, indeed. It might be slow, but it's
designed for exactly this case, and I can't imagine any other technique being faster:
whichever approach you take, the same information has to be looked up.

BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc. It works a LOT
better for very high arity fields (lots and lots of unique values) like yours. I bet you'll
see a significant speed-up with facet.method=fc, hopefully enough to be workable.
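
As a rough illustration of the change suggested above, here is a minimal Python sketch that
builds the same facet request with facet.method=fc instead of enum. The parameter values are
taken from the original message; the base URL (http://localhost:8983/solr/select) and the
helper name build_facet_url are assumptions made for the example, not something from this
thread.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, subset_query, field, method="fc",
                    limit=500, mincount=3):
    """Build a Solr select URL that facets on `field` for a document subset.

    `base_url` and this helper are hypothetical; the parameters mirror
    the ones posted in the original question, with facet.method swapped.
    """
    params = [
        ("q", subset_query),         # query that defines the subset
        ("rows", 0),                 # only facet counts are needed, no docs
        ("facet", "true"),
        ("facet.field", field),
        ("facet.method", method),    # "fc" rather than "enum"
        ("facet.sort", "count"),     # most frequent terms first
        ("facet.limit", limit),
        ("facet.mincount", mincount),
        ("wt", "xml"),
    ]
    return base_url + "?" + urlencode(params)


# Assumed local Solr endpoint; adjust to your own host and core.
url = build_facet_url("http://localhost:8983/solr/select", "in_subset:1", "text")
print(url)
```

Only the facet.method value differs from the posted request, so the two runs can be timed
against each other directly.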


With facet.method=enum, I would indeed have predicted it would be horribly slow. Before Solr
1.4, when facet.method=fc became available, it was nearly impossible to facet on very high
arity fields; facet.method=fc is the magic. I think facet.method=fc is even the default in
Solr 1.4+, unless you explicitly set it to enum instead!

Jonathan
________________________________________
From: Ofer Fort [oferiko@gmail.com]
Sent: Wednesday, April 20, 2011 6:49 PM
To: solr-user@lucene.apache.org
Subject: Highest frequency terms for a subset of documents
Hi,
I am looking for the best way to find the terms with the highest frequency
for a given subset of documents (terms in the text field).
My first thought was to do a count facet search, where the query defines
the subset of documents and the facet.field is the text field. This gives me
the result, but it is very, very slow.
These are my params:
<lst name="params">
<str name="facet">true</str>
<str name="facet.offset">0</str>
<str name="facet.mincount">3</str>
<str name="indent">on</str>
<str name="facet.limit">500</str>
<str name="facet.method">enum</str>
<str name="wt">xml</str>
<str name="rows">0</str>
<str name="version">2.2</str>
<str name="facet.sort">count</str>
<str name="q">in_subset:1</str>
<str name="facet.field">text</str>
</lst>

The index contains 7M documents, the subset is about 200K. A simple query
for the subset takes around 100ms, but the facet search takes 40s.

Am I doing something wrong?

If facet search is not the correct approach, I thought about using something
like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how to do this
in Solr. Should I implement a request handler that executes this kind of
code?

Thanks for any help.
