lucene-solr-user mailing list archives

From Mikhail Khludnev <mkhlud...@griddynamics.com>
Subject Re: relative token count in a query result
Date Wed, 21 Nov 2012 05:46:10 GMT
Hello,

Have you tried implementing your own Collector and passing it to
IndexSearcher.search()? A Collector holds a reference to the current scorer, and
can therefore presumably read the tf from the TermQuery's scorer via
org.apache.lucene.search.TermScorer.freq(). The collector can then simply sum
these tfs.

Be aware of a small problem when doing the same for a query with a few
disjunction clauses.
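
To make the target numbers from the quoted question concrete, here is a
minimal plain-Java sketch of the summation. It does not use Lucene: DocStats,
Totals and sumOver are hypothetical names made up for illustration, and the
per-document values are hard-coded from the example in the question (document
C is assumed to contain no occurrence of "Haus"). In a real collector, the
per-document term frequency would presumably come from the scorer's freq()
inside collect(), and the token count from whatever field stores it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class RelativeTermFrequency {

    /** Per-document figures: occurrences of the query term and total running tokens. */
    static final class DocStats {
        final int termFreq;
        final int tokenCount;

        DocStats(int termFreq, int tokenCount) {
            this.termFreq = termFreq;
            this.tokenCount = tokenCount;
        }
    }

    /** Running totals, updated once per document (the role collect() would play). */
    static final class Totals {
        long termFreqSum;
        long tokenSum;

        void collect(DocStats doc) {
            termFreqSum += doc.termFreq;
            tokenSum += doc.tokenCount;
        }

        /** Token-based relative frequency; 0.0 for an empty document set. */
        double relativeFrequency() {
            return tokenSum == 0 ? 0.0 : (double) termFreqSum / tokenSum;
        }
    }

    /** Sums term frequencies and token counts over all documents of a subcollection. */
    static Totals sumOver(List<DocStats> docs) {
        Totals totals = new Totals();
        for (DocStats doc : docs) {
            totals.collect(doc);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<DocStats> subcollection = Arrays.asList(
                new DocStats(2, 300),  // doc A: "Haus" 2 times, 300 tokens
                new DocStats(5, 100),  // doc B: "Haus" 5 times, 100 tokens
                new DocStats(0, 50));  // doc C: 50 tokens, assumed not to contain "Haus"
        Totals totals = sumOver(subcollection);
        // Prints: hits=7 tokens=450 relative=0.0156
        System.out.println(String.format(Locale.ROOT,
                "hits=%d tokens=%d relative=%.4f",
                totals.termFreqSum, totals.tokenSum, totals.relativeFrequency()));
    }
}
```

The two sums are kept separate so that the hit count (7) and the token total
(450) remain available on their own; the relative frequency is only their
ratio.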


On Tue, Nov 20, 2012 at 11:55 PM, tech.vronk <tech@vronk.net> wrote:

> Hello,
>
> earlier, I was trying to retrieve the total token count per index
> <http://lucene.472066.n3.nabble.com/how-to-retrieve-total-token-count-per-collection-index-td4000161.html>.
>
> now, I would like to have a token (word) count within the document set
> (resulting from a query),
> both for the matching word and as the sum of all tokens of the matching
> documents.
>
> The ultimate goal is to be able to compute relative frequencies of terms
> on a per-token basis instead of a per-article basis.
>
> so if I search for the word "Haus" within a subcollection (defined by a
> separate query) and the word appears in a matching doc A 2 times and doc B
> 5 times, I need 7 as the hit count, not 2.
>
> + if the subcollection contains documents
> A with 300 tokens (i.e. running words, not different terms)
> B with 100 tokens
> C with 50 tokens
>
> I also need this second sum, i.e. 450.
>
> I plan to get the second number by first
> preprocessing the documents, counting the tokens,
> storing the number in a separate field,
> then applying the StatsComponent,
> which will deliver the sum for the given query/subcollection.
>
> for the first number, i could use the termfreq() function,
> but that gives me only the term frequency per document.
>
> So, before I iterate over the whole result, to sum it,
> I wonder, if the statsComponent would be able to perform the counting also
> over a dynamic field (the result of the function).
> I tried this:
> /solr/select/?fq=docsrc:falter&q={!func}tf(inhalt,'haus')&stats=true&stats.field=score&rows=10&indent=true&fl=score&debugQuery=true
>
> but got the error:
> <str name="msg">Field type text_de{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100}}
> is not currently supported</str>
>
> Or is there any other way?
>
> If I understand it correctly, none of tf(), idf(), sttf() would be of
> any help here either.
>
> Thanks in advance
>
> best,
> matej
>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>
