lucene-solr-user mailing list archives

From Grant Ingersoll
Subject Re: capturing field length into a stored document field
Date Fri, 04 Sep 2009 22:27:23 GMT
Similarity.lengthNorm() is a callback from Lucene that gives you the
information you seek. Of course, the trick is still how to use it.
Perhaps you can describe a bit more about why you need that?
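
To make the link between lengthNorm() and the "norms" in the quoted message concrete, here is a rough sketch of the arithmetic. It assumes the default length norm of 1/sqrt(numTokens) and that the index-time boost is multiplied into the stored norm. It does not use any Lucene classes; the class and method names are made up for illustration, and the single-byte encoding of norms (the resolution problem) is not modelled.

```java
public class NormLengthSketch {
    // Assumed default length norm: 1 / sqrt(numTokens).
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Assumption: the stored norm is the index-time boost times the length norm.
    static float norm(float boost, int numTokens) {
        return boost * lengthNorm(numTokens);
    }

    // Inverting the norm recovers the length only if the boost is known.
    static long estimateLength(float norm, float assumedBoost) {
        double ratio = (double) assumedBoost / norm;
        return Math.round(ratio * ratio);
    }

    public static void main(String[] args) {
        // A 400-token field without a boost inverts cleanly.
        System.out.println(estimateLength(norm(1.0f, 400), 1.0f));
        // The same field with a boost of 2.0 looks four times shorter
        // if the boost is not accounted for.
        System.out.println(estimateLength(norm(2.0f, 400), 1.0f));
    }
}
```

Under these assumptions an unknown boost is indistinguishable from a shorter field, which is why the token count handed to lengthNorm() is a better source than the norm itself.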

On Sep 4, 2009, at 11:34 AM, mike.schultz wrote:

> For various statistics I collect from an index, it's important for me to
> know the length (measured in tokens) of a document field. I can get that
> information to some degree from the "norms" for the field, but a) the
> resolution isn't that great, and b) more importantly, if boosts are used
> it's almost impossible to get lengths from this.
> Here are two ideas I was thinking about that maybe someone can comment on.
> 1) Use copyField to copy the field in question, fieldA, to an additional
> field, fieldALength, which has an extra filter that just counts the tokens
> and only outputs a token representing the length of the field. This has
> the disadvantage of retokenizing basically the whole document (because the
> field in question is basically the body). Plus I would think littering the
> term space with these tokens might be bad for performance, but I'm not sure.
> 2) Add a filter to the field in question which again counts the tokens.
> This filter allows the regular tokens to be indexed as usual but somehow
> manages to get the token count into a stored field of the document. This
> has the advantage of not having to retokenize the field, and instead of
> littering the term space, the count becomes docdata for each doc. Can this
> be done? Maybe using a ThreadLocal to temporarily store the count?
> Thanks.
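
The counting half of idea 2 can be sketched roughly as below. This is plain Java, not a Lucene TokenFilter: `CountingIterator` and `countTokens` are hypothetical names, and whitespace splitting stands in for whatever analyzer the field really uses. It only shows a pass-through wrapper that counts tokens as they are consumed.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CountingTokensSketch {
    /** Passes tokens through unchanged while counting them (hypothetical stand-in for a filter). */
    static class CountingIterator implements Iterator<String> {
        private final Iterator<String> input;
        private int count = 0;

        CountingIterator(Iterator<String> input) {
            this.input = input;
        }

        public boolean hasNext() {
            return input.hasNext();
        }

        public String next() {
            String token = input.next();
            count++; // count each token as it flows downstream
            return token;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }

        int getCount() {
            return count;
        }
    }

    /** Tokenizes on whitespace (an assumption) and returns the number of tokens seen. */
    static int countTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        List<String> tokens = Arrays.asList(trimmed.split("\\s+"));
        CountingIterator it = new CountingIterator(tokens.iterator());
        while (it.hasNext()) {
            it.next(); // stands in for the indexer consuming the stream
        }
        return it.getCount();
    }
}
```

The sketch covers the counting only. Getting the count into a stored field of the same document is the open part of the question and is not shown here.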

Grant Ingersoll
