lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Re: DocumentsWriter.checkMaxTermLength issues
Date Mon, 31 Dec 2007 17:25:33 GMT

On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:

> On Dec 31, 2007 11:59 AM, Grant Ingersoll <> wrote:
>> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
>>> I meant (1)... it leaves the core smaller.
>>> I don't see any reason to have logic to truncate or discard tokens in
>>> the core indexing code (except to handle tokens >16k as an error
>>> condition).
>> I would agree here, with the exception that I want the option for it
>> to be treated as an error.
> That should also be possible via an analyzer component throwing an
> exception.

Sure, but I mean the >16K case (in other words, the case where
DocsWriter fails, which presumably only DocsWriter knows about).
I want the option to ignore tokens larger than that instead of
failing/throwing an exception.  Imagine I am charged with indexing
some data I know nothing about (e.g. computer forensics): my goal
would be to index as much as possible in a first raw pass, so that I
can then begin to explore the dataset.  Having it completely discard
the document is not a good thing, but throwing away some large binary
tokens would be acceptable (especially if I get warnings about said
tokens) and robust.
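For what it's worth, the "drop oversized tokens, warn, keep indexing" policy can live entirely in the analyzer chain, as Yonik suggests (Lucene's own LengthFilter is the in-tree analog for length-based filtering).  A minimal standalone sketch of the idea, in plain Java rather than Lucene's actual TokenFilter API -- the class and method names here are hypothetical, just to illustrate the policy:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not Lucene's API): pass tokens through,
// dropping any whose UTF-8 encoding exceeds maxBytes, and record a
// warning for each dropped token instead of failing the document.
public class OversizeTokenDropper {
    public static List<String> filter(List<String> tokens, int maxBytes,
                                      List<String> warnings) {
        List<String> kept = new ArrayList<>();
        for (String tok : tokens) {
            int len = tok.getBytes(StandardCharsets.UTF_8).length;
            if (len > maxBytes) {
                // Surface a warning rather than throwing, so one huge
                // binary token doesn't discard the whole document.
                warnings.add("dropped token of " + len + " bytes");
            } else {
                kept.add(tok);
            }
        }
        return kept;
    }
}
```

With a filter like this ahead of the writer, DocsWriter would never see a >16K term at all, and the core could treat any that slip through as a hard error, which seems to be the split both sides of the thread want.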

