lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: DocumentsWriter.checkMaxTermLength issues
Date Mon, 31 Dec 2007 17:25:33 GMT

On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:

> On Dec 31, 2007 11:59 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>
>> On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
>>> I meant (1)... it leaves the core smaller.
>>> I don't see any reason to have logic to truncate or discard tokens in
>>> the core indexing code (except to handle tokens >16k as an error
>>> condition).
>>
>> I would agree here, with the exception that I want the option for it
>> to be treated as an error.
>
> That should also be possible via an analyzer component throwing an  
> exception.
>

Sure, but I mean in the >16K case, i.e. the case where DocsWriter itself
fails (which presumably only DocsWriter knows about).  I want the option
to ignore tokens larger than that instead of failing/throwing an
exception.  Imagine I am charged with indexing some data I know nothing
about (e.g. computer forensics): my goal would be to index as much as
possible in a first raw pass, so that I can then begin to explore the
dataset.  Having the whole document discarded is not a good thing, but
throwing away some large binary tokens would be acceptable and robust
(especially if I get warnings about said tokens).
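
Roughly the kind of analyzer-side filter I mean -- just a sketch, the
class and parameter names are made up, and it's written against the
attribute-based TokenStream API: it either throws (the strict behavior
Yonik describes) or silently skips the oversized token so the rest of
the document still gets indexed.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Hypothetical filter that keeps oversized tokens out of the core
 * indexer.  Depending on configuration it either drops them silently
 * (the lenient first-pass case) or throws.
 */
public final class OversizedTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxTokenLength;        // e.g. whatever DocsWriter can handle
  private final boolean failOnOversized;   // true = strict, false = lenient

  public OversizedTokenFilter(TokenStream input, int maxTokenLength,
                              boolean failOnOversized) {
    super(input);
    this.maxTokenLength = maxTokenLength;
    this.failOnOversized = failOnOversized;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.length() <= maxTokenLength) {
        return true;                       // normal token, pass it through
      }
      if (failOnOversized) {
        throw new IOException("token longer than " + maxTokenLength + " chars");
      }
      // lenient mode: skip the oversized (likely binary) token and keep
      // going; a real version would probably log a warning here
    }
    return false;                          // underlying stream is exhausted
  }
}

Wiring something like that in front of the indexer, with maxTokenLength
set below the hard limit, would give me the lenient first pass without
the core having to know anything beyond the >16K error condition.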

-Grant


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

