lucene-dev mailing list archives

From "Yonik Seeley" <>
Subject Re: DocumentsWriter.checkMaxTermLength issues
Date Mon, 31 Dec 2007 17:47:02 GMT
On Dec 31, 2007 12:25 PM, Grant Ingersoll <> wrote:
> Sure, but I mean in the >16K (in other words, in the case where
> DocsWriter fails, which presumably only DocsWriter knows about) case.
> I want the option to ignore tokens larger than that instead of failing/
> throwing an exception.

I think the issue here is what the default behavior for IndexWriter should be.

If configuration is required because something other than the default
is desired, then one could use a TokenFilter to change the behavior
rather than changing options on IndexWriter.  Using a TokenFilter is
much more flexible.
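
A minimal sketch of the TokenFilter idea, for illustration only. To stay
self-contained it does not use Lucene's classes: SimpleTokenStream,
ListTokenStream and MaxLengthFilter are hypothetical stand-ins, and the
length limit is a plain constructor parameter rather than any real
IndexWriter setting. The filter skips over-long tokens instead of failing
and counts them so the caller can log a warning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a token stream; NOT Lucene's TokenStream. */
interface SimpleTokenStream {
    /** Returns the next token, or null when the stream is exhausted. */
    String next();
}

/** Wraps a list of strings as a token stream (illustration only). */
final class ListTokenStream implements SimpleTokenStream {
    private final Iterator<String> it;

    ListTokenStream(List<String> tokens) {
        this.it = tokens.iterator();
    }

    @Override
    public String next() {
        return it.hasNext() ? it.next() : null;
    }
}

/** Drops tokens longer than maxLength and counts how many were dropped. */
final class MaxLengthFilter implements SimpleTokenStream {
    private final SimpleTokenStream input;
    private final int maxLength; // assumed limit, chosen by the caller
    private int skipped = 0;

    MaxLengthFilter(SimpleTokenStream input, int maxLength) {
        this.input = input;
        this.maxLength = maxLength;
    }

    @Override
    public String next() {
        String token;
        // Keep pulling from the wrapped stream until a token fits.
        while ((token = input.next()) != null) {
            if (token.length() <= maxLength) {
                return token;
            }
            skipped++; // over-long token: skip it rather than throw
        }
        return null;
    }

    int skippedCount() {
        return skipped;
    }
}

final class MaxLengthFilterDemo {
    /** Reads a stream to the end and returns the tokens it produced. */
    static List<String> drain(SimpleTokenStream stream) {
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        MaxLengthFilter filter = new MaxLengthFilter(
                new ListTokenStream(Arrays.asList("ab", "abcdefgh", "cd")), 4);
        System.out.println(drain(filter));
        System.out.println("skipped: " + filter.skippedCount());
    }
}
```

Because the limit and the skip-versus-fail policy live in the filter, each
application can choose its own behavior in its analysis chain without any
new option on IndexWriter.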

> Imagine I am charged w/ indexing some data
> that I don't know anything about (i.e. computer forensics), my goal
> would be to index as much as possible in my first raw pass, so that I
> can then begin to explore the dataset.  Having it completely discard
> the document is not a good thing, but throwing away some large binary
> tokens would be acceptable (especially if I get warnings about said
> tokens) and robust.

