lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: DocumentsWriter.checkMaxTermLength issues
Date Mon, 31 Dec 2007 17:54:38 GMT
I actually think indexing should try to be as robust as possible.  You
could test like crazy and never hit a massive term, then go into production
(say, ship your app to lots of your customers' computers) only to
suddenly see this exception.  In general it could be a long time before
you or your users "accidentally" see this.

So I'm thinking the default behavior in IndexWriter should be to skip
immense terms?

Then people can use a TokenFilter to change this behavior if they want.
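A minimal sketch of the TokenFilter idea: drop any token longer than the
limit instead of letting DocumentsWriter throw. The class and method names
below are illustrative stand-ins, not Lucene's actual API, and the 16383
limit is assumed from the ~16K cap discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: filters out terms longer than an assumed
// maximum length, so oversize terms are skipped rather than causing
// the whole document to fail.
public class OversizeTermFilter {
    // Assumed limit, matching the ~16K term-length cap discussed above.
    static final int MAX_TERM_LENGTH = 16383;

    // Return only the tokens short enough to index; a caller could log
    // the dropped ones to get the "warnings about said tokens" Grant asks for.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() <= MAX_TERM_LENGTH) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

In a real analyzer chain this logic would live in a TokenFilter wrapping the
tokenizer, so every field indexed through that analyzer gets the same
skip-oversize-terms behavior without touching IndexWriter.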


Yonik Seeley <> wrote:
> On Dec 31, 2007 12:25 PM, Grant Ingersoll <> wrote:
> > Sure, but I mean the >16K case (in other words, the case where
> > DocsWriter fails, which presumably only DocsWriter knows about).
> > I want the option to ignore tokens larger than that instead of
> > failing/throwing an exception.
> I think the issue here is what the default behavior for IndexWriter should be.
> If configuration is required because something other than the default
> is desired, then one could use a TokenFilter to change the behavior
> rather than changing options on IndexWriter.  Using a TokenFilter is
> much more flexible.
> > Imagine I am charged w/ indexing some data
> > that I don't know anything about (e.g. computer forensics); my goal
> > would be to index as much as possible in my first raw pass, so that I
> > can then begin to explore the dataset.  Having it completely discard
> > the document is not a good thing, but throwing away some large binary
> > tokens would be acceptable (especially if I get warnings about said
> > tokens) and robust.
> -Yonik
