lucene-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: DocumentsWriter.checkMaxTermLength issues
Date Mon, 31 Dec 2007 16:44:10 GMT
On Dec 31, 2007 11:37 AM, Doron Cohen <cdoronc@gmail.com> wrote:
>
> On Dec 31, 2007 6:10 PM, Yonik Seeley <yonik@apache.org> wrote:
>
> > On Dec 31, 2007 5:53 AM, Michael McCandless <lucene@mikemccandless.com>
> > wrote:
> > > Doron Cohen <cdoronc@gmail.com> wrote:
> > > > I like the approach of configuration of this behavior in Analysis
> > > > (and so IndexWriter can throw an exception on such errors).
> > > >
> > > > It seems that this should be a property of Analyzer vs.
> > > > just StandardAnalyzer, right?
> > > >
> > > > It can probably be a "policy" property with two parameters:
> > > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > > generating too-long tokens. (A rough sketch of this idea
> > > > follows the quoted thread below.)
> > >
> > > Agreed, this should be generic/shared to all analyzers.
> > >
> > > But maybe for 2.3, we just truncate any too-long term to the max
> > > allowed size, and then after 2.3 we make this a settable "policy"?
> >
> > But we already have a nice component model for analyzers...
> > why not just encapsulate truncation/discarding in a TokenFilter?
>
>
> Makes sense, especially for the implementation aspect.
> I'm not sure what API you have in mind:
>
> (1) leave that for applications, to append such a
>     TokenFilter to their Analyzer (== no change),
>
> (2) DocumentsWriter to create such a TokenFilter
>      under the covers, to force behavior that is defined (where?), or
>
> (3) have an IndexingTokenFilter assigned to IndexWriter,
>      make the default such filter trim/ignore/whatever as discussed,
>      and then let applications set a different IndexingTokenFilter
>      to change the default behavior?
>
> I think I like the 3rd option - is this what you meant?
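
(For concreteness, here is the rough sketch referenced above of what
such a "policy" property might look like. This is purely hypothetical
-- the class and constants below do not exist in Lucene -- and it is
written against the 2.3-era Token/TokenFilter API (termText() and
friends), ignoring position increments for brevity:)

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Hypothetical: pairs a maxLength with an action to take
    // when a token exceeds it.
    public class TermLengthPolicyFilter extends TokenFilter {
      public static final int TRUNCATE = 0; // chop to maxLength
      public static final int DISCARD  = 1; // silently drop
      public static final int RAISE    = 2; // fail loudly

      private final int maxLength;
      private final int action;

      public TermLengthPolicyFilter(TokenStream in, int maxLength,
                                    int action) {
        super(in);
        this.maxLength = maxLength;
        this.action = action;
      }

      public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
          if (t.termText().length() <= maxLength) {
            return t;
          }
          if (action == TRUNCATE) {
            t.setTermText(t.termText().substring(0, maxLength));
            return t;
          }
          if (action == DISCARD) {
            continue; // drop this token, fetch the next one
          }
          throw new IOException("token longer than " + maxLength
              + " chars");
        }
        return null;
      }
    }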

I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).

Most of the time you want to catch those large tokens early on in the
chain anyway (put the filter right after the tokenizer).  Doing it
later could cause exceptions or issues with other token filters that
might not be expecting huge tokens.
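
(To make (1) concrete: with the existing component model the
application just appends the filter right after the tokenizer in its
own Analyzer. A minimal sketch, using whitespace tokenization as a
stand-in for whatever chain the application really uses -- and if
memory serves, core's org.apache.lucene.analysis.LengthFilter already
implements the "discard" behavior:)

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LengthFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Example application-side analyzer; the name is made up.
    public class SafeAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        // Catch huge tokens immediately, before any other filter
        // sees them: keep terms of 1..255 chars, drop the rest.
        ts = new LengthFilter(ts, 1, 255);
        ts = new LowerCaseFilter(ts);
        return ts;
      }
    }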

-Yonik


