lucene-dev mailing list archives

From "Doron Cohen" <cdor...@gmail.com>
Subject Re: DocumentsWriter.checkMaxTermLength issues
Date Mon, 31 Dec 2007 16:37:56 GMT
On Dec 31, 2007 6:10 PM, Yonik Seeley <yonik@apache.org> wrote:

> On Dec 31, 2007 5:53 AM, Michael McCandless <lucene@mikemccandless.com>
> wrote:
> > Doron Cohen <cdoronc@gmail.com> wrote:
> > > I like the approach of configuration of this behavior in Analysis
> > > (and so IndexWriter can throw an exception on such errors).
> > >
> > > It seems that this should be a property of Analyzer vs.
> > > just StandardAnalyzer, right?
> > >
> > > It can probably be a "policy" property, with two parameters:
> > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > generating too long tokens.
> >
> > Agreed, this should be generic/shared to all analyzers.
> >
> > But maybe for 2.3, we just truncate any too-long term to the max
> > allowed size, and then after 2.3 we make this a settable "policy"?
>
> But we already have a nice component model for analyzers...
> why not just encapsulate truncation/discarding in a TokenFilter?


Makes sense, especially for the implementation aspect.
I'm not sure what API you have in mind:

(1) leave it to applications to append such a
    TokenFilter to their Analyzer (== no change),

(2) have DocumentsWriter create such a TokenFilter
     under the covers, to force behavior that is defined (where?), or

(3) have an IndexingTokenFilter assigned to IndexWriter,
     make the default filter trim/ignore/whatever as discussed,
     and then let applications set a different IndexingTokenFilter
     to change the default behavior?

I think I like the 3rd option - is this what you meant?
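As a rough sketch of what the trim/ignore policies discussed above might look like as token filters: the classes and interfaces below are simplified stand-ins, not Lucene's actual TokenStream/TokenFilter API (which at the time returned Token objects from next()), so treat this as an illustration of the chaining idea only.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for Lucene's TokenStream;
// terms are plain Strings here, and null signals end of stream.
interface TokenStream {
    String next();
}

// Source stream backed by a list of terms, for demonstration.
class ListTokenStream implements TokenStream {
    private final Iterator<String> it;
    ListTokenStream(List<String> terms) { this.it = terms.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// "Chop" policy: truncate any term longer than maxLength.
class TruncatingTokenFilter implements TokenStream {
    private final TokenStream input;
    private final int maxLength;
    TruncatingTokenFilter(TokenStream input, int maxLength) {
        this.input = input;
        this.maxLength = maxLength;
    }
    public String next() {
        String term = input.next();
        if (term != null && term.length() > maxLength) {
            return term.substring(0, maxLength); // chop to the limit
        }
        return term;
    }
}

// "Ignore" policy: silently drop any term longer than maxLength.
class DroppingTokenFilter implements TokenStream {
    private final TokenStream input;
    private final int maxLength;
    DroppingTokenFilter(TokenStream input, int maxLength) {
        this.input = input;
        this.maxLength = maxLength;
    }
    public String next() {
        String term;
        while ((term = input.next()) != null) {
            if (term.length() <= maxLength) {
                return term;
            }
            // over-long term: skip it and pull the next one upstream
        }
        return null;
    }
}
```

Under option (3), IndexWriter would wrap whatever stream the Analyzer produces in a default filter like one of these, and applications could swap in their own to change the policy.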

Doron
