lucene-dev mailing list archives

From "Mike Sokolov (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5620) LowerCaseFilter.preserveOriginal
Date Sat, 19 Apr 2014 18:51:14 GMT


Mike Sokolov commented on LUCENE-5620:

bq. doing this selectively (only adding additional terms in some cases) is pretty complicated
if you don't want to screw over length normalization

Interesting point, although it's debatable how strong the effect is. I guess it depends on
how many tokens are affected by the filter chain, and whether that number varies significantly
from document to document. I tend to think that the number of capitalized words, say,
will be similar from document to document, but of course there will be exceptions in some
data sets.

It makes me wonder whether length normalization shouldn't use max position instead of term
count when it is available.
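
To make the concern concrete, here is a minimal sketch in plain Java. It uses no Lucene classes: `Token`, `termCount`, `maxPosition`, `norm`, and `example` are hypothetical names for illustration only. It assumes a filter that stacks each original token on its lowercased form with a position increment of 0, and it assumes a simple 1/sqrt(length) normalization, which may not be what a given similarity actually computes.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch in plain Java; none of these names are Lucene APIs.
final class LengthNormSketch {

    // A token with its position increment; 0 means "same position as the previous token".
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Field length measured as the number of emitted terms.
    static int termCount(List<Token> tokens) {
        return tokens.size();
    }

    // Field length measured as the last position reached (counting from 1),
    // which is the sum of the position increments.
    static int maxPosition(List<Token> tokens) {
        int position = 0;
        for (Token t : tokens) {
            position += t.positionIncrement;
        }
        return position;
    }

    // Assumed normalization: 1 / sqrt(length).
    static double norm(int length) {
        return 1.0 / Math.sqrt(length);
    }

    // "Lucene In Action", with each original stacked on its lowercased form.
    static List<Token> example() {
        return Arrays.asList(
            new Token("lucene", 1), new Token("Lucene", 0),
            new Token("in", 1), new Token("In", 0),
            new Token("action", 1), new Token("Action", 0));
    }

    public static void main(String[] args) {
        List<Token> tokens = example();
        System.out.println("term count:   " + termCount(tokens) + ", norm " + norm(termCount(tokens)));
        System.out.println("max position: " + maxPosition(tokens) + ", norm " + norm(maxPosition(tokens)));
    }
}
```

In this example the term count is 6 while the max position is 3, so a norm based on term count treats the field as twice as long as its all-lowercase equivalent, whereas a norm based on max position does not.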

> LowerCaseFilter.preserveOriginal
> --------------------------------
>                 Key: LUCENE-5620
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>         Attachments: LUCENE-5620.patch
> Following closely the model of LUCENE-5437 (which worked on ASCIIFoldingFilter), this
> patch adds the ability to preserve the original token to LowerCaseFilter.  This is useful
> if you want an all-lowercase search term to match without regard to case, while search terms
> with uppercase letters match in a case-sensitive manner.
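
The matching behavior described in the issue summary can be sketched roughly as follows. This is plain Java, not the code from LUCENE-5620.patch: `analyze` and `matches` are hypothetical helpers, tokenization is naive whitespace splitting, and the order in which the lowercased and original terms are emitted is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch in plain Java; this is not the code from LUCENE-5620.patch.
final class PreserveOriginalSketch {

    // Whitespace-tokenize, then emit the lowercased term and, when it differs,
    // the original term as well. In a real filter the original would share the
    // lowercased term's position; the emission order here is an assumption.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String lower = word.toLowerCase(Locale.ROOT);
            terms.add(lower);
            if (!lower.equals(word)) {
                terms.add(word);
            }
        }
        return terms;
    }

    // The query term is looked up as typed, with no lowercasing at query time.
    static boolean matches(String documentText, String queryTerm) {
        return analyze(documentText).contains(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Apple pie"));          // [apple, Apple, pie]
        System.out.println(matches("Apple pie", "apple")); // true
        System.out.println(matches("apple pie", "Apple")); // false
    }
}
```

An all-lowercase query such as "apple" matches both "Apple pie" and "apple pie", while the query "Apple" matches only the document that contains the capitalized form.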
