lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-5211) StopFilterFactory docs do not advertise/explain the "format" option
Date Tue, 17 Sep 2013 21:39:52 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated LUCENE-5211:
-----------------------------

    Attachment: LUCENE-5211.code.patch
                LUCENE-5211.stopfilecomments.patch

Two patches, split to make review easier...

* a patch that improves the StopFilterFactory javadocs to mention format, and also improves
the error handling of the format param (includes tests)
* a patch that updates all the snowball-formatted files with a comment pointing out the need
to use format="snowball" with those files.

FWIW: the second patch was generated by the following perl script...

{code}
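#!/usr/bin/perl -i -n
# -n loops over every input line; -i rewrites each file in place.

my $msg = q{NOTE: To use this file with StopFilterFactory, you must specify format="snowball"};
# Echo every line unchanged...
print $_;
# ...and append the note right after the snowball "This notice was added." line.
if (m/This notice was added\./) {
    print " |\n | $msg\n";
}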
{code}

Run as...
{{find . -name \*.txt | xargs grep -l "This notice was added" | xargs ~/tmp/lucene5211.note.in.snowballfiles.pl}}

                
> StopFilterFactory docs do not advertise/explain the "format" option
> -------------------------------------------------------------------
>
>                 Key: LUCENE-5211
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5211
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.2
>            Reporter: Hayden Muhl
>            Assignee: Hoss Man
>            Priority: Minor
>         Attachments: LUCENE-5211.code.patch, LUCENE-5211.stopfilecomments.patch
>
>
> StopFilterFactory supports a "format" option for controlling whether "getWordSet" or "getSnowballWordSet"
is used to parse the file, but this option is not advertised. People can be confused by
the example stopword files included in the releases (some of which are in the snowball
format w/ "|" comments), try to use them w/o explicitly specifying {{format="snowball"}},
and silently get useless stopwords (that include the "| comments" as literal portions of the
stopwords).
> We need to better document the use of "format" and consider updating all of the example
stopword files we ship that are in the snowball format with a note about the need to use {{format="snowball"}}
with those files.
> {panel:title=Initial Bug Report}
> The StopFilterFactory builds a CharArraySet directly from the raw lines of the supplied
words file. This causes a problem when using the stop word files supplied with the Solr/Lucene
distribution. In particular, the comments in those files get added to the CharArraySet. A
line like this...
> ceci           |  this
> Should result in the string "ceci" being added to the CharArraySet, but "ceci           |  this" is what actually gets added.
> Workaround: Remove all comments from stop word files you are using.
> Suggested fix: The StopFilterFactory should strip any comments, then strip trailing whitespace.
The stop word files supplied with the distribution should be edited to conform to the supported
comment format.
> {panel}
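
For anyone skimming the archive, here is a rough, self-contained Java sketch of the difference described above. It is NOT the Lucene code: the class and method names (StopwordFormatSketch, parseWordSet, parseSnowball) are hypothetical, and the parsing rules are assumptions modeled loosely on what the report describes (plain lines kept whole vs. "|" comments stripped).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Lucene's real parsing lives elsewhere
// and may differ in details such as comment handling and charsets.
final class StopwordFormatSketch {

    // Assumed "plain word set" behavior: each non-empty line becomes one
    // entry after trimming, so a trailing "| comment" stays in the entry.
    static List<String> parseWordSet(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                words.add(word);
            }
        }
        return words;
    }

    // Assumed "snowball" behavior: everything from the first '|' onward is
    // a comment; what remains is split on whitespace into individual words.
    static List<String> parseSnowball(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            int bar = line.indexOf('|');
            String content = (bar >= 0 ? line.substring(0, bar) : line).trim();
            if (content.isEmpty()) {
                continue; // blank line or comment-only line
            }
            for (String word : content.split("\\s+")) {
                words.add(word);
            }
        }
        return words;
    }
}
```

Fed the line from the report, parseWordSet keeps the whole line, comment included, as a single entry, while parseSnowball yields just "ceci". That is the silent failure mode you hit when a snowball-format file is used without {{format="snowball"}}.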

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

