lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
Date Tue, 01 Dec 2009 21:13:20 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784432#action_12784432
] 

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. Some of the analyzers allow for null to be specified for the stop word list. Others require
an empty set/file/reader. Those deriving from StopawareAnalyzer allow null.
That is true - StopawareAnalyzer uses an empty set if you pass null.

bq. I'd like to see the ability to use null to follow through the rest of the analyzers.
Some of the analyzers are cluttered with stopword list processing.
The analyzers in this patch are a PoC rather than a complete list. Eventually all analyzers
with stopwords will extend StopawareAnalyzer, which is also the reason why we have this class.
This and some other issues aim to eventually have a consistent way of processing everything
related to stopwords. We will also remove all the setters and have Set<?>-only ctors for
consistency.
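
A minimal sketch of the constructor convention described above, assuming a simplified base class. StopAwareBase and ExampleAnalyzer are hypothetical stand-ins (they do not extend Lucene's Analyzer and are not the code from the patch); they only illustrate "null means empty set" and a single Set<?> ctor without setters.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the StopawareAnalyzer base class discussed above.
// It deliberately does not extend Lucene's Analyzer so the sketch stays self-contained.
abstract class StopAwareBase {
    private final Set<?> stopwords;

    // The only way to supply stopwords: a Set<?>. Passing null means "no stopwords".
    protected StopAwareBase(Set<?> stopwords) {
        if (stopwords == null) {
            this.stopwords = Collections.emptySet();
        } else {
            // Defensive copy, so later changes to the caller's set have no effect here.
            this.stopwords = Collections.unmodifiableSet(new HashSet<Object>(stopwords));
        }
    }

    // Read-only access; there is intentionally no setter.
    public final Set<?> getStopwordSet() {
        return stopwords;
    }
}

// Hypothetical concrete analyzer: final, with a single Set<?> ctor.
final class ExampleAnalyzer extends StopAwareBase {
    ExampleAnalyzer(Set<?> stopwords) {
        super(stopwords);
    }
}
```

With this shape, new ExampleAnalyzer(null) and new ExampleAnalyzer(Collections.emptySet()) behave the same, so callers do not need to know which analyzers tolerate null.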

bq. If not how about adding public static Set<?> getDefaultStopSet() to StopawareAnalyzer?
The problem is that it is static, and it should be static. That's why we define it in each
analyzer that uses stopwords. I would like to have it generalized, but this seems to be the
ideal solution. We could have something like a getDefaultStopSet(Class<? extends StopawareAnalyzer>),
but I like the expressiveness of getDefaultStopSet() way better.
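
To make the two options concrete, here is a rough sketch. All class names and word lists are hypothetical and purely illustrative (they are not Lucene's real defaults). Because static methods are not overridden by subclasses, each analyzer declares its own getDefaultStopSet(); the class-keyed lookup is shown as the alternative.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical analyzer; the word list is illustrative only.
final class GermanExample {
    // Holder idiom: the set is built lazily on first access.
    private static final class DefaultSetHolder {
        static final Set<?> DEFAULT = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList("der", "die", "das")));
    }

    public static Set<?> getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT;
    }
}

// A second hypothetical analyzer with its own static accessor.
final class FrenchExample {
    private static final class DefaultSetHolder {
        static final Set<?> DEFAULT = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList("le", "la", "les")));
    }

    public static Set<?> getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT;
    }
}

// The generalized alternative: one lookup keyed by analyzer class.
final class DefaultStopSets {
    private static final Map<Class<?>, Set<?>> REGISTRY = new HashMap<Class<?>, Set<?>>();

    static {
        REGISTRY.put(GermanExample.class, GermanExample.getDefaultStopSet());
        REGISTRY.put(FrenchExample.class, FrenchExample.getDefaultStopSet());
    }

    private DefaultStopSets() {}

    // Returns an empty set for classes that were never registered.
    static Set<?> getDefaultStopSet(Class<?> analyzerClass) {
        Set<?> set = REGISTRY.get(analyzerClass);
        if (set == null) {
            return Collections.emptySet();
        }
        return set;
    }
}
```

The call GermanExample.getDefaultStopSet() is checked at compile time, while DefaultStopSets.getDefaultStopSet(GermanExample.class) needs a registry that has to be kept in sync by hand.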

bq. How about splitting out the stop words to their own class? 
What do you mean by that? Can you elaborate?

bq. There are some TODOs in the code to make this or that private or final. If this is going
to wait for 3.1 shouldn't they change?
They should actually go away, but I kept them in there because they are somewhat unrelated to
this particular issue. Once this is in we will work on removing the deprecated stuff and making
analyzers final (at least in contrib).

bq. In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's
up with that? Should anyone care what the particular implementation is?
That is one thing I hate about WordListLoader. +1 towards Uwe working on them!
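
For illustration, a loader that exposes only the interface type could look roughly like this. WordListSketch is a hypothetical class, not the real WordListLoader API, and the file format (one word per line, '#' starting a comment line) is an assumption made for the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader; the real WordListLoader signatures may differ.
final class WordListSketch {
    private WordListSketch() {}

    // Declares Set<String>, so callers never depend on the HashSet used internally.
    // Assumed format: one word per line; blank lines and lines starting with '#' are skipped.
    // The caller remains responsible for closing the reader.
    static Set<String> getWordSet(Reader reader) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader buffered = new BufferedReader(reader);
        String line;
        while ((line = buffered.readLine()) != null) {
            String word = line.trim();
            if (word.length() > 0 && !word.startsWith("#")) {
                words.add(word);
            }
        }
        return Collections.unmodifiableSet(words);
    }
}
```

Returning the interface leaves room to swap the implementation later without breaking callers.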

bq. I'm trying to figure out a way to specify a tokenizer/filter chain. (I've been trying
to figure it out for a while, but not with much effort or success).
This has been discussed already, but without much success. I cannot remember the issue
(Robert, can you remember the factory issue?) but it was basically based on a factory
pattern. This would also be my approach to it. That way we could get rid of almost every analyzer.
I use such a pattern myself, which works quite well.
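
A toy sketch of what a factory-based chain could look like. Everything here is hypothetical: the interfaces are invented for illustration, and tokens are modelled as a java.util.stream.Stream<String> rather than Lucene TokenStreams. It is not the design from the issue mentioned above, only the general shape of the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical factory interfaces: each call creates a fresh stage of the chain.
interface TokenizerFactory {
    Stream<String> create(String text);
}

interface TokenFilterFactory {
    Stream<String> create(Stream<String> input);
}

// One generic analyzer configured with factories instead of one class per language.
final class ChainAnalyzer {
    private final TokenizerFactory tokenizer;
    private final List<TokenFilterFactory> filters;

    ChainAnalyzer(TokenizerFactory tokenizer, List<TokenFilterFactory> filters) {
        this.tokenizer = tokenizer;
        this.filters = new ArrayList<TokenFilterFactory>(filters);
    }

    // Builds the chain for this text and drains it into a list.
    List<String> analyze(String text) {
        Stream<String> tokens = tokenizer.create(text);
        for (TokenFilterFactory filter : filters) {
            tokens = filter.create(tokens);
        }
        return tokens.collect(Collectors.toList());
    }
}

// A few illustrative factories.
final class Chains {
    private Chains() {}

    // Splits on runs of whitespace and drops empty tokens.
    static TokenizerFactory whitespace() {
        return text -> Arrays.stream(text.trim().split("\\s+")).filter(t -> !t.isEmpty());
    }

    static TokenFilterFactory lowerCase() {
        return input -> input.map(t -> t.toLowerCase(Locale.ROOT));
    }

    static TokenFilterFactory stop(Set<?> stopwords) {
        return input -> input.filter(t -> !stopwords.contains(t));
    }
}
```

A chain is then assembled by listing factories, for example new ChainAnalyzer(Chains.whitespace(), Arrays.asList(Chains.lowerCase(), Chains.stop(stopSet))), where stopSet is whatever set the caller supplies.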

bq. DM, I think we can have both? A method to get the default stopword list, but then they
also happen to be in text files too?
+1 for having those words in files. Nevertheless, we will still have a default stopword list.

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch,
LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses need to implement
at least one of the methods returning a tokenStream. When you look at the code it appears
to be almost identical if both are implemented in the same analyzer.  Each analyzer defines
the same inner class (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them creates its own way
of loading them or defines a large number of ctors to load stopwords from a file, set, arrays
etc. Those ctors should be deprecated and eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

