lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (LUCENE-5558) Add TruncateTokenFilter
Date Tue, 01 Apr 2014 04:33:15 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-5558.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 5.0

Thanks Ahmet, very nice!

> Add TruncateTokenFilter
> -----------------------
>
>                 Key: LUCENE-5558
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5558
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.7
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>              Labels: Turkish, f5
>             Fix For: 4.8, 5.0
>
>         Attachments: LUCENE-5558.patch, LUCENE-5558.patch, LUCENE-5558.patch, LUCENE-5558.patch
>
>
> I am using this filter as a stemmer for Turkish. It is used in much academic research (classification, retrieval), where it is called the Fixed Prefix Stemmer, the Simple Truncation Method, or F5 for short.
> Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in [Information Retrieval on Turkish Texts|http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf]. This is the same work from which most of stopwords_tr.txt was acquired.
> ElasticSearch has a [truncate|http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html] filter, but it does not respect the keyword attribute. It also has a use case similar to TruncateFieldUpdateProcessorFactory.
> The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII folding. It is a diacritics-insensitive stemmer and works well with ASCII folding; see [Effects of diacritics on Turkish information retrieval|http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf].
> Here is the full field type I use for "diacritics-insensitive search" for Turkish:
> {code:xml}
>  <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
>    <analyzer>
>      <tokenizer class="solr.StandardTokenizerFactory"/>
>      <filter class="solr.ApostropheFilterFactory"/>
>      <filter class="solr.TurkishLowerCaseFilterFactory"/>
>      <filter class="solr.ASCIIFoldingFilterFactory"/>
>      <filter class="solr.KeywordRepeatFilterFactory"/>
>      <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
>      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
>  </fieldType>
> {code}
> I would like to get community opinions:
> 1) Is there any interest in this?
> 2) Should the keyword attribute be respected?
> 3) Package name: analysis.misc versus analysis.tr
> 4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter
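
For readers unfamiliar with fixed-prefix stemming, the behaviour described in the issue can be sketched in plain Java without any Lucene dependency. This is only an illustration, not the code from the attached patches: the class name FixedPrefixSketch and the methods truncate and expand are hypothetical, and counting Unicode code points rather than chars is an assumption made here so that a surrogate pair is never split. The expand method approximates what the KeywordRepeat, Truncate and RemoveDuplicates steps of the field type produce for a single token that has already been lowercased and ASCII-folded.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; not the Lucene TruncateTokenFilter itself.
final class FixedPrefixSketch {

    // Keeps at most prefixLength code points of the token (F5 when prefixLength is 5).
    static String truncate(String token, int prefixLength) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be at least 1");
        }
        int codePoints = token.codePointCount(0, token.length());
        if (codePoints <= prefixLength) {
            return token; // short tokens pass through unchanged
        }
        // Assumption: cut on a code point boundary so surrogate pairs stay intact.
        int end = token.offsetByCodePoints(0, prefixLength);
        return token.substring(0, end);
    }

    // Approximates KeywordRepeat -> Truncate -> RemoveDuplicates for one token:
    // the keyword-marked copy is left untouched, the other copy is truncated,
    // and identical forms at the same position collapse into one.
    static List<String> expand(String token, int prefixLength) {
        Set<String> forms = new LinkedHashSet<>();
        forms.add(token);                         // keyword copy, not truncated
        forms.add(truncate(token, prefixLength)); // stemmed copy
        return new ArrayList<>(forms);
    }

    public static void main(String[] args) {
        System.out.println(expand("kitaplar", 5)); // prints [kitaplar, kitap]
        System.out.println(expand("ev", 5));       // prints [ev]
    }
}
```

Because the original form is kept next to its prefix, an exact match on "kitaplar" is still possible while "kitap" matches every word sharing that five-letter prefix. That is the practical reason for asking whether the keyword attribute should be respected.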



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

