lucene-java-user mailing list archives

From Michael Breu <Michael.B...@arctis.at>
Subject Solved: real infix suggester, not AnalyzingInfixSuggester
Date Fri, 31 Oct 2014 16:58:21 GMT

Hello Oliver,

Just to close this case:
With your hint I was able to create an InfixSuggester based on the
AnalyzingSuggester. The key was to wrap the basic analyzer so that its
token stream passes through an NGramTokenFilter. The solution is quite
simple, but it took me some time to fit the pieces together.

    // Index-time analyzer: the base analyzer's tokens are split into n-grams.
    private final Analyzer autocompletionAnalyzer =
        new AnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {

            @Override
            protected Analyzer getWrappedAnalyzer(String fieldName) {
                return analyzer;
            }

            @Override
            protected TokenStreamComponents wrapComponents(
                    String fieldName, TokenStreamComponents components) {
                return new TokenStreamComponents(
                    components.getTokenizer(),
                    new NGramTokenFilter(Version.LUCENE_47,
                                         components.getTokenStream(),
                                         MIN_INFIX_SUGGESTER_CHARS, 100));
            }
        };

    // First argument is the index analyzer, second the query analyzer,
    // so n-grams are produced at index time only.
    AnalyzingSuggester newInfixSuggester =
        new AnalyzingSuggester(autocompletionAnalyzer, analyzer);
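
To illustrate why the n-gram wrapper gives infix behaviour, here is a
minimal plain-Java sketch. It does not use Lucene; the class and method
names are made up for illustration, and the minimum gram size of 2 is an
assumption, because the value of MIN_INFIX_SUGGESTER_CHARS is not given
in this message. Every substring of an indexed term becomes its own
entry, so a prefix match against those entries behaves like an infix
match against the original term.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramInfixSketch {

    /** Returns every substring of term whose length lies in [minGram, maxGram]. */
    static List<String> nGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            // Longest gram that still fits at this start position.
            int longest = Math.min(maxGram, term.length() - start);
            for (int len = minGram; len <= longest; len++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    /** True if the typed text is a prefix of at least one n-gram of the term. */
    static boolean matchesInfix(String term, String typed, int minGram, int maxGram) {
        for (String gram : nGrams(term, minGram, maxGram)) {
            if (gram.startsWith(typed)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String term = "donaudampfschifffahrtsgesellschaft";
        System.out.println(matchesInfix(term, "schiff", 2, 100)); // prints true
        System.out.println(matchesInfix(term, "rhein", 2, 100));  // prints false
    }
}
```

The number of grams grows quickly with term length, so long compounds
produce many entries per term.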

The rest is based on an older proposal by Mat Mannion on Stack Overflow:
http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene

Best regards

Michael
> Oliver Christ <mailto:ochrist@EBSCO.COM>
> Monday, 27 October 2014 15:09
>
> Hi Michael,
>
> There may be several entry points; I'm not sure which one still works.
> The suggester data processing chain has changed quite a bit since I
> looked at it about two years ago. Maybe Mike or Robert can chime in if
> I'm totally off.
>
> One way I experimented with was to implement a custom TermFreqIterator
> which essentially iterates over some input data source and returns a
> sequence of (String, weight) tuples. You can pass your custom
> TermFreqIterator when calling AnalyzingSuggester#build(). The custom
> TermFreqIterator returns each suffix of the actual input with the same
> input weight, in an inner loop.
>
> I think TermFreqIterator is now called InputIterator, but the
> principle is the same.
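
The suffix idea described above can be sketched in plain Java as
follows. This is only an illustration under stated assumptions: it does
not call any Lucene API, the names Entry, expand and minLength are
hypothetical, and lowercasing merely stands in for whatever analysis
the real analyzer would apply. In an actual suggester, a custom
iterator passed to the build step would return these entries one at a
time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SuffixExpansionSketch {

    /** One suggester input: the key to match, the surface form to return, and its weight. */
    static final class Entry {
        final String key;
        final String surface;
        final long weight;

        Entry(String key, String surface, long weight) {
            this.key = key;
            this.surface = surface;
            this.weight = weight;
        }
    }

    /** Emits one entry per suffix of the lowercased surface form, all with the same weight. */
    static List<Entry> expand(String surface, long weight, int minLength) {
        List<Entry> entries = new ArrayList<>();
        // Assumption: lowercasing is a stand-in for real analysis.
        String analyzed = surface.toLowerCase(Locale.ROOT);
        // Stop once the remaining suffix would be shorter than minLength.
        for (int start = 0; start + minLength <= analyzed.length(); start++) {
            entries.add(new Entry(analyzed.substring(start), surface, weight));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Prints the keys donau, onau, nau, au, each mapped to "Donau" with weight 10.
        for (Entry e : expand("Donau", 10, 2)) {
            System.out.println(e.key + " -> " + e.surface + " (" + e.weight + ")");
        }
    }
}
```

A minimum suffix length keeps very short, unspecific keys out of the
suggester, at the cost of not matching very short inputs near the end
of a word.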
>
> Cheers, Oli
>
> *From:* Michael Breu [mailto:Michael.Breu@arctis.at]
> *Sent:* Monday, October 27, 2014 9:21 AM
> *To:* java-user@lucene.apache.org
> *Subject:* Re: real infix suggester, not AnalyzingInfixSuggester
>
> Hello Oliver,
>
> I had already looked into the AnalyzingSuggester before. I was not
> able to spot the location where it generates the prefixes. It works
> with some path analysis based on automata (both for analysis and query).
> It is not really clear to me how to extend these automata.
>
> Could you give me a hint on how to start?
>
> Thank you for your kind support
>
> Michael
>
>
> *Oliver Christ* <mailto:ochrist@EBSCO.COM>
>
> Monday, 27 October 2014 12:47
>
> The hard way may be to use the standard AnalyzingSuggester but to add
> each (analyzed) suffix of the surface string (mapping to the full
> surface form) during automaton generation.
>
> I.e. when adding "Donau...", you add all analyzed suffixes "donau...",
> "onau...", "nau...", ... - all mapping to "Donau...", with identical
> rank.
>
> I think on equal inputs, the rank of the last one added wins, but I'm
> not sure.
>
> You may "drown" in unspecific suggestions at least for short inputs,
> and the automata will get large. But it should give you a suggester
> you can play around with to evaluate whether you need decompounding
> (you probably do).
>
> Cheers, Oli
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> *Michael Sokolov* <mailto:msokolov@safaribooksonline.com>
>
> Monday, 27 October 2014 12:22
>
> Have you considered combining the AnalyzingInfixSuggester with a
> German decompounding filter?  If you break compound words into their
> constituent parts during analysis, then the suggester will be able to
> do what you want (prefix matches on the word-parts).  I found this
> project with a quick google search:
> https://github.com/jprante/elasticsearch-analysis-decompound; I don't
> know how good it is or whether it fits with your environment, but it
> could be a start.
>
> -Mike
>
> *Michael Breu* <mailto:Michael.Breu@arctis.at>
>
> Monday, 27 October 2014 11:34
>
> Hello,
>
> I'm looking for an infix suggester that allows infix search for a given
> term. This might not be that important in English.
> However, in German we have quite complex compound words like
> Donaudampfschifffahrtsgesellschaftskapitän,
> which is composed of the nouns Donau (Danube), Dampf (steam), schiff
> (boat), etc.
>
> So I would like to support searches like *schiff* to suggest
> Donaudampfschifffahrtsgesellschaft.
>
> I mistakenly tried the AnalyzingInfixSuggester first; however, it
> does not do what I expect, because it does prefix matches on tokens but
> no infix matches.
>
> I tried to adapt the AnalyzingSuggester; however, it seemed too complex
> for an easy conversion to an infix suggester.
>
> I know that this was already asked in
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201103.mbox/%3C1301054307585-2729996.post@n3.nabble.com%3E,
> however, nobody answered this post as far as I know.
>
> Thank you for your help
>
> Wallenstein
>
