lucene-dev mailing list archives

From "Masaru Hasegawa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5905) Different behaviour of JapaneseAnalyzer at indexing time vs. at search time results in no matches for some words.
Date Tue, 12 Jan 2016 12:38:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093824#comment-15093824 ]

Masaru Hasegawa commented on LUCENE-5905:
-----------------------------------------

A hacky workaround would be to set outputCompounds in JapaneseTokenizer to false using reflection
or something similar.
(Or simply use extended or normal mode instead of search mode.)
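
For what it's worth, the reflection part of that workaround would look roughly like this. This is only a sketch of the technique on a made-up stand-in class (the real outputCompounds field in JapaneseTokenizer is private, which is why reflection is needed at all):
{code:java}
import java.lang.reflect.Field;

public class ReflectionToggle {
    // Stand-in for JapaneseTokenizer: the real class keeps outputCompounds
    // in a private field with no setter; this fake class mimics that.
    static class CompoundEmitter {
        private boolean outputCompounds = true;
        boolean emitsCompounds() { return outputCompounds; }
    }

    // Flip a private boolean field by name via reflection.
    static void setPrivateBoolean(Object target, String fieldName, boolean value)
            throws ReflectiveOperationException {
        Field f = target.getClass().getDeclaredField(fieldName);
        f.setAccessible(true); // bypass the private access check
        f.setBoolean(target, value);
    }

    public static void main(String[] args) throws Exception {
        CompoundEmitter tokenizer = new CompoundEmitter();
        System.out.println(tokenizer.emitsCompounds()); // prints true
        setPrivateBoolean(tokenizer, "outputCompounds", false);
        System.out.println(tokenizer.emitsCompounds()); // prints false
    }
}
{code}
Of course this breaks if the field is ever renamed, which is why it is only a stopgap.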

But the root cause looks to be the way JapaneseTokenizer determines whether it emits the 2nd best
segmentation:

{code}
          // Use the penalty to set maxCost on the 2nd best
          // segmentation:
          int maxCost = posData.costs[bestIDX] + penalty;
          if (lastLeftWordID != -1) {
            maxCost += costs.get(getDict(backType).getRightId(backID), lastLeftWordID);
          }
                      :
                      :
          for(int idx=0;idx<posData.count;idx++) {
            int cost = posData.costs[idx];
            //System.out.println("    idx=" + idx + " prevCost=" + cost);

            if (lastLeftWordID != -1) {
              cost += costs.get(getDict(posData.backType[idx]).getRightId(posData.backID[idx]),
                                lastLeftWordID);
              //System.out.println("      += bgCost=" + costs.get(getDict(posData.backType[idx]).getRightId(posData.backID[idx]),
              //lastLeftWordID) + " -> " + cost);
            }
            //System.out.println("penalty " + posData.backPos[idx] + " to " + pos);
            //cost += computePenalty(posData.backPos[idx], pos - posData.backPos[idx]);
            if (cost < leastCost) {
              //System.out.println("      ** ");
              leastCost = cost;
              leastIDX = idx;
            }
          }

                      :
                      :

{code}

It looks like maxCost differs depending on the following token. Because of that, depending on the
token after 秋葉原, it may or may not emit the 2nd best segmentation.
But I think the behaviour shouldn't change depending on the next token. (We always want consistent
output for the same token, don't we?)

If I understand it correctly, if it doesn't take the connection cost to the next token into
account, like below, it should produce consistent results:
{code}
          // Use the penalty to set maxCost on the 2nd best
          // segmentation:
          int maxCost = posData.costs[bestIDX] + penalty;
                      :
                      :
          for(int idx=0;idx<posData.count;idx++) {
            int cost = posData.costs[idx];
            //System.out.println("    idx=" + idx + " prevCost=" + cost);
             if (cost < leastCost) {
              //System.out.println("      ** ");
              leastCost = cost;
              leastIDX = idx;
            }
          }
                      :
                      :
{code}
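
To make the dependence on the next token concrete, here is a toy sketch in plain Java. All cost values are invented and this mirrors only the shape of the check quoted above, not the actual JapaneseTokenizer code:
{code:java}
public class SecondBestThreshold {
    // Toy model of the check above (all costs invented).
    // bestCost/secondCost: path costs of the best and 2nd best segmentations
    // ending at the same position; connBest/connSecond: hypothetical
    // connection costs from each segmentation's last word to the NEXT token.
    static boolean emitsSecondBest(int bestCost, int secondCost, int penalty,
                                   int connBest, int connSecond) {
        // The threshold folds in the connection cost to the next token...
        int maxCost = bestCost + penalty + connBest;
        // ...and so does the candidate's cost, so the outcome depends on
        // which token happens to follow.
        int candidateCost = secondCost + connSecond;
        return candidateCost <= maxCost;
    }

    public static void main(String[] args) {
        int best = 100, second = 140, penalty = 50;
        // Next token A: both segmentations connect equally well -> emitted.
        System.out.println(emitsSecondBest(best, second, penalty, 10, 10));  // prints true
        // Next token B: the 2nd best connects badly -> suppressed, even
        // though the segmentations themselves are identical to case A.
        System.out.println(emitsSecondBest(best, second, penalty, 10, 120)); // prints false
    }
}
{code}
Dropping the connBest/connSecond terms, as proposed above, makes the result a function of the segmentations alone.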

Thoughts?

> Different behaviour of JapaneseAnalyzer at indexing time vs. at search time results in no matches for some words.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5905
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 3.6.2, 4.9, 5.2.1
>         Environment: Java 8u5
>            Reporter: Trejkaz
>
> A document with the word 秋葉原 in the body, when analysed by the JapaneseAnalyzer (AKA Kuromoji), cannot be found when searching for the same text as a phrase query.
> Two programs are provided to reproduce the issue. Both programs print out the term docs and positions and then the result of parsing the phrase query.
> As shown by the output, at analysis time, there is a lone Japanese term "秋葉原". At query parsing time, there are *three* such terms - "秋葉" and "秋葉原" at position 0 and "原" at position 1. Because all terms must be present for a phrase query to be a match, the query never matches, which is quite a serious issue for us.
> *Any workarounds, no matter how hacky, would be extremely helpful at this point.*
> My guess is that this is a quirk with the analyser. If it happened with StandardAnalyzer, surely someone would have discovered it before I did.
> Lucene 5.2.1 reproduction:
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.TextField;
> import org.apache.lucene.index.DirectoryReader;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.LeafReader;
> import org.apache.lucene.index.LeafReaderContext;
> import org.apache.lucene.index.MultiFields;
> import org.apache.lucene.index.PostingsEnum;
> import org.apache.lucene.index.Terms;
> import org.apache.lucene.index.TermsEnum;
> import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
> import org.apache.lucene.queryparser.flexible.standard.config.StandardQueryConfigHandler;
> import org.apache.lucene.search.DocIdSetIterator;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.Bits;
> import org.apache.lucene.util.BytesRef;
> public class LuceneMissingTerms {
>     public static void main(String[] args) throws Exception {
>         try (Directory directory = new RAMDirectory()) {
>             Analyzer analyser = new JapaneseAnalyzer();
>             try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyser))) {
>                 Document document = new Document();
>                 document.add(new TextField("content", "blah blah commercial blah blah \u79CB\u8449\u539F blah blah", Field.Store.NO));
>                 writer.addDocument(document);
>             }
>             try (IndexReader multiReader = DirectoryReader.open(directory)) {
>                 for (LeafReaderContext leaf : multiReader.leaves()) {
>                     LeafReader reader = leaf.reader();
>                     Terms terms = MultiFields.getFields(reader).terms("content");
>                     TermsEnum termsEnum = terms.iterator();
>                     BytesRef text;
>                     //noinspection NestedAssignment
>                     while ((text = termsEnum.next()) != null) {
>                         System.out.println("term: " + text.utf8ToString());
>                         Bits liveDocs = reader.getLiveDocs();
>                         PostingsEnum postingsEnum = termsEnum.postings(liveDocs, null, PostingsEnum.POSITIONS);
>                         int doc;
>                         //noinspection NestedAssignment
>                         while ((doc = postingsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>                             System.out.println("  doc: " + doc);
>                             int freq = postingsEnum.freq();
>                             for (int i = 0; i < freq; i++) {
>                                 int pos = postingsEnum.nextPosition();
>                                 System.out.println("    pos: " + pos);
>                             }
>                         }
>                     }
>                 }
>                 StandardQueryParser queryParser = new StandardQueryParser(analyser);
>                 queryParser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
>                 // quoted to work around strange behaviour of StandardQueryParser treating this as a boolean query.
>                 Query query = queryParser.parse("\"\u79CB\u8449\u539F\"", "content");
>                 System.out.println(query);
>                 TopDocs topDocs = new IndexSearcher(multiReader).search(query, 10);
>                 System.out.println(topDocs.totalHits);
>             }
>         }
>     }
> }
> {code}
> Lucene 4.9 reproduction:
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.TextField;
> import org.apache.lucene.index.AtomicReader;
> import org.apache.lucene.index.AtomicReaderContext;
> import org.apache.lucene.index.DirectoryReader;
> import org.apache.lucene.index.DocsAndPositionsEnum;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.MultiFields;
> import org.apache.lucene.index.Terms;
> import org.apache.lucene.index.TermsEnum;
> import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
> import org.apache.lucene.queryparser.flexible.standard.config.StandardQueryConfigHandler;
> import org.apache.lucene.search.DocIdSetIterator;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.Bits;
> import org.apache.lucene.util.BytesRef;
> import org.apache.lucene.util.Version;
> public class LuceneMissingTerms {
>     public static void main(String[] args) throws Exception {
>         try (Directory directory = new RAMDirectory()) {
>             Analyzer analyser = new JapaneseAnalyzer(Version.LUCENE_4_9);
>             try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_4_9, analyser))) {
>                 Document document = new Document();
>                 document.add(new TextField("content", "blah blah commercial blah blah \u79CB\u8449\u539F blah blah", Field.Store.NO));
>                 writer.addDocument(document);
>             }
>             try (IndexReader multiReader = DirectoryReader.open(directory)) {
>                 for (AtomicReaderContext atomicReaderContext : multiReader.leaves()) {
>                     AtomicReader reader = atomicReaderContext.reader();
>                     Terms terms = MultiFields.getFields(reader).terms("content");
>                     TermsEnum termsEnum = terms.iterator(null);
>                     BytesRef text;
>                     //noinspection NestedAssignment
>                     while ((text = termsEnum.next()) != null) {
>                         System.out.println("term: " + text.utf8ToString());
>                         Bits liveDocs = reader.getLiveDocs();
>                         DocsAndPositionsEnum docsAndPositionsEnum = termsEnum.docsAndPositions(liveDocs, null);
>                         int doc;
>                         //noinspection NestedAssignment
>                         while ((doc = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>                             System.out.println("  doc: " + doc);
>                             int freq = docsAndPositionsEnum.freq();
>                             for (int i = 0; i < freq; i++) {
>                                 int pos = docsAndPositionsEnum.nextPosition();
>                                 System.out.println("    pos: " + pos);
>                             }
>                         }
>                     }
>                 }
>                 StandardQueryParser queryParser = new StandardQueryParser(analyser);
>                 queryParser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
>                 // quoted to work around strange behaviour of StandardQueryParser treating this as a boolean query.
>                 Query query = queryParser.parse("\"\u79CB\u8449\u539F\"", "content");
>                 System.out.println(query);
>                 TopDocs topDocs = new IndexSearcher(multiReader).search(query, 10);
>                 System.out.println(topDocs.totalHits);
>             }
>         }
>     }
> }
> {code}
> Lucene 3.6.2 reproduction:
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.index.TermPositions;
> import org.apache.lucene.queryParser.standard.StandardQueryParser;
> import org.apache.lucene.queryParser.standard.config.StandardQueryConfigHandler;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.Version;
> import org.junit.Test;
> import static org.hamcrest.Matchers.*;
> import static org.junit.Assert.*;
> public class TestJapaneseAnalysis {
>     @Test
>     public void testJapaneseAnalysis() throws Exception {
>         try (Directory directory = new RAMDirectory()) {
>             Analyzer analyser = new JapaneseAnalyzer(Version.LUCENE_36);
>             try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36, analyser))) {
>                 Document document = new Document();
>                 document.add(new Field("content", "blah blah commercial blah blah \u79CB\u8449\u539F blah blah", Field.Store.NO, Field.Index.ANALYZED));
>                 writer.addDocument(document);
>             }
>             try (IndexReader reader = IndexReader.open(directory);
>                  TermEnum terms = reader.terms(new Term("content", ""));
>                  TermPositions termPositions = reader.termPositions()) {
>                 do {
>                     Term term = terms.term();
>                     if (term.field() != "content") {
>                         break;
>                     }
>                     System.out.println(term);
>                     termPositions.seek(terms);
>                     while (termPositions.next()) {
>                         System.out.println("  " + termPositions.doc());
>                         int freq = termPositions.freq();
>                         for (int i = 0; i < freq; i++) {
>                             System.out.println("    " + termPositions.nextPosition());
>                         }
>                     }
>                 } while (terms.next());
>                 StandardQueryParser queryParser = new StandardQueryParser(analyser);
>                 queryParser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
>                 // quoted to work around strange behaviour of StandardQueryParser treating this as a boolean query.
>                 Query query = queryParser.parse("\"\u79CB\u8449\u539F\"", "content");
>                 System.out.println(query);
>                 TopDocs topDocs = new IndexSearcher(reader).search(query, 10);
>                 assertThat(topDocs.totalHits, is(1));
>             }
>         }
>     }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


