lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4490) TermPositions misses some terms in some cases
Date Thu, 18 Oct 2012 15:48:03 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479088#comment-13479088 ]

Michael McCandless commented on LUCENE-4490:
--------------------------------------------

I think the problem is that 'a' is a stopword, and StandardAnalyzer strips stopwords, so the
term is never indexed. Try using WhitespaceAnalyzer instead?
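
To illustrate the suggestion above, here is a minimal, Lucene-free sketch of why the term goes missing. It is a simplified model, not Lucene's actual implementation: the class name StopwordSketch, both method names, and the abbreviated stopword list are hypothetical and chosen only for illustration. The real StandardAnalyzer tokenizes in a more involved way and uses its own default stopword set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for analysis at index time; NOT Lucene's implementation.
final class StopwordSketch {

    // Hypothetical, abbreviated stopword list (assumption, for illustration only).
    static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of"));

    // Models a stopword-stripping analyzer: lowercase, split on whitespace,
    // then drop every token that appears in the stopword list.
    static List<String> analyzeWithStopwords(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String tok : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!tok.isEmpty() && !STOPWORDS.contains(tok)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    // Models a whitespace-only analyzer: split on whitespace and keep every token.
    static List<String> analyzeWhitespaceOnly(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyzeWithStopwords("a"));  // prints []  : nothing left to index
        System.out.println(analyzeWithStopwords("b"));  // prints [b]
        System.out.println(analyzeWhitespaceOnly("a")); // prints [a]
    }
}
```

In this model the field value "a" produces no tokens at all, which would be consistent with the term dictionary size of 0 reported below. The stored value is kept verbatim regardless of analysis, which is why ir.document(0) still shows name:a. In the reporter's code, the corresponding change would presumably be to pass new WhitespaceAnalyzer(Version.LUCENE_34) to IndexWriterConfig in place of the StandardAnalyzer.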
                
> TermPositions misses some terms in some cases
> ---------------------------------------------
>
>                 Key: LUCENE-4490
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4490
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 3.4, 3.6.1
>            Reporter: Ivan Dimitrov Vasilev
>
> I have the following code:
> public static void main(String[] args) throws Exception {
>         RAMDirectory dir = new RAMDirectory();
>         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
>         org.apache.lucene.index.IndexWriter iw = new org.apache.lucene.index.IndexWriter(dir, iwc);
>         Document doc = new Document();
>         doc.add(new Field("name", "a", Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
>         iw.addDocument(doc);
>         iw.close();
>         IndexReader ir = IndexReader.open(dir);
>         Term t = new Term("name", "a");
>         TermPositions tp = ir.termPositions();
>         tp.seek(t);
>         boolean flag = false;
>         while (tp.next()) {
>             System.out.println(tp.doc());
>             flag = true;
>         }
>         if (!flag) { System.out.println("Missing term"); }
>         System.out.println(ir.document(0));
>         tp.close();
>         ir.close();
> }
> The output is:
> Missing term
> Document<stored,indexed,tokenized,omitNorms<name:a>>
> So the document contains the term <name:a>, but TermPositions cannot find it.
> When replacing the line:
> doc.add(new Field("name", "a", Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
> with the line:
> doc.add(new Field("name", "b", Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
> and line:
> Term t = new Term("name", "a");
> with the line:
> Term t = new Term("name", "b");
> Everything is OK. The output is:
> 0
> Document<stored,indexed,tokenized,omitNorms<name:b>>.
> I did some debugging and found that, while executing tp.seek(t), execution reaches line 68 of the SegmentTermEnum constructor:
> size = input.readLong();                    // read the size
> In the case of term <name:b> the size was assigned 1, while in the case of term <name:a> it was assigned 0.
