lucene-java-user mailing list archives

From Igor Shalyminov <ishalymi...@yandex-team.ru>
Subject Re: Indexing documents with multiple field values
Date Fri, 04 Oct 2013 13:20:50 GMT
Hi all!

A little bit more exploration :)

After indexing with multiple atomic field values, here is what I get:

indexSearcher.doc(0).getFields("gramm")

stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:V|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:PR|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|0|1000	S|1|0>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|1|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:ADV|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>

indexSearcher.doc(0).getField("gramm")

stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>

The values are absolutely correct, but why does getField() return only the first one instead of
concatenating them?
If I want to handcraft a custom highlighter, is iterating through (roughly) all the stored
field values the right technique? (Previously I was using Analyzer.tokenStream() and
incrementToken() on the entire concatenated field.)
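
For context, here is how I currently collect the stored instances by hand (a sketch, assuming
Lucene 4.x as above; getField() is defined to return only the first field instance with a
given name, so getFields() is the only way to see them all):

```java
// Sketch, assuming Lucene 4.x: Document.getField(name) returns only the
// first IndexableField with that name, so all stored instances of a
// multi-valued field have to be collected explicitly via getFields(name).
IndexableField[] fields = indexSearcher.doc(0).getFields("gramm");
StringBuilder joined = new StringBuilder();
for (IndexableField f : fields) {
    if (joined.length() > 0) {
        joined.append('\t'); // same separator used at index time
    }
    joined.append(f.stringValue()); // stored value of this instance
}
String allValues = joined.toString();
```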

-- 
Igor

02.10.2013, 21:26, "Igor Shalyminov" <ishalyminov@yandex-team.ru>:
> Hi again!
>
> Here is my problem in more detail: in addition to indexing, I need the multi-value field
> to be stored as-is. And if I pass it into the analyzer as multiple atomic tokens, it stores
> only the first of them.
> What do I need to do to my custom analyzer so that all the atomic tokens end up stored,
> concatenated?
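
As far as I understand, the stored value of a Field is the exact string passed to it and is
never touched by the analyzer, so a single concatenated stored value has to be added as its
own field. A sketch of one possible layout (the "gramm_raw" field name is just for
illustration; "attributeFieldType" is the FieldType configured later in this thread):

```java
// Sketch: index each atomic token as its own (unstored) Field instance,
// and store the concatenation once in a separate stored-only field.
FieldType indexedOnly = new FieldType(attributeFieldType);
indexedOnly.setStored(false);

StringBuilder joined = new StringBuilder();
for (String token : tokens) {
    doc.add(new Field("gramm", token, indexedOnly));
    if (joined.length() > 0) {
        joined.append('\t'); // same separator used elsewhere in the thread
    }
    joined.append(token);
}
// Stored only, not indexed; the field name is hypothetical.
doc.add(new StoredField("gramm_raw", joined.toString()));
```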
>
> --
> Igor
>
> 27.09.2013, 18:12, "Igor Shalyminov" <ishalyminov@yandex-team.ru>:
>
>>  Hello!
>>
>>  I have really long document field values. Tokens of these fields are of the form:
>>  word|payload|position_increment. (I need to control position increments and payloads
>>  manually.)
>>  I collect these compound tokens for the entire document, then join them with a '\t',
>>  and then pass this string to my custom analyzer.
>>  (For the really long field strings, something breaks in UnicodeUtil.UTF16toUTF8() with an
>>  ArrayIndexOutOfBoundsException.)
>>
>>  The analyzer is just the following:
>>
>>  class AmbiguousTokenAnalyzer extends Analyzer {
>>      private PayloadEncoder encoder = new IntegerEncoder();
>>
>>      @Override
>>      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>>          Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
>>          TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
>>          sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
>>          sink.addAttribute(OffsetAttribute.class);
>>          sink.addAttribute(CharTermAttribute.class);
>>          sink.addAttribute(PayloadAttribute.class);
>>          sink.addAttribute(PositionIncrementAttribute.class);
>>          return new TokenStreamComponents(source, sink);
>>      }
>>  }
>>
>>  CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter have an
>>  'incrementToken' method where the rightmost "|aaa" part of a token is processed.
>>
>>  The field is configured as:
>>          attributeFieldType.setIndexed(true);
>>          attributeFieldType.setStored(true);
>>          attributeFieldType.setOmitNorms(true);
>>          attributeFieldType.setTokenized(true);
>>          attributeFieldType.setStoreTermVectorOffsets(true);
>>          attributeFieldType.setStoreTermVectorPositions(true);
>>          attributeFieldType.setStoreTermVectors(true);
>>          attributeFieldType.setStoreTermVectorPayloads(true);
>>
>>  The problem is, if I pass the field to the analyzer as one huge string (via
>>  document.add(...)), it works OK, but if I pass it token after token, something breaks
>>  at the search stage.
>>  As I read somewhere, these two ways should produce the same index. Maybe my analyzer
>>  misses something?
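
One thing worth checking (a guess on my part): with multiple Field instances sharing a name,
the analyzer's getPositionIncrementGap() is consulted between consecutive instances, so the
token positions can differ from the single-string case. A sketch of making the gap explicit
in the custom analyzer:

```java
// Sketch: Analyzer.getPositionIncrementGap(field) is added between
// consecutive Field instances that share a name; the base class returns 0.
// Overriding it documents the intended behavior explicitly.
@Override
public int getPositionIncrementGap(String fieldName) {
    // 0 keeps positions contiguous across instances, matching the single
    // concatenated string; a positive value would separate them (useful
    // to keep phrase queries from matching across instance boundaries).
    return 0;
}
```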
>>
>>  --
>>  Best Regards,
>>  Igor Shalyminov
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

