lucene-java-user mailing list archives

From Carsten Schnober <schno...@ids-mannheim.de>
Subject Reading Payloads
Date Tue, 23 Apr 2013 11:03:49 GMT
Hi,
I'm trying to extract the payloads of specific tokens from an index in
the following way (with a sample document number and term inserted):

Terms terms = reader.getTermVector(16504, "term");
TokenStream tokenstream = TokenSources.getTokenStream(terms);
while (tokenstream.incrementToken()) {
  OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
  int start = offset.startOffset();
  int end = offset.endOffset();
  String token = tokenstream.getAttribute(CharTermAttribute.class).toString();

  PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
  BytesRef payloadBytes = payloadAttr.getPayload();

  ...
}

This works fine for the OffsetAttribute and the CharTermAttribute, but
payloadAttr.getPayload() unfortunately returns null for all documents
and all tokens. However, I know that the payloads are stored in the
index, as I can retrieve them through a SpanQuery with
Spans.getPayload(). I actually expect every token to carry a payload, as
my custom tokenizer implementation has the following lines:

public class KoraTokenizer extends Tokenizer {
  ...
  private PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
  ...
  public boolean incrementToken() {
    ...
    payloadAttr.setPayload(new BytesRef(payloadString));
    ...
  }
  ...
}

I've asserted that the payloadString variable is never an empty String
and, as I said above, I can retrieve the payloads with
Spans.getPayload(). So what am I doing wrong in my
tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I previously
used tokenstream.getAttribute(), as for the other attributes, but that
threw an IllegalArgumentException, so I followed the recommendation
given in the documentation and replaced it with addAttribute().
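
To make the getAttribute()/addAttribute() difference concrete, here is a
minimal sketch in plain Java. It is a simplified model, not Lucene's
code: ToyAttributeSource, ToyPayloadAttribute and AttributeDemo are
hypothetical names, and the Supplier parameter is my own simplification.
It assumes that the stream built from the term vector never registers a
PayloadAttribute itself, which would match both the
IllegalArgumentException and the null payloads described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a payload attribute; NOT Lucene's PayloadAttribute.
class ToyPayloadAttribute {
    private byte[] payload; // stays null until a producer calls setPayload()

    byte[] getPayload() { return payload; }

    void setPayload(byte[] payload) { this.payload = payload; }
}

// Hypothetical stand-in for an attribute registry; NOT Lucene's AttributeSource.
class ToyAttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // Returns the registered instance, creating an empty one if none exists.
    <T> T addAttribute(Class<T> type, Supplier<T> factory) {
        return type.cast(attributes.computeIfAbsent(type, k -> factory.get()));
    }

    // Returns the registered instance, or fails if nobody registered one.
    <T> T getAttribute(Class<T> type) {
        Object existing = attributes.get(type);
        if (existing == null) {
            throw new IllegalArgumentException(
                "No attribute registered for " + type.getSimpleName());
        }
        return type.cast(existing);
    }
}

class AttributeDemo {
    public static void main(String[] args) {
        // Assumption: the producer of this stream never registered a payload attribute.
        ToyAttributeSource stream = new ToyAttributeSource();

        try {
            stream.getAttribute(ToyPayloadAttribute.class);
        } catch (IllegalArgumentException e) {
            System.out.println("getAttribute failed: " + e.getMessage());
        }

        // addAttribute succeeds, but only because it creates a fresh, empty instance.
        ToyPayloadAttribute attr =
            stream.addAttribute(ToyPayloadAttribute.class, ToyPayloadAttribute::new);
        System.out.println("payload is null: " + (attr.getPayload() == null));
    }
}
```

If this model applies, addAttribute() only hides the exception: the
consumer gets an attribute that no producer ever fills, so getPayload()
stays null regardless of what is stored in the index.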

Thanks!
Carsten




-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
