lucene-java-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject extracting charoffsets from SpanWeight's getSpans() in 5.3.1?
Date Tue, 03 Nov 2015 00:26:06 GMT
All,

  I'm trying to find all spans in a given String via stored offsets in Lucene 5.3.1.  I wanted
to use the Highlighter with a NullFragmenter, but that highlights only the matching terms,
not the full spans (related to LUCENE-6796?).

  My current code iterates through the spans, stores the span positions in a list, and gathers
the character offsets via a SpanCollector in a Map<Integer, OffsetAttribute>.  Is there
a simpler way?

Something like this:

String s = "the quick brown fox jumped over the lazy dog";
String field = "f";
Analyzer analyzer = new StandardAnalyzer();

SpanQuery spanQuery = new SpanNearQuery(
        new SpanQuery[] {
                new SpanTermQuery(new Term(field, "fox")),
                new SpanTermQuery(new Term(field, "quick"))
        },
        3,
        false
);


MemoryIndex index = new MemoryIndex(true);


index.addField(field, s, analyzer);
index.freeze();

IndexSearcher searcher = index.createSearcher();
IndexReader reader = searcher.getIndexReader();
spanQuery = (SpanQuery) spanQuery.rewrite(reader);
SpanWeight weight = (SpanWeight) searcher.createWeight(spanQuery, false);
Spans spans = weight.getSpans(reader.leaves().get(0),
        SpanWeight.Postings.OFFSETS);

if (spans == null) {
    // do something with full string
    return;
}

OffsetSpanCollector offsetSpanCollector = new OffsetSpanCollector();
List<OffsetAttribute> spanPositions = new ArrayList<>();
while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
    while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
        OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
        offsetAttribute.setOffset(spans.startPosition(), spans.endPosition()-1);
        spanPositions.add(offsetAttribute);
        spans.collect(offsetSpanCollector);
    }
}
Map<Integer, OffsetAttribute> charOffsets = offsetSpanCollector.getOffsets();
// now iterate through the list of spanPositions and grab the character offsets
// for the start and end tokens of each span from the charOffsets
...




private class OffsetSpanCollector implements SpanCollector {
    Map<Integer, OffsetAttribute> charOffsets = new HashMap<>();

    @Override
    public void collectLeaf(PostingsEnum postingsEnum, int i, Term term) throws IOException {
        // i is the token position of the leaf term that matched
        OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
        offsetAttribute.setOffset(postingsEnum.startOffset(), postingsEnum.endOffset());

        charOffsets.put(i, offsetAttribute);
    }

    @Override
    public void reset() {

      //don't think I need to do anything with this?
    }

    public Map<Integer, OffsetAttribute> getOffsets() {
        return charOffsets;
    }
}
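
The final step described in the comment above (looking up the character offsets of each span's first and last token) could be sketched roughly as follows. This is only an illustration in plain Java with no Lucene types: the class name SpanCharOffsets, the method resolve, and the use of int[] pairs in place of OffsetAttribute are all assumptions made for the example. It also assumes that both endpoint positions of a span were seen by the collector, and it skips any span for which that is not the case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class SpanCharOffsets {
    private SpanCharOffsets() {}

    /**
     * spanPositions: one {startPosition, lastPositionInclusive} pair per span,
     *                i.e. {spans.startPosition(), spans.endPosition() - 1}.
     * charOffsets:   token position -> {startOffset, endOffset}, as gathered
     *                by the collector.
     * Returns one {startChar, endChar} pair per span whose endpoints were both collected.
     */
    static List<int[]> resolve(List<int[]> spanPositions, Map<Integer, int[]> charOffsets) {
        List<int[]> ranges = new ArrayList<>();
        for (int[] span : spanPositions) {
            int[] first = charOffsets.get(span[0]);
            int[] last = charOffsets.get(span[1]);
            if (first == null || last == null) {
                // Assumption of this sketch: silently skip spans with a missing endpoint.
                continue;
            }
            // Span covers from the start of the first token to the end of the last token.
            ranges.add(new int[] {first[0], last[1]});
        }
        return ranges;
    }
}
```

Each returned pair can be passed straight to s.substring(start, end) to recover the text covered by the span.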


