lucene-java-user mailing list archives

From Alan Woodward <a...@flax.co.uk>
Subject Re: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?
Date Tue, 03 Nov 2015 09:24:54 GMT
The second parameter passed to SpanCollector.collectLeaf() is the position, rather than an
index of any kind, which I think is going to mess things up for you.  But other than that,
you've got the right idea. :-)
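
For illustration only, here is a minimal, Lucene-free sketch of the lookup step described in the quoted message below. It assumes the collector's map is keyed by token position (the value collectLeaf() receives as its second argument) and that a span's end position is exclusive, so the last token of the span sits at endPosition - 1. The class name SpanOffsets, the method charRange, and the int[] {start, end} representation are hypothetical stand-ins, not part of the Lucene API; the offsets are hand-written for the example sentence rather than read from an index.

```java
import java.util.HashMap;
import java.util.Map;

public class SpanOffsets {

    /**
     * Maps a span, given as token positions, to a character range.
     * offsetsByPosition holds {startOffset, endOffset} per collected token position.
     * Returns {startChar, endChar}, or null if a boundary token was never collected.
     */
    public static int[] charRange(Map<Integer, int[]> offsetsByPosition,
                                  int spanStart, int spanEndExclusive) {
        int[] first = offsetsByPosition.get(spanStart);
        int[] last = offsetsByPosition.get(spanEndExclusive - 1);
        if (first == null || last == null) {
            return null;
        }
        return new int[] { first[0], last[1] };
    }

    public static void main(String[] args) {
        String s = "the quick brown fox jumped over the lazy dog";

        // Hand-written stand-in for what a collector might gather for the
        // terms "quick" (position 1) and "fox" (position 3).
        Map<Integer, int[]> offsets = new HashMap<>();
        offsets.put(1, new int[] { 4, 9 });
        offsets.put(3, new int[] { 16, 19 });

        // Span covering positions 1..3, i.e. start 1 and exclusive end 4.
        int[] range = charRange(offsets, 1, 4);
        System.out.println(s.substring(range[0], range[1])); // prints "quick brown fox"
    }
}
```

The null return covers the case where a span boundary does not line up with a collected leaf term, which is worth checking for rather than assuming away.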

Alan Woodward
www.flax.co.uk


On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote:

> All,
> 
>  I'm trying to find all spans in a given String via stored offsets in Lucene 5.3.1.
> I wanted to use the Highlighter with a NullFragmenter, but that is highlighting only the
> matching terms, not the full Spans (related to LUCENE-6796?).
> 
>  My current code iterates through the spans, stores the span positions in one array and
> gathers the character offsets via a SpanCollector in a Map<Integer, OffsetAttribute>.
> Is there a simpler way?
> 
> Something like this:
> 
> String s = "the quick brown fox jumped over the lazy dog";
> String field = "f";
> Analyzer analyzer = new StandardAnalyzer();
> 
> SpanQuery spanQuery = new SpanNearQuery(
>        new SpanQuery[] {
>                new SpanTermQuery(new Term(field, "fox")),
>                new SpanTermQuery(new Term(field, "quick"))
>        },
>        3,
>        false
> );
> 
> 
> MemoryIndex index = new MemoryIndex(true);
> 
> 
> index.addField(field, s, analyzer);
> index.freeze();
> 
> IndexSearcher searcher = index.createSearcher();
> IndexReader reader = searcher.getIndexReader();
> spanQuery = (SpanQuery) spanQuery.rewrite(reader);
> SpanWeight weight = (SpanWeight) searcher.createWeight(spanQuery, false);
> Spans spans = weight.getSpans(reader.leaves().get(0),
>        SpanWeight.Postings.OFFSETS);
> 
> if (spans == null) {
> //do something with full string
>     return;
> }
> 
> OffsetSpanCollector offsetSpanCollector = new OffsetSpanCollector();
> List<OffsetAttribute> spanPositions = new ArrayList<>();
> while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>    while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
>        OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>        offsetAttribute.setOffset(spans.startPosition(), spans.endPosition()-1);
>        spanPositions.add(offsetAttribute);
>        spans.collect(offsetSpanCollector);
>    }
> }
> Map<Integer, OffsetAttribute> charOffsets = offsetSpanCollector.getOffsets();
> //now iterate through the list of spanPositions and grab the character offsets for the
> //start and end tokens of each span from the charOffsets
> ...
> 
> 
> 
> 
> private class OffsetSpanCollector implements SpanCollector {
>    Map<Integer, OffsetAttribute> charOffsets = new HashMap<>();
> 
>    @Override
>    public void collectLeaf(PostingsEnum postingsEnum, int i, Term term) throws IOException {
> 
>        OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
>        offsetAttribute.setOffset(postingsEnum.startOffset(), postingsEnum.endOffset());
> 
>        charOffsets.put(i, offsetAttribute);
>    }
> 
>    @Override
>    public void reset() {
> 
>      //don't think I need to do anything with this?
>    }
> 
>    public Map<Integer, OffsetAttribute> getOffsets() {
>        return charOffsets;
>    }
> }
> 
> 

