lucene-dev mailing list archives

From "Alan Woodward (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6034) MemoryIndex should be able to wrap TermVector Terms
Date Mon, 01 Dec 2014 09:37:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229589#comment-14229589 ]

Alan Woodward commented on LUCENE-6034:
---------------------------------------

+1, this is a nice cleanup

On the question of what to do if you try to add a TermVectors field with no stored offsets
when the MemoryIndex is expecting them: should we just throw an IllegalArgumentException?
It's better to get an error when you add the field than further down the line, when you
try to use the offsets.
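
To make the fail-fast suggestion concrete, here is a minimal sketch. It does not use the real MemoryIndex: FailFastIndex, VectorTerms and the storeOffsets flag are hypothetical stand-ins, and the only thing borrowed from Lucene is the idea of asking the Terms whether it has offsets (as Terms.hasOffsets() does).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a term-vector Terms; only the offsets flag matters here.
interface VectorTerms {
    boolean hasOffsets();
}

// Hypothetical, heavily simplified index. This is NOT the real MemoryIndex.
final class FailFastIndex {
    private final boolean storeOffsets;
    private final Map<String, VectorTerms> fields = new TreeMap<>();

    FailFastIndex(boolean storeOffsets) {
        this.storeOffsets = storeOffsets;
    }

    // Reject the field at add time instead of failing later when offsets are read.
    void addField(String fieldName, VectorTerms terms) {
        if (storeOffsets && !terms.hasOffsets()) {
            throw new IllegalArgumentException(
                "field '" + fieldName + "' has no stored offsets, but this index expects them");
        }
        fields.put(fieldName, terms);
    }

    boolean hasField(String fieldName) {
        return fields.containsKey(fieldName);
    }
}

final class FailFastDemo {
    public static void main(String[] args) {
        FailFastIndex index = new FailFastIndex(true);
        index.addField("body", () -> true);       // has offsets: accepted
        try {
            index.addField("title", () -> false); // no offsets: rejected immediately
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the check in addField, the caller sees the problem at the point where the bad field is supplied, and a rejected field never becomes visible in the index.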

> MemoryIndex should be able to wrap TermVector Terms
> ---------------------------------------------------
>
>                 Key: LUCENE-6034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6034
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.0
>
>         Attachments: LUCENE-6034.patch, LUCENE-6034.patch
>
>
> The default highlighter has a "WeightedSpanTermExtractor" that uses MemoryIndex for certain
> queries -- basically phrases, SpanQueries, and the like.  For lots of text, this aspect of
> highlighting is time consuming and consumes a fair amount of memory.  What also consumes memory
> is that it wraps the tokenStream in CachingTokenFilter in this case.  But if the underlying
> TokenStream is actually from TokenSources (wrapping TermVector Terms), this is all needless!
> Furthermore, MemoryIndex doesn't support payloads.
> The patch here has 3 aspects to it:
> * Internal refactoring to MemoryIndex to simplify it by maintaining the fields in a sorted
>   state using a TreeMap.  The ramifications of this led to reduced LOC for this file, even with
>   the other features I added.  It also puts the FieldInfo on the Info, and thus there's one
>   less data structure to keep around.  I suppose if there are a huge variety of fields in MemoryIndex,
>   the aggregated N*Log(N) field lookup could add up, but that seems very unlikely.  I also brought
>   in the MemoryIndexNormDocValues as a simple anonymous inner class - it's super-simple after
>   all, not worth having in a separate file.
> * New MemoryIndex.addField(String fieldName, Terms) method.  In this case, MemoryIndex
>   is providing the supporting wrappers around the underlying Terms so that it appears as an
>   Index.  In so doing, MemoryIndex supports payloads for such fields.
> * WeightedSpanTermExtractor now detects TokenSources' wrapping of Terms and it supplies
>   this to MemoryIndex.
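
The first point of the issue description (fields kept sorted in a TreeMap, with the per-field metadata living on the same entry) might look roughly like the sketch below. This is an illustration only, not the patch: FieldEntry and SortedFields are hypothetical names, and the integer field number is a stand-in for whatever a FieldInfo would carry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical per-field record; the real Info and FieldInfo hold far more state.
final class FieldEntry {
    final String name;
    final int number; // stand-in for the metadata a FieldInfo would carry

    FieldEntry(String name, int number) {
        this.name = name;
        this.number = number;
    }
}

// Hypothetical container: a single TreeMap keeps fields sorted by name, so no
// separate sorted array or side map of field metadata has to be maintained.
final class SortedFields {
    private final TreeMap<String, FieldEntry> fields = new TreeMap<>();

    // Each lookup costs O(log N) in the number of fields.
    FieldEntry getOrCreate(String name) {
        FieldEntry entry = fields.get(name);
        if (entry == null) {
            entry = new FieldEntry(name, fields.size());
            fields.put(name, entry);
        }
        return entry;
    }

    // Iteration order is the sorted key order, with no explicit sort step.
    List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }
}
```

The trade-off mentioned in the description shows up in getOrCreate: every lookup is logarithmic rather than constant time, which only matters if an index holds a very large number of distinct fields.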



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


