lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7639) Use Suffix Arrays for fast search with leading asterisks
Date Fri, 03 Feb 2017 11:42:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851384#comment-15851384 ]

Michael McCandless commented on LUCENE-7639:
--------------------------------------------

It would be nice to have a simple option to make heavy (infix, prefix) wildcard queries fast.

I think a custom codec, and likely a custom {{WildcardQuery}} impl that "sees" it is working
with the custom codec and taps into the suffix array, is a good way to implement this.  It
could be a fairly straightforward conversion of the current patch into a custom codec: your
codec's postings implementation would wrap the default codec and hold onto the {{WildcardHelper}}
instance.
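
To make that concrete, here is a rough sketch (not the patch itself) of what such a delegating
postings format could look like; the class name and the {{SuffixArray50}} SPI name are made up,
and the read-side wrapping is only indicated in a comment:

{noformat}
import java.io.IOException;
import org.apache.lucene.codecs.FieldsConsumer;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Sketch only: delegate everything to the default postings format and hook
// the suffix array / WildcardHelper construction into the read side.
public final class SuffixArrayPostingsFormat extends PostingsFormat {
  private final PostingsFormat delegate = new Lucene50PostingsFormat();

  public SuffixArrayPostingsFormat() {
    super("SuffixArray50");   // hypothetical SPI name
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return delegate.fieldsConsumer(state);   // write side is unchanged
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    FieldsProducer in = delegate.fieldsProducer(state);
    // In the real patch this is where `in` would be wrapped so each field's
    // Terms also holds onto the WildcardHelper / suffix array, which a
    // codec-aware WildcardQuery impl can then recognize and use.
    return in;
  }
}
{noformat}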

Separately, I am curious how [~dawid.weiss]'s idea (also index the reversed form of the field,
then do two prefix searches and intersect the resulting terms) compares in performance (index
time, index size, search heap, query cost) to the suffix array.
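
To spell out the term-level intersection I mean: for a pattern like {{ab*cd}}, and assuming the
same tokens are also indexed reversed under a hypothetical parallel field {{body_rev}} (e.g. via
{{ReverseStringFilter}}) with {{reader}} an open {{IndexReader}}, a rough sketch would be:

{noformat}
// Collect terms with prefix "ab" from the forward field, terms whose reversal
// starts with "dc" from the reversed field, and intersect the two term sets
// (plus a length check so the prefix and suffix cannot overlap in the term).
Set<String> withPrefix = new HashSet<>();
TermsEnum fwd = MultiFields.getTerms(reader, "body").iterator();
if (fwd.seekCeil(new BytesRef("ab")) != TermsEnum.SeekStatus.END) {
  for (BytesRef t = fwd.term(); t != null && StringHelper.startsWith(t, new BytesRef("ab")); t = fwd.next()) {
    withPrefix.add(t.utf8ToString());
  }
}

Set<String> withSuffix = new HashSet<>();
TermsEnum rev = MultiFields.getTerms(reader, "body_rev").iterator();
if (rev.seekCeil(new BytesRef("dc")) != TermsEnum.SeekStatus.END) {
  for (BytesRef t = rev.term(); t != null && StringHelper.startsWith(t, new BytesRef("dc")); t = rev.next()) {
    withSuffix.add(new StringBuilder(t.utf8ToString()).reverse().toString());
  }
}

withPrefix.retainAll(withSuffix);   // surviving terms match "ab*cd" (modulo the overlap check)
{noformat}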

The patch falls back to Java's {{Pattern}} for checking each term in the more complex cases,
but couldn't you just use the {{CompiledAutomaton}}'s {{run}} method to check instead?
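
I.e. something like this sketch: compile the wildcard pattern to a {{CompiledAutomaton}} once per
query, then test each candidate term's raw bytes against its {{runAutomaton}} (non-null here,
since a pattern containing {{*}} compiles to a NORMAL automaton); the iterable of candidate terms
is hypothetical:

{noformat}
// Compile once per query, instead of decoding each term to a String and
// running java.util.regex.Pattern on it.
CompiledAutomaton compiled =
    new CompiledAutomaton(WildcardQuery.toAutomaton(new Term("body", "*foo*bar")));

for (BytesRef candidate : candidateTermsFromSuffixArray) {   // hypothetical iterable of terms
  if (compiled.runAutomaton.run(candidate.bytes, candidate.offset, candidate.length)) {
    // the term really matches the full wildcard pattern
  }
}
{noformat}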

I ran Lucene's tests w/ this patch, but first 1) hard-wired the property check to {{true}},
and 2) made the init synchronous (not using the thread pool); some of Lucene's tests then hit
failures, e.g.:

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestTerms -Dtests.method=testTermMinMaxRandom
-Dtests.seed=11FDF2AE5B77A883 -Dtests.locale=es-GT -Dtests.timezone=CET -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.04s J0 | TestTerms.testTermMinMaxRandom <<<
   [junit4]    > Throwable #1: java.lang.AssertionError
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([11FDF2AE5B77A883:5D8E7DDA37EB73E6]:0)
   [junit4]    > 	at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:593)
   [junit4]    > 	at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:152)
   [junit4]    > 	at org.apache.lucene.codecs.blocktree.WildcardHelper.<init>(WildcardHelper.java:106)
   [junit4]    > 	at org.apache.lucene.codecs.blocktree.FieldReader.<init>(FieldReader.java:106)
   [junit4]    > 	at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:234)
   [junit4]    > 	at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
   [junit4]    > 	at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:109)
   [junit4]    > 	at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
   [junit4]    > 	at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
   [junit4]    > 	at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:195)
   [junit4]    > 	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:103)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:473)
   [junit4]    > 	at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:376)
   [junit4]    > 	at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:313)
   [junit4]    > 	at org.apache.lucene.index.TestTerms.testTermMinMaxRandom(TestTerms.java:76)
   [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
{noformat}

But those failures are possibly harmless, because those tests are sending non-UTF-8 data into
Lucene, whereas this change (the property) would only be enabled on fields that are UTF-8.
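
For example (illustration only): with assertions enabled, decoding a lone UTF-8 continuation
byte, which those tests can legitimately index as a binary term, trips exactly the assertion in
the trace above:

{noformat}
// A lone continuation byte is not valid UTF-8, so decoding it (as WildcardHelper
// does via utf8ToString above) hits the UnicodeUtil assertion when running w/ -ea.
BytesRef binaryTerm = new BytesRef(new byte[] { (byte) 0x80 });
String decoded = binaryTerm.utf8ToString();   // AssertionError, as in the trace above
{noformat}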

It also hit a stack overflow w/ a long term:

{noformat}
   [junit4] Suite: org.apache.lucene.index.TestIndexWriter
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testWickedLongTerm
-Dtests.seed=524558B2F180613F -Dtests.locale=ru-RU -Dtests.timezone=Atlantic/Faroe -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   2.47s J1 | TestIndexWriter.testWickedLongTerm <<<
   [junit4]    > Throwable #1: java.lang.StackOverflowError
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([524558B2F180613F:120594CA57A7BADF]:0)
   [junit4]    > 	at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:82)
   [junit4]    > 	at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
   [junit4]    > 	at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
   [junit4]    > 	at org.apache.lucene.codecs.blocktree.SuffixArrayBytes.sort(SuffixArrayBytes.java:84)
{noformat}

I guess your radix sort implementation is consuming one Java stack frame per character in
the term.  Maybe in practice this is OK too.
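
If it does become a problem, one generic fix (illustration only, not the patch's
{{SuffixArrayBytes}}) is to drive the MSD radix sort from an explicit work list, so the stack
depth stays constant even for wickedly long terms:

{noformat}
// Sketch: bucket by one byte per pass, but keep the pending ranges on a heap
// deque instead of the Java call stack (uses java.util collections only).
static void msdRadixSort(byte[][] terms) {
  Deque<int[]> work = new ArrayDeque<>();              // {lo, hi, depth} ranges left to sort
  work.push(new int[] { 0, terms.length, 0 });
  while (!work.isEmpty()) {
    int[] range = work.pop();
    int lo = range[0], hi = range[1], depth = range[2];
    if (hi - lo < 2) continue;
    // bucket terms[lo..hi) by the (unsigned) byte at `depth`; -1 = term ended
    Map<Integer, List<byte[]>> buckets = new TreeMap<>();
    for (int i = lo; i < hi; i++) {
      int key = depth < terms[i].length ? (terms[i][depth] & 0xff) : -1;
      buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(terms[i]);
    }
    // write the buckets back in order and queue each one for the next byte
    int i = lo;
    for (Map.Entry<Integer, List<byte[]>> e : buckets.entrySet()) {
      int bucketLo = i;
      for (byte[] t : e.getValue()) {
        terms[i++] = t;
      }
      if (e.getKey() != -1) {                          // exhausted terms need no further sorting
        work.push(new int[] { bucketLo, i, depth + 1 });
      }
    }
  }
}
{noformat}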


> Use Suffix Arrays for fast search with leading asterisks
> --------------------------------------------------------
>
>                 Key: LUCENE-7639
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7639
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Yakov Sirotkin
>         Attachments: suffix-array.patch
>
>
> If a query term starts with an asterisk, the FST has to check every word in the dictionary, so request processing speed drops sharply. This problem can be solved with the suffix array approach. Luckily, the suffix array can be constructed from the existing index after Lucene starts. Unfortunately, suffix arrays require a lot of RAM, so they are used only when a special flag is set:
> -Dsolr.suffixArray.enable=true
> Suffix array initialization can be sped up by using several threads; the number of threads is controlled with
> -Dsolr.suffixArray.initialization_treads_count=5
> This system property can be omitted; the default value is 5.
> The attached patch is the suggested implementation of suffix array support; it handles all query terms that start with an asterisk and contain at least 3 consecutive non-wildcard characters. The patch does not change search results and affects only performance.
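
For readers unfamiliar with the approach, the lookup the description refers to boils down to a
binary search over all suffixes of all terms; a toy illustration (plain strings and in-memory
lists, not the attached byte-level patch):

{noformat}
// Index every suffix of every term once, then answer the literal fragment of a
// "*oog*"-style query with a binary search instead of scanning the dictionary.
String[] terms = { "apple", "google", "googol", "moodle" };   // the term dictionary
List<int[]> suffixes = new ArrayList<>();                     // {termOrd, startOffset}
for (int ord = 0; ord < terms.length; ord++) {
  for (int off = 0; off < terms[ord].length(); off++) {
    suffixes.add(new int[] { ord, off });
  }
}
suffixes.sort((a, b) -> terms[a[0]].substring(a[1]).compareTo(terms[b[0]].substring(b[1])));

// all terms containing "oog": lower-bound binary search, then walk forward
// while the fragment is still a prefix of the suffix.
String fragment = "oog";
int lo = 0, hi = suffixes.size();
while (lo < hi) {
  int mid = (lo + hi) >>> 1;
  int[] s = suffixes.get(mid);
  if (terms[s[0]].substring(s[1]).compareTo(fragment) < 0) lo = mid + 1; else hi = mid;
}
Set<String> hits = new TreeSet<>();
for (int i = lo; i < suffixes.size(); i++) {
  int[] s = suffixes.get(i);
  if (!terms[s[0]].startsWith(fragment, s[1])) break;
  hits.add(terms[s[0]]);    // the full wildcard pattern is still verified afterwards
}
// hits == [google, googol]
{noformat}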



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
