lucene-solr-user mailing list archives

From Webster Homer <webster.ho...@milliporesigma.com>
Subject RE: Query kills Solrcloud
Date Wed, 02 Jan 2019 19:55:04 GMT
We are still having serious problems with our solrcloud failing due to this problem.
The problem is clearly data related. 
How can I determine what documents are being searched? Is it possible to get Solr/lucene to
output the docids being searched?

I believe that this is a lucene bug, but I need to narrow the focus to a smaller number of
records, and I'm not certain how to do that efficiently. Are there debug parameters that could
help?

-----Original Message-----
From: Webster Homer <webster.homer@milliporesigma.com> 
Sent: Thursday, December 20, 2018 3:45 PM
To: solr-user@lucene.apache.org
Subject: Query kills Solrcloud

We are experiencing almost nightly Solr crashes due to Japanese queries. I’ve been able
to determine that one of our field types seems to be the culprit. When I run a much reduced
version of the query against our DEV SolrCloud, I see the memory usage jump from less than
a GB to 5 GB using only a single field in the query. The collection is fairly small, ~411,000
documents, of which only ~25,000 have searchable Japanese fields. I have been able to simplify
the query to run against a single Japanese field in the schema. The JVM memory jumps from
less than a GB to close to 5 GB, and back down. The QTime is 36959 ms, which seems high for ~2500
documents. Indeed, the single field that I’m using in my test case has only 2031 documents.

I extended the query to 5 fields and watched the memory usage in the Solr admin console.
The memory usage goes to almost 6 GB with a QTime of 100909 ms. The console shows connection
errors, and when I look at the Cloud graph, all the replicas on the node where I submitted
the query are down. In DEV the replicas eventually recover. In production, with the full query,
which has many more fields in the qf parameter, the SolrCloud cluster dies.
One example query term:
ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル
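
For scale, here is my rough model of what CJKBigramFilterFactory with outputUnigrams="true" emits for a single uninterrupted run of CJK characters: n unigrams plus n-1 bigrams, so nearly 2n terms. This is only a back-of-the-envelope sketch, not Lucene's actual implementation (the real filter also interacts with the tokenizer and script boundaries):

```python
def expanded_terms(run: str) -> list[str]:
    """Rough model of CJKBigramFilter with outputUnigrams=true for one
    run of CJK characters: every single character plus every adjacent
    pair. For counting purposes only."""
    unigrams = list(run)
    bigrams = [run[i:i + 2] for i in range(len(run) - 1)]
    return unigrams + bigrams

query = "ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル"
print(len(query), "characters ->", len(expanded_terms(query)), "terms")
```

Under that model the 25-character term above expands to 49 terms per field searched, before synonym expansion multiplies it further.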

This is the field type that we have defined:
    <fieldtype name="text_deep_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <!-- remove spaces between CJK characters -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                    replacement="$1"/>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Transform Traditional Han to Simplified Han -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
        <!-- NFKC, case folding, diacritics removed -->
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
      </analyzer>

      <analyzer type="query">
        <!-- remove spaces between CJK characters -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                    replacement="$1"/>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Transform Traditional Han to Simplified Han -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
        <!-- NFKC, case folding, diacritics removed -->
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
      </analyzer>
    </fieldtype>
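
To double-check what that charFilter is doing, I put together a small Python approximation of the pattern. Note that the Java \p{IsHan}-style property classes are replaced with basic code-point ranges here, so this is only a sketch, not an exact equivalent:

```python
import re

# Approximate the Java \p{IsHangul}/\p{IsHan}/\p{IsKatakana}/\p{IsHiragana}
# property classes with basic code-point ranges (the real Unicode blocks
# are broader; this is a simplification for illustration).
CJK = "\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af"
PATTERN = re.compile(f"([{CJK}]+)\\s+(?=[{CJK}])")

def strip_cjk_spaces(text: str) -> str:
    # Same idea as the PatternReplaceCharFilterFactory above: delete
    # whitespace between runs of CJK characters, keep it elsewhere.
    return PATTERN.sub(r"\1", text)

print(strip_cjk_spaces("安息 香酸 ヘキシル"))  # -> 安息香酸ヘキシル
print(strip_cjk_spaces("hexyl 安息香酸"))      # -> hexyl 安息香酸
```

It collapses the spaces between CJK runs so the tokenizer sees one long run, which also means a long pasted chemical name reaches the bigram filter as a single run.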

Why is searching even one field of this type so expensive?
I suspect that this is data related, as other queries return in far less than a second. What
are good strategies for determining which documents are causing the problem? I’m new to debugging
Solr, so I could use some help. I’d like to reduce the number of records to the minimum needed
to create a small dataset that reproduces the problem.
Right now our only option is to stop using this field type, but it does improve the relevancy
of the searches that don’t cause Solr to crash.

It would be a great help if the Solr admin console did not time out on these queries; is there a
way to turn off the timeout?
We are running Solr 7.2.
