lucene-solr-user mailing list archives

From Gus Heck <gus.h...@gmail.com>
Subject Re: Query kills Solrcloud
Date Wed, 02 Jan 2019 20:11:00 GMT
Are you able to re-index a subset into a new collection?
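If so, a rough sketch with curl might look like the following (host, configset, collection, and field names are placeholders for your setup, and this assumes `jq` is available):

```shell
# All names here (host, configset, collection, field) are placeholders.
SOLR="http://localhost:8983/solr"

# Create a small scratch collection that reuses the same configset,
# so the suspect field type is defined identically.
curl "$SOLR/admin/collections?action=CREATE&name=jp_debug&numShards=1&replicationFactor=1&collection.configName=myconfig" \
  || echo "collection create failed (is Solr running?)"

# Pull a subset of documents that actually have the Japanese field,
# dropping _version_ so the re-index does not hit version conflicts.
curl "$SOLR/mycollection/select?q=text_deep_cjk_field:*&rows=1000&wt=json" \
  | jq '.response.docs | map(del(._version_))' > subset.json \
  || echo "export failed"

# Re-post the subset into the scratch collection.
curl "$SOLR/jp_debug/update?commit=true" \
  -H 'Content-Type: application/json' -d @subset.json \
  || echo "import failed"
```

From there you can bisect the subset (halve `rows`, or filter on id ranges) until the pathological documents are isolated.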

For control of timeouts I would suggest Postman or curl, or some other
non-browser client.
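For instance (hostname, collection, and field name below are placeholders; note that `timeAllowed` is best-effort on the server side and can return partial results):

```shell
# Placeholders: adjust host, collection, and field for your cluster.
SOLR_URL="http://localhost:8983/solr/mycollection/select"

# --connect-timeout / --max-time bound the client side of the request;
# timeAllowed (milliseconds) asks Solr to stop searching once the budget
# is spent, which may yield partial results (partialResults=true).
curl --connect-timeout 10 --max-time 600 "$SOLR_URL" \
  --data-urlencode 'q=text_deep_cjk_field:ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル' \
  --data-urlencode 'timeAllowed=300000' \
  --data-urlencode 'debug=timing' \
  || echo "request failed or timed out"
```

Unlike the admin UI, curl will wait as long as you tell it to, and `debug=timing` shows which search component is eating the time.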

On Wed, Jan 2, 2019 at 2:55 PM Webster Homer <
webster.homer@milliporesigma.com> wrote:

> We are still having serious problems with our solrcloud failing due to
> this problem.
> The problem is clearly data related.
> How can I determine what documents are being searched? Is it possible to
> get Solr/Lucene to output the docids being searched?
>
> I believe that this is a Lucene bug, but I need to narrow the focus to a
> smaller number of records, and I'm not certain how to do that efficiently.
> Are there debug parameters that could help?
>
> -----Original Message-----
> From: Webster Homer <webster.homer@milliporesigma.com>
> Sent: Thursday, December 20, 2018 3:45 PM
> To: solr-user@lucene.apache.org
> Subject: Query kills Solrcloud
>
> We are experiencing almost nightly solr crashes due to Japanese queries.
> I’ve been able to determine that one of our field types seems to be a
> culprit. When I run a much reduced version of the query against our DEV
> solrcloud I see the memory usage jump from less than a gb to 5gb using only
> a single field in the query. The collection is fairly small ~411,000
> documents of which only ~25,000 have searchable Japanese fields. I have
> been able to simplify the query to run against a single Japanese field in
> the schema. The JVM memory jumps from less than a gig to close to 5 gb, and
> back down. The QTime is 36959 which seems high for 2500 documents. Indeed
> the single field that I’m using in my test case has 2031 documents.
>
> I extended the query to 5 fields and watched the memory usage in the Solr
> Console application. The memory usage goes to almost 6gb with a QTime of
> 100909. The Solr Console shows connection errors, and when I look at the
> Cloud graph all the replicas on the node where I submitted the query are
> down. In dev the replicas eventually recover. In production, with the full
> query, which has a lot more fields in the qf parameter, the SolrCloud dies.
> One example query term:
> ジエチルアミノヒドロキシベンゾイル安息香酸ヘキシル
>
> This is the field type that we have defined:
>    <fieldtype name="text_deep_cjk" class="solr.TextField"
>               positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>      <analyzer type="index">
>        <!-- remove spaces between CJK characters -->
>        <charFilter class="solr.PatternReplaceCharFilterFactory"
>                    pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
>                    replacement="$1"/>
>        <tokenizer class="solr.ICUTokenizerFactory"/>
>        <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
>        <filter class="solr.CJKWidthFilterFactory"/>
>        <!-- Transform Traditional Han to Simplified Han -->
>        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
>        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
>        <!-- NFKC, case folding, diacritics removed -->
>        <filter class="solr.ICUFoldingFilterFactory"/>
>        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>                katakana="true" hangul="true" outputUnigrams="true"/>
>      </analyzer>
>
>      <analyzer type="query">
>        <!-- remove spaces between CJK characters -->
>        <charFilter class="solr.PatternReplaceCharFilterFactory"
>                    pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
>                    replacement="$1"/>
>        <tokenizer class="solr.ICUTokenizerFactory"/>
>        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
>                ignoreCase="true" expand="true"
>                tokenizerFactory="solr.ICUTokenizerFactory"/>
>        <filter class="solr.CJKWidthFilterFactory"/>
>        <!-- Transform Traditional Han to Simplified Han -->
>        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>        <!-- Transform Hiragana to Katakana just as was done for Endeca -->
>        <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
>        <!-- NFKC, case folding, diacritics removed -->
>        <filter class="solr.ICUFoldingFilterFactory"/>
>        <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>                katakana="true" hangul="true" outputUnigrams="true"/>
>      </analyzer>
>    </fieldtype>
>
> Why is searching even 1 field of this type so expensive?
> I suspect that this is data related, as other queries return in far less
> than a second. What are good strategies for determining what documents are
> causing the problem? I’m new to debugging Solr so I could use some help.
> I’d like to reduce the number of records to a minimum to create a small
> dataset to reproduce the problem.
> Right now our only option is to stop using this fieldtype, but it does
> improve the relevancy of searches that don’t cause Solr to crash.
>
> It would be a great help if the Solr Console would not time out on these
> queries; is there a way to turn off the timeout?
> We are running Solr 7.2.
>


-- 
http://www.the111shift.com
