lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Faceting Word Count
Date Mon, 06 Nov 2017 12:15:04 GMT
Hi Wael,
You are faceting on an analyzed field. This results in the field being uninverted (the
fieldValueCache being built) on the first call after every commit. This is both time- and
memory-consuming (you can check in the admin console stats how much memory it took).
What you need to do is create a multivalued string field (not text), parse the values (do the
analysis steps) on the client side, and store them like that. This will allow you to enable
docValues on that field and avoid building the fieldValueCache.
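
As a rough sketch of the schema side of this suggestion: the field name `content_words` below is hypothetical, and the `string` type is assumed to map to `solr.StrField` as in the default schema. The client would send one already-analyzed token per value.

```xml
<!-- Hypothetical field name; "string" is assumed to be the stock solr.StrField type. -->
<!-- multiValued holds one token per value; docValues avoids uninverting at facet time. -->
<field name="content_words" type="string" indexed="true" stored="false"
       multiValued="true" docValues="true"/>
```

Faceting would then target this field (for example `facet.field=content_words`) instead of the analyzed text field. Enabling docValues on a field requires reindexing the documents.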

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 6 Nov 2017, at 13:06, Wael Kader <wael@softech-lb.com> wrote:
> 
> Hi,
> 
> I am using a custom field. Below is the field definition.
> I am using this because I don't want stemming.
> 
> 
>    <fieldType name="text_no_stem2" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>        <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
>                protected="protwords.txt"
>                generateWordParts="0"
>                generateNumberParts="1"
>                catenateWords="1"
>                catenateNumbers="1"
>                catenateAll="0"
>                splitOnCaseChange="1"
>                preserveOriginal="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
> <!--ORIGINAL                generateNumberParts="1"-->
>        <filter class="solr.WordDelimiterFilterFactory"
>                protected="protwords.txt"
>                generateWordParts="0"
>                catenateWords="0"
>                catenateNumbers="0"
>                catenateAll="0"
>                splitOnCaseChange="1"
>                preserveOriginal="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
> language="English" protected="protwords.txt"/-->
>        <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
> word match -->
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> 
> Regards,
> Wael
> 
> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
> emir.arnautovic@sematext.com> wrote:
> 
>> Hi Wael,
>> Can you provide your field definition and a sample query?
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 6 Nov 2017, at 08:30, Wael Kader <wael@softech-lb.com> wrote:
>>> 
>>> Hello,
>>> 
>>> I have an index with around 100 million documents.
>>> I have a multivalued column in which I am saving big chunks of text data.
>>> The machine has around 20 GB of RAM and 4 CPUs.
>>> 
>>> I was faceting on it to build a word cloud, and it took around 1
>>> second to retrieve when the data was 5-10 million documents.
>>> Now I have more data and it's taking minutes to get the results (that is,
>>> if it gets them and Solr doesn't crash). What's the best way to make it run,
>>> or is it perhaps not scalable on my current schema and design
>>> with news articles?
>>> 
>>> I am looking for the best solution for this. Maybe create another index
>>> to split the data while inserting it, or maybe if I change some settings in
>>> solrconfig.xml or add some RAM, it would perform better.
>>> 
>>> --
>>> Regards,
>>> Wael
>> 
>> 
> 
> 
> -- 
> Regards,
> Wael

