lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Tweaking SOLR memory and cull facet words
Date Fri, 27 Mar 2015 13:38:20 GMT
On 3/27/2015 4:14 AM, phiroc@free.fr wrote:
> Hi,
> 
> my SOLR 5 solrconfig.xml file contains the following lines:
> 
> <!-- Faceting defaults -->
> <str name="facet">on</str>
> <str name="facet.field">text</str>
> <str name="facet.mincount">100</str>
> 
> 
> where the 'text' field contains thousands of words.
> 
> When I start SOLR, the search engine takes several minutes to index the words in the
> 'text' field (although loading the browse template later only takes a few seconds because
> the 'text' field has already been indexed).
> 
> Here are my questions:
> 
> - should I increase SOLR's JVM memory to make initial indexing faster?
> 
> e.g., SOLR_JAVA_MEM="-Xms1024m -Xmx204800m" in solr.in.sh
> 
> - how can I cull facet words according to certain criteria (length, case, etc.)? For
> instance, my facets are the following:
> 
>     application (22427)
>     inytapdf0 (22427)
>     pdf (22427)
>     the (22334)
>     new (22131)
>     herald (21983)
>     york (21975)
>     paris (21780)
>     a (21692)
>     and (21298)
>     of (21288)
>     i (21247)
>     in (21062)
>     to (20918)
>     on (20899)
>     m (20857)
>     by (20733)
>     de (20664)
>     for (20580)
>     at (20417)
>     with (20371)
> ...
> 
> Obviously, words such as "the", "i", "to", "m", etc. should not be indexed. Furthermore,
> I don't care about "nouns". I am only interested in people and location names.

Starting Solr does not index anything, unless you are talking about one
of the sidecar indexes for spelling correction or suggestions.  You must
send indexing requests to Solr, and if you are experiencing slow
indexing, chances are that it's because of slowness in obtaining data
from the source, not Solr ... or that you are indexing with a single
thread.  If you can set up multiple threads or processes that are
indexing in parallel, it should go faster.

Thousands of terms are not hard for Solr to handle at all.  When the
number of terms gets into the millions or billions, then it starts
becoming a hard problem.

If you use the stopword filter on the index analysis chain for the field
that you are using for facets, then all the stopwords will be removed
from the facets.  That would change how searches work on the field, so
you will probably want to use copyField to create a new field that you
use for faceting.  There are other filters that can do things you have
mentioned, like LengthFilterFactory:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
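
As a rough sketch of the copyField approach described above, something
like the following could go in schema.xml.  The names "text_facet" and
"text_facet_type" are made up for illustration, the length bounds are
arbitrary, and I am assuming that your source field is named "text" and
that a stopwords.txt file exists in the conf directory.  Treat it as a
starting point, not a tested configuration:

```xml
<!-- Hypothetical field type used only for faceting; names are illustrative -->
<fieldType name="text_facet_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Removes every term listed in stopwords.txt (assumed to exist in conf/) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Removes terms shorter than 3 or longer than 25 characters; bounds are arbitrary -->
    <filter class="solr.LengthFilterFactory" min="3" max="25"/>
  </analyzer>
</fieldType>

<!-- Separate facet field, so searches on "text" keep their current behavior -->
<field name="text_facet" type="text_facet_type" indexed="true" stored="false" multiValued="true"/>
<copyField source="text" dest="text_facet"/>
```

With a setup along these lines, facet.field in solrconfig.xml would point
at the new field instead of "text", and you would need to reindex before
the new field contains anything.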

As for Java heap sizing, trial and error is about the only way to
find the right size.

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Thanks,
Shawn

