lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Very high memory and CPU utilization.
Date Tue, 03 Nov 2015 06:34:34 GMT
One rule of thumb for Solr is to shard after you reach 100 million documents. With large documents,
you might want to shard sooner.

We are running an unsharded index of 7 million documents (55GB) without problems.

The EdgeNgramFilter generates a set of prefix terms for each term in the document. For the
term “secondary”, it would generate:

s
se
sec
seco
secon
second
seconda
secondar
secondary

Obviously, this makes the index larger. But it makes prefix match a simple lookup, without
needing wildcards.
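For reference, a minimal sketch of what that looks like as a field type in schema.xml (the type name is illustrative, and minGramSize/maxGramSize should be tuned to your data). The grams are generated at index time only; the query analyzer deliberately leaves the user's prefix whole so it matches the stored grams directly:

```xml
<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index "secondary" as s, se, sec, ... up to 15 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```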

Again, we can help you more if you describe what you are trying to do.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 2, 2015, at 9:39 PM, Modassar Ather <modather1981@gmail.com> wrote:
> 
> Thanks Walter for your response,
> 
> Each shard holds around 90GB of index (around 8 million documents), and
> there are 12 such shards. As per my understanding, sharding is required
> in this case. Please help me understand if it is not.
> 
> We have a requirement to provide full wildcard support to our users.
> I will try using EdgeNgramFilter. Can you please help me understand whether
> EdgeNgramFilter can be a replacement for wildcards?
> There are situations where words may be extended with special characters,
> e.g. for se* there should also be a match on secondary-school, which needs
> to be considered.
> 
> Regards,
> Modassar
> 
> 
> 
> On Mon, Nov 2, 2015 at 10:17 PM, Walter Underwood <wunder@wunderwood.org>
> wrote:
> 
>> To back up a bit, how many documents are in this 90GB index? You might not
>> need to shard at all.
>> 
>> Why are you sending a query with a trailing wildcard? Are you matching the
>> prefix of words, for query completion? If so, look at the suggester, which
>> is designed to solve exactly that. Or you can use the EdgeNgramFilter to
>> index prefixes. That will make your index larger, but prefix searches will
>> be very fast.
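If the suggester route fits the use case, the wiring in solrconfig.xml is roughly the following sketch (the suggester name and field are illustrative; see the Solr Reference Guide for the available lookup and dictionary implementations):

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <!-- infix lookup matches the prefix anywhere in the analyzed text -->
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```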
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 2, 2015, at 5:17 AM, Toke Eskildsen <te@statsbiblioteket.dk>
>> wrote:
>>> 
>>> On Mon, 2015-11-02 at 17:27 +0530, Modassar Ather wrote:
>>> 
>>>> The query q=network se* is quick enough in our system too. It takes
>>>> around 3-4 seconds for around 8 million records.
>>>> 
>>>> The problem is with the same query as phrase. q="network se*".
>>> 
>>> I misunderstood your query then. I tried replicating it with
>>> q="der se*"
>>> 
>>> http://rosalind:52300/solr/collection1/select?q=%22der+se*%
>>> 22&wt=json&indent=true&facet=false&group=true&group.field=domain
>>> 
>>> gets expanded to
>>> 
>>> "parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
>>> author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
>>> svane* | description:\"kan svane\")) ())/no_coord"
>>> 
>>> The result was 1,043,258,271 hits in 15,211 ms
>>> 
>>> 
>>> Interestingly enough, a search for
>>> q="kan svane*"
>>> resulted in 711 hits in 12,470 ms. Maybe because 'kan' alone matches 1
>>> billion+ documents. On that note,
>>> q=se*
>>> resulted in -951812427 hits in 194,276 ms.
>>> 
>>> Now this is interesting. The negative number seems to be caused by
>>> grouping, but I finally got the response time up in the minutes. Still
>>> no memory problems though. Hits without grouping were 3,343,154,869.
>>> 
>>> For comparison,
>>> q=http
>>> resulted in -1527418054 hits in 87,464 ms. Without grouping the hit
>>> count was 7,062,516,538. Twice the hits of 'se*' in half the time.
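For what it's worth, those negative numbers are exactly what the ungrouped totals (3,343,154,869 and 7,062,516,538) become when squeezed into a signed 32-bit integer, presumably somewhere in the grouping code path. A quick sketch in plain Python, just to illustrate the arithmetic:

```python
def as_int32(n):
    """Reinterpret an integer as a signed 32-bit value (two's complement)."""
    n &= 0xFFFFFFFF                                   # keep the low 32 bits
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n

# Ungrouped hit counts from this thread, wrapped to 32 bits:
print(as_int32(3_343_154_869))  # -951812427  (the grouped count for q=se*)
print(as_int32(7_062_516_538))  # -1527418054 (the grouped count for q=http)
```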
>>> 
>>>> I changed my SolrCloud setup from 12 shards to 8 shards and gave each
>>>> shard 30GB of RAM on the same machine with the same index size
>>>> (re-indexed), but could not see a significant improvement for the
>>>> given query.
>>> 
>>> Strange. I would have expected the extra free memory for disk cache to
>>> help performance.
>>> 
>>>> Also, can you please share your experience with respect to RAM, GC,
>>>> Solr cache setup, etc.? From your comments it seems your SolrCloud
>>>> environment is similar to the one I work on.
>>>> 
>>> There is a short write-up at
>>> https://sbdevel.wordpress.com/net-archive-search/
>>> 
>>> - Toke Eskildsen, State and University Library, Denmark
>>> 
>>> 
>>> 
>> 
>> 

