lucene-solr-user mailing list archives

From James Brady <james.colin.br...@gmail.com>
Subject Re: Strategy for handling large (and growing) index: horizontal partitioning?
Date Tue, 04 Mar 2008 04:51:50 GMT
Hi Kevin,
Thanks for your suggestions - I've got about 6 million documents, and am
being quite stingy with my schema at the moment, I'm afraid.

If anything, the size of each document is going to go up, not down,  
but I might be able to prune some older, unused data.

James
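
For a sense of scale, the figures in this thread can be combined into a rough
per-document index size. This is only a back-of-the-envelope sketch: it
assumes the ~6 million documents above and the ~50GB index size quoted further
down describe the same index, and that "GB" means 10**9 bytes. The function
name is illustrative.

```python
# Rough average index size per document.
# Assumptions (not stated in the thread): both figures describe the same
# index, and 1 GB = 10**9 bytes.
INDEX_SIZE_BYTES = 50 * 10**9
DOCUMENT_COUNT = 6_000_000


def average_bytes_per_document(index_bytes: int, doc_count: int) -> float:
    """Return the mean number of index bytes attributable to one document."""
    if doc_count <= 0:
        raise ValueError("doc_count must be positive")
    return index_bytes / doc_count


avg = average_bytes_per_document(INDEX_SIZE_BYTES, DOCUMENT_COUNT)
print(f"~{avg / 1000:.1f} kB of index per document")
```

Under those assumptions the average comes to roughly 8 kB per document, which
is the number that changing stored/indexed options on fields would move.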

On 3 Mar 2008, at 14:33, Kevin Lewandowski wrote:

> How many documents are in the index?
>
> If you haven't already done this I'd take a really close look at your
> schema and make sure you're only storing the things that should really
> be stored, same with the indexed fields. I drastically reduced my
> index size just by changing some indexed/stored options on a few
> fields.
>
> On Thu, Feb 28, 2008 at 10:54 PM, Otis Gospodnetic
> <otis_gospodnetic@yahoo.com> wrote:
>> James,
>>
>>  I can't comment more on the SN's arch choices.
>>
>>  Here is the story about your questions
>>  - 1 Solr instance can hold 1+ indices, either via JNDI (see Wiki) or
>>    via the new multi-core support, which works but is still being
>>    hacked on
>>  - See SOLR-303 in JIRA for distributed search.  Yonik committed it
>>    just the other day, so now that's in nightly builds if you want to
>>    try it.  There are 2 Wiki pages about that, too; see the Recent
>>    changes log on the Wiki to quickly find them.
>>
>>
>>  Otis
>>  --
>>  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>  ----- Original Message ----
>>> From: James Brady <james.colin.brady@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Friday, February 29, 2008 1:11:07 AM
>>> Subject: Re: Strategy for handling large (and growing) index:  
>>> horizontal partitioning?
>>>
>>> Hi Otis,
>>> Thanks for your comments -- I didn't realise the wiki is open to
>>> editing; my apologies. I've put in a few words to try and clear
>>> things up a bit.
>>>
>>> So determining n will probably be a best guess followed by trial and
>>> error, that's fine. I'm still not clear about whether a single Solr
>>> server can operate across several indices, however. Can anyone give
>>> me some pointers here?
>>> An alternative would be to have 1 index per instance, and n instances
>>> per server, where n is small. This might actually be a practical
>>> solution -- I'm spending ~20% of my time committing, so I should
>>> probably only have 3 or 4 indices in total per server to avoid two
>>> committing at the same time.
>>>
>>> Your mention of The Large Social Network was interesting! A social
>>> network's data is by definition pretty poorly partitioned by user id,
>>> so unless they've done something extremely clever like co-locating
>>> social cliques in the same indices, I would have thought it would be a
>>> sub-optimal architecture. If my friends and I are scattered around
>>> different indices, each search would have to be federated massively.
>>>
>>> James
>>>
>>>
>>> On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote:
>>>
>>>> James,
>>>>
>>>> Regarding your questions about n users per index - this is a fine
>>>> approach.  The largest Social Network that you know of uses the
>>>> same approach for various things, including full-text indices (not
>>>> Solr, but close).  You'd have to maintain user->shard/index mapping
>>>> somewhere, of course.  What should the n be, you ask?  Look at the
>>>> overall index size, I'd say, against server capabilities (RAM,
>>>> disk, CPU), increase n up to a point where you're maximizing your
>>>> hardware at some target query rate.
>>>>
>>>> Otis
>>>> --
>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>
>>>> ----- Original Message ----
>>>>> From: James Brady
>>>>> To: solr-user@lucene.apache.org
>>>>> Sent: Wednesday, February 27, 2008 10:08:02 PM
>>>>> Subject: Strategy for handling large (and growing) index:
>>>>> horizontal partitioning?
>>>>>
>>>>> Hi all,
>>>>> Our current setup is a master and slave pair on a single machine,
>>>>> with an index size of ~50GB.
>>>>>
>>>>> Query and update times are still respectable, but commits are taking
>>>>> ~20% of time on the master, while our daily index optimise can take
>>>>> up to 4 hours...
>>>>> Here's the most relevant part of solrconfig.xml:
>>>>>      true
>>>>>      10
>>>>>      1000
>>>>>      10000
>>>>>      10000
>>>>>
>>>>> I've given both master and slave 2.5GB of RAM.
>>>>>
>>>>> Does an index optimise read and re-write the whole thing? If so,
>>>>> taking about 4 hours is pretty good! However, the documentation here:
>>>>> http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
>>>>> states "Optimizations can take nearly ten minutes to run..." which
>>>>> leads me to think that we've grossly misconfigured something...
>>>>>
>>>>> Firstly, we would obviously love any way to reduce this optimise time
>>>>> - I have yet to experiment extensively with the settings above, and
>>>>> optimise frequency, but some general guidance would be great.
>>>>>
>>>>> Secondly, this index size is increasing monotonically over time and as
>>>>> we acquire new users. We need to take action to ensure we can scale
>>>>> in the future. The approach we're favouring at the moment is
>>>>> horizontal partitioning of indices by user id as our data suits this
>>>>> scheme well. A given index would hold the indexed data for n users,
>>>>> where n would probably be between 1 and 100 users, and we will have
>>>>> multiple indices per search server.
>>>>>
>>>>> Running a server per index is impractical, especially for a small n, so
>>>>> is a single Solr instance capable of managing multiple searchers and
>>>>> writers in this way? Following on from that, does anyone know of
>>>>> limiting factors in Solr or Lucene that would influence our decision
>>>>> on the value of n - the number of users per index?
>>>>>
>>>>> Thanks!
>>>>> James
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
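
The user->shard/index mapping that Otis says has to be maintained somewhere
could be sketched roughly as below. This is a minimal in-memory illustration,
not anything Solr provides: the ShardDirectory class, its method names and the
fill-then-open-a-new-index policy are all assumptions made for the example. A
real deployment would presumably keep the table in a database and route
queries and updates through it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ShardDirectory:
    """Hypothetical user -> index lookup table (illustrative names only)."""

    users_per_index: int  # the "n" discussed in the thread
    _assignments: Dict[str, int] = field(default_factory=dict)
    _counts: List[int] = field(default_factory=list)  # users held by each index

    def index_for(self, user_id: str) -> int:
        """Return the index number for a user, assigning one on first sight."""
        if user_id in self._assignments:
            return self._assignments[user_id]
        # Fill the newest index until it holds users_per_index users,
        # then open a new one.
        if not self._counts or self._counts[-1] >= self.users_per_index:
            self._counts.append(0)
        index_id = len(self._counts) - 1
        self._counts[index_id] += 1
        self._assignments[user_id] = index_id
        return index_id


directory = ShardDirectory(users_per_index=2)
assigned = [directory.index_for(u) for u in ["alice", "bob", "carol", "alice"]]
print(assigned)  # [0, 0, 1, 0]
```

An explicit table is used here rather than hashing the user id because it lets
n be changed later, by trial and error, without moving users that are already
assigned.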

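The "3 or 4 indices per server" estimate quoted above can be checked with a
short calculation. The sketch assumes each index independently spends the same
~20% of its time committing; neither assumption comes from the thread, and
smaller partitioned indices may well commit faster than the current 50GB one.
The function name is illustrative.

```python
from math import comb


def prob_at_least_two_committing(num_indices: int, commit_fraction: float) -> float:
    """Probability that two or more indices are committing at a random instant.

    Assumes every index commits independently and spends commit_fraction of
    its time committing (a simple binomial model).
    """
    p_none = (1 - commit_fraction) ** num_indices
    p_exactly_one = (
        comb(num_indices, 1)
        * commit_fraction
        * (1 - commit_fraction) ** (num_indices - 1)
    )
    return 1 - p_none - p_exactly_one


for k in range(2, 6):
    print(k, round(prob_at_least_two_committing(k, 0.20), 3))
```

Under this model, commits overlap about 10% of the time with 3 indices and
about 18% of the time with 4, so overlap is reduced rather than avoided
outright.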
