lucene-solr-user mailing list archives

From Christoph Schmidt
Subject AW: Scaling to large Number of Collections
Date Mon, 01 Sep 2014 07:50:58 GMT
Yes, this would help us in our scenario.

-----Original Message-----
From: Jack Krupansky
Sent: Sunday, August 31, 2014 18:10
Subject: Re: Scaling to large Number of Collections

We should also consider "lightly-sharded" collections. IOW, even if a cluster has dozens or
a hundred nodes or more, the goal may not be to shard all collections across all nodes. That
is fine for the really large collections, but we should also support collections that only need
a few shards or even just a single shard, and focus attention on a large number of collections
rather than on heavily-sharded collections.

-- Jack Krupansky

-----Original Message-----
From: Erick Erickson
Sent: Sunday, August 31, 2014 12:04 PM
Subject: Re: Scaling to large Number of Collections

What is your access pattern? By that I mean, do all the cores need to be searchable at the same
time, or is it reasonable for them to be loaded on demand? The latter imposes a penalty: the
first time a collection is accessed, there is a delay while the core loads.
I suppose I'm asking "how many customers are using the system simultaneously?". One way around
the delay is to fire a dummy query behind the scenes when a user logs on but before she actually
executes a search.
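
A minimal sketch of that warm-up idea, assuming one core per customer reachable through the standard /select handler. The base URL and the core name "customer_42" are placeholders, not anything from this thread. The query matches everything but asks for zero rows, so its only purpose is to force the core to load.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def warmup_url(base_url: str, core: str) -> str:
    """Build a cheap match-all query that returns no rows."""
    params = urlencode({"q": "*:*", "rows": 0})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def warm_core(base_url: str, core: str, timeout: float = 30.0) -> int:
    """Fire the warm-up query and return the HTTP status code.

    Intended to be called from the login handler, ideally in the
    background so the login itself does not wait for the core to load.
    """
    with urlopen(warmup_url(base_url, core), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Placeholder host and core name; only the URL is printed here.
    print(warmup_url("http://localhost:8983/solr", "customer_42"))
```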

Why I'm asking:

See this page: It was intended for the multi-tenancy
case in which you could count on a subset of users being logged on at once.

WARNING! LotsOfCores is NOT supported in SolrCloud at this point! There has been some talk
of extending support to SolrCloud, but no action, as it's one of those cases that has lots
of implications, particularly around ZooKeeper knowing the state of all the cores, cores going
into recovery in a cascading fashion, etc. It's not at all clear that it _can_ be extended to
SolrCloud, for that matter, without doing great violence to the code.

With the LotsOfCores approach (and assuming somebody volunteers to code it up), the number
of cores hosted on a particular node can be many thousands.
The real limit is how many of them have to be up and running simultaneously. Beyond that, the
costs come from two places:
1> The time it takes to recursively walk your SOLR_HOME directory and
discover the cores (I see about 1,000 cores/second discovered on my laptop, admittedly with an
SSD, and there has been no optimization done to this process).
2> Having to keep a table of all the cores and their information (home
directory and the like) in memory, but practically I don't think this is a problem. I haven't
actually measured, but the size of each entry is almost certainly less than 1K and probably
closer to 0.5K.
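
As a back-of-the-envelope check of those two costs, the sketch below plugs in the figures quoted above: roughly 1,000 cores discovered per second, and between 0.5K and 1K of memory per core entry. The core counts in the loop are illustrative assumptions, and the per-entry size was stated as unmeasured, so treat the output as an order-of-magnitude estimate only.

```python
def discovery_seconds(num_cores: int, cores_per_second: float = 1000.0) -> float:
    """Estimated time to walk SOLR_HOME and discover every core."""
    return num_cores / cores_per_second


def core_table_bytes(num_cores: int, bytes_per_entry: int = 1024) -> int:
    """Estimated memory for the in-memory table of core descriptors."""
    return num_cores * bytes_per_entry


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):  # illustrative core counts
        low = core_table_bytes(n, 512) / 1_000_000
        high = core_table_bytes(n, 1024) / 1_000_000
        print(f"{n:>7} cores: ~{discovery_seconds(n):.0f} s discovery, "
              f"{low:.1f} to {high:.1f} MB core table")
```

Even at 100,000 cores the table stays near 100 MB under these assumptions, which is consistent with the remark that memory is unlikely to be the practical problem; startup discovery time grows to minutes first.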

But it really does bring us back to the question of whether all these cores are necessary
or not. The "usual" alternative to the LotsOfCores option is to combine
the records into a small number of cores. Without knowing your requirements in detail, I'd
suggest something like a customers core and a products core where, say, each product has a field
with tokens indicating which users have access (or vice versa), possibly using pseudo joins. In one
view, this is an ACL problem, which has several solutions, each with drawbacks of course.
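
To make the two variants concrete, here is a hedged sketch that only builds the filter-query strings. All field and core names ("allowed_users", "allowed_product_ids", "customers") are hypothetical. The first function assumes each product document carries a multi-valued token field listing permitted users; the second uses Solr's join query parser syntax, assuming each customer document lists the product ids it may see.

```python
def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def token_filter(user: str, field: str = "allowed_users") -> str:
    """fq for the products core when products carry access tokens."""
    return f"{field}:{_quote(user)}"


def join_filter(user: str,
                from_index: str = "customers",
                from_field: str = "allowed_product_ids",
                to_field: str = "id",
                user_field: str = "id") -> str:
    """fq for the products core using a pseudo join against customers."""
    return (f"{{!join from={from_field} to={to_field} "
            f"fromIndex={from_index}}}{user_field}:{_quote(user)}")


if __name__ == "__main__":
    print(token_filter("alice"))
    print(join_filter("alice"))
```

Either string would be passed as an fq parameter alongside the user's actual query. Which variant is cheaper depends on how often access lists change and how large they get, which is the kind of drawback mentioned above.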

Or just de-normalize your data entirely and have a core per customer with _all_ the
products indexed into it.

Like I said, I don't know enough details to have a clue whether the data would explode unacceptably.

Anyway, enough on a Sunday morning!


On Sun, Aug 31, 2014 at 8:18 AM, Shawn Heisey wrote:

> On 8/31/2014 8:58 AM, Joseph Obernberger wrote:
> > Could you add another field(s) to your application and use that instead of
> > creating collections/cores?  When you execute a search, instead of picking
> > a core, just search a single large core but add in a field which
> > contains some core ID.
> This is a nice idea.  Have one big collection in your cloud and use an 
> additional field in your queries to filter down to a specific user's data.
> It'd be really nice to write a custom search component that ensures 
> there is a filter query for that specific field, and if it's not 
> present, change the search results to include a document that informs 
> the caller that they're not doing it right.
> (That URL probably won't work correctly on mobile browsers)
> Thanks,
> Shawn
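
The guard Shawn describes would be a server-side custom search component, which is not shown here. As a simpler stand-in, the sketch below does the same check on the client side: it builds query parameters that always carry a filter on a per-tenant field and can verify that an arbitrary parameter set includes one. The field name "tenant_id" is an assumption for illustration.

```python
from typing import Dict, List, Union

Params = Dict[str, Union[str, List[str]]]


def tenant_params(q: str, tenant: str, field: str = "tenant_id") -> Params:
    """Build query parameters restricted to one tenant's documents."""
    value = tenant.replace("\\", "\\\\").replace('"', '\\"')
    return {"q": q, "fq": [f'{field}:"{value}"']}


def has_tenant_filter(params: Params, field: str = "tenant_id") -> bool:
    """Return True if some fq restricts the search to a single tenant."""
    fqs = params.get("fq", [])
    if isinstance(fqs, str):  # a single fq may be given as a plain string
        fqs = [fqs]
    return any(fq.startswith(f"{field}:") for fq in fqs)


def checked(params: Params, field: str = "tenant_id") -> Params:
    """Refuse to send a query that would search across all tenants."""
    if not has_tenant_filter(params, field):
        raise ValueError(f"query is missing a filter on {field!r}")
    return params


if __name__ == "__main__":
    print(checked(tenant_params("title:solr", "acme")))
```

A client-side check like this only protects callers that go through it; enforcing the filter inside Solr, as suggested above, is what would cover every caller.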
